How can I "weight" data points before making a density plot in R?


Let’s say I have some data in a tibble activity with a column activity$time that records the time of day of some events. Suppose this data consists of two different sampling periods, one from 5:00 to 9:00, and one from 7:00 to 11:00. Because these periods overlap, events between 7:00 and 9:00 are over-represented by a factor of 2 compared to the rest. If I were to make a density plot like this:

ggplot(activity) + geom_density(mapping = aes(x = time))

then the center would be skewed upwards compared to what would be a true reflection of reality. How can I tell geom_density() something like "weight this interval by a factor of 0.5", or better yet, provide an arbitrary weighting function?

Here is some code demonstrating the overlap effect. runif() should produce a uniform distribution, but because I have two overlapping sections, there is a higher plateau in the middle:

library(tidyverse)  # provides tibble() and ggplot()

set.seed(27036459)
activity <- tibble(time = c(runif(10000, 5, 9), runif(10000, 7, 11)))
ggplot(activity) + geom_density(mapping = aes(x = time))

What I want is a way to take activity, and using my knowledge of the sampling intervals, somehow adjust the graph to represent the actual distribution of the phenomenon, independent of sampling bias (in this case, the uniformity of runif()).

>Solution :

We can produce a set-up similar to your own by taking 50 samples from the period 5am to 9am and another 50 samples from 7am to 11am like so:

set.seed(1)

activity <- data.frame(time = as.POSIXct("2022-08-05 05:00:00") +
                         c(runif(50, 0, 14400), runif(50, 7200, 21600)))

And we can see this produces the unwanted peak between 7am and 9am:

library(tidyverse)

ggplot(activity) + 
  geom_density(mapping = aes(x = time))

There is no weights argument in geom_density (although it does understand a weight aesthetic), but since the area under the curve is normalized to one, it doesn’t matter whether we halve the weight of values between 7 and 9, or double the weights outside this period: either gives the same result. Doubling is easy to do by duplication: we create a copy of the data frame that keeps only the values outside the 7 to 9 window, then bind this to the original data frame:

library(lubridate)

activity %>%
  # hour() returns the whole hour, so 9:00-9:59 has hour 9 and needs >= to be kept
  filter(hour(time) < 7 | hour(time) >= 9) %>%
  bind_rows(activity) %>%
  ggplot() +
  geom_density(mapping = aes(x = time))
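
As an alternative to duplicating rows, here is a minimal sketch that uses the weight aesthetic of geom_density() together with an arbitrary weighting function, which is what the question asks for. The helper coverage(), the columns hour_of_day and w, and the tz = "UTC" setting are assumptions made for this illustration and are not part of the original answer. Each point is weighted by the inverse of the number of sampling windows that cover it, and the weights are rescaled to sum to one in case the installed ggplot2 version does not normalize them itself:

```r
library(ggplot2)

set.seed(1)

# Same simulated data as above: offsets in seconds after 05:00.
# tz = "UTC" is an assumption added here so the hours are unambiguous.
start  <- as.POSIXct("2022-08-05 05:00:00", tz = "UTC")
offset <- c(runif(50, 0, 14400), runif(50, 7200, 21600))
activity <- data.frame(time = start + offset)

# Hypothetical helper: how many sampling windows cover a given hour of day?
# The windows [5, 9) and [7, 11) are taken from the question.
coverage <- function(h) {
  (h >= 5 & h < 9) + (h >= 7 & h < 11)
}

# Decimal hour of day, then inverse-coverage weights rescaled to sum to 1
activity$hour_of_day <- 5 + offset / 3600
activity$w <- 1 / coverage(activity$hour_of_day)
activity$w <- activity$w / sum(activity$w)

p <- ggplot(activity) +
  geom_density(mapping = aes(x = time, weight = w))
p
```

To adapt this to other sampling schemes, only coverage() should need to change, for example by adding a term for each additional window.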

Created on 2022-08-05 by the reprex package (v2.0.1)
