I have a large dataframe (Merged) of several thousand lines and 20 columns. Each line is a record of a site visit. There are 400 sites in the dataframe and each has multiple lines for visits on different months. . Example:
| Site | Date | Method | Result | ChangeDate |
|---|---|---|---|---|
| 2200 | 01/07/2021 | 24 | 0.22 | 01/07/2021 |
| 2201 | 03/07/2021 | 21 | 0.01 | 01/01/2021 |
| 2203 | 03/07/2021 | 21 | 0.02 | 01/01/2021 |
| 2200 | 01/08/2021 | 21 | 0.05 | 01/07/2021 |
| 2201 | 03/08/2021 | 21 | 0.01 | 01/01/2021 |
| 2203 | 03/08/2021 | 21 | 0.02 | 01/01/2021 |
| 2200 | 01/09/2021 | 21 | 0.05 | 01/07/2021 |
And so on for all 400 sites over two years.
I need to calculate the mean result for each site, which I can do:
OPmean <- aggregate(Merged$Result, list(Merged$Site), FUN=mean, na.rm = TRUE
But I only want to take the means after the change date. So on site 2200 I would want to take the mean on these lines only:
| Site | Date | Method | Result | ChangeDate |
|---|---|---|---|---|
| 2200 | 01/08/2021 | 21 | 0.05 | 01/07/2021 |
| 2200 | 01/09/2021 | 21 | 0.05 | 01/07/2021 |
Leaving the first line for 2200 off because the date occurs on (or before) the change date.
For an output I would want a new dataframe something like this:
| Site | Mean Result |
|---|---|
| 2200 | 0.05 |
| 2201 | 0.01 |
| 2203 | 0.02 |
>Solution :
@user2974951 gives a great base R answer. A tidyverse approach might be
library(dplyr)
Merged %>%
group_by(Site) %>%
filter(Date > ChangeDate) %>%
summarize(mr = mean(Result) # or optionally, mean(Result, na.rm = TRUE)