Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Group by unique ids to calculate number of days in R

i want to find how many days a unique id has worked in this df by abstracting the max date minus min date.

structure(list(id = c("f1", "f2", "f2", "f4", "f5", "f2", "f3", 
"f1", "f1", "f3", "f4", "f2", "f2", "f2", "f2"), 
 date = structure(c(18687, 18687, 18687, 18687, 18687, 
18687, 18687, 18687, 18687, 18687, 18687, 18687, 18687, 18687, 
18687), class = "Date")), row.names = c(2L, 4L, 8L, 15L, 17L, 
18L, 21L, 25L, 36L, 37L, 38L, 40L, 42L, 48L, 52L), class = "data.frame")

I have tried this but I get weird numbers

df_total_days_per_id<-df %>%
  group_by(id) %>%
  mutate(xx1 = max(date)-min(date)) %>% #by user find range
  group_by(id) %>%
  summarise(number_of_days = sum(xx1)) 

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

Based on your code I can make an estimate of why you are getting "weird numbers".

The function mutate() adds a new column to the existing table. Let’s forget about groups for a moment and assume the following dates:

2022-01-01 
2022-01-11 
2022-01-21

You want the answer to be 20, right? 21 minus 1. However with your mutate() function you create this:

2022-01-01   20 days
2022-01-11   20 days
2022-01-21   20 days

and then in the last operation you sum all these 20s to reach 60 days. Obviously this is wrong.

Instead of using mutate() we use summarize() like so:

df %>%
  group_by(id) %>%
  summarize(number_of_days = max(date)-min(date)+1) #added +1 for correction 

after this there is no need to add another group_by() or sum()

Edit:

I noticed that you are talking about the number of days worked. So if we take the example above you might instead want to output 3 instead of 20 because there are days inbetween that were skipped. If this is the case you should use a different function in your summarize() altogether.

For this we can use the function n_distinct() which counts the amount of unique values in a group, like so:

df %>%
  group_by(id) %>%
  summarise(number_of_days = n_distinct(date)) 
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading