summarise with dplyr and a condition

May 5, 2023

I have the following data frame:

df <- read.table(text = "group  age status
A   25  yes
A   32  no
A   58  yes
A   78  no
A   76  yes
B   21  no
B   71  yes
B   43  no
B   48  no
C   39  no
C   82  yes
C   87  no
C   91  yes", header = TRUE)

I want to group by the group column and then summarise as follows: count the rows where age is less than 50 and status is "yes", and also count the total number of rows where age is less than 50.
So for ‘A’ in the data frame:

age_lt_50_yes = 1
age_lt_50 = 2

Similarly, count the rows where age is greater than 50 and status is "yes", and the total number of rows where age is greater than 50.
For ‘A’ this will be:

age_gt_50_yes = 2
age_gt_50 = 3

What I actually want is the ratio age_lt_50_yes/age_lt_50, which for ‘A’ is 1/2, and the ratio age_gt_50_yes/age_gt_50, which for ‘A’ is 2/3.

If I ever end up with 0/0, I just want 0 as the output.

How can I do this using dplyr?

Here is what I have tried:

df %>% 
  group_by(group) %>% 
  summarize(age_le50_prop = sum(age <= 50) / n(),
            age_gt50_prop = sum(age > 50) / n())

This only gives the share of rows in each age bucket; I still need to bring status into the summarize() call.

My output should be like this:

group  age_lt_50  age_gt_50
A   0.5 0.66
B   0   1
C   0   0.66

Solution:

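You can use reframe() (available, together with the .by argument, from dplyr 1.1.0) and define each ratio directly as "count of yes in the age bucket" divided by "count of rows in the age bucket". In R, 0/0 evaluates to NaN, and is.na() is TRUE for NaN, so the trailing replace(is.na(.), 0) turns any 0/0 result into 0.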

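library(dplyr)
reframe(
  df,
  # share of "yes" among rows with age < 50
  age_lt_50 = sum(age < 50 & status == "yes") / sum(age < 50),
  # share of "yes" among rows with age >= 50
  age_gt_50 = sum(age >= 50 & status == "yes") / sum(age >= 50),
  .by = group
) %>%
  replace(is.na(.), 0)  # 0/0 yields NaN; set it to 0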

Output:

  group age_lt_50 age_gt_50
1     A       0.5 0.6666667
2     B       0.0 1.0000000
3     C       0.0 0.6666667
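The sample data never actually produces 0/0, because every group has rows on both sides of 50. As a rough sketch of what the replace() step does when it is needed, here is a made-up data frame (df2 and its group D are hypothetical, not part of the question) in which one group has no rows under 50. It assumes dplyr 1.1.0 or later.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for reframe() and .by

# Hypothetical data: group D has no rows with age < 50
df2 <- data.frame(
  group  = c("A", "A", "D", "D"),
  age    = c(25, 58, 60, 70),
  status = c("yes", "no", "yes", "no")
)

raw <- reframe(
  df2,
  age_lt_50 = sum(age < 50 & status == "yes") / sum(age < 50),
  age_gt_50 = sum(age >= 50 & status == "yes") / sum(age >= 50),
  .by = group
)

raw$age_lt_50         # the value for group D is NaN, because it is 0/0
is.na(raw$age_lt_50)  # is.na() is TRUE for NaN as well as for NA

# Same replacement as in the answer, written without the pipe
fixed <- replace(raw, is.na(raw), 0)
fixed$age_lt_50       # the NaN for group D is now 0
```

Note that replace(is.na(.), 0) would also overwrite a genuine NA, for example one caused by a missing age, so it is worth checking that the input has no missing values if that distinction matters.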

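Here is an alternative that uses group_by() and summarize() instead of reframe(); since each group produces exactly one row, summarize() is sufficient. Independently of that, it also shows another way to handle a possible 0 in the denominator: check the denominator first and return 0 directly.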

df %>%
  group_by(group) %>%
  summarize(
    # return 0 when the age bucket is empty, otherwise the share of "yes"
    age_lt_50 = if (sum(age < 50) == 0) 0 else sum(age < 50 & status == "yes") / sum(age < 50),
    age_gt_50 = if (sum(age >= 50) == 0) 0 else sum(age >= 50 & status == "yes") / sum(age >= 50)
  )
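If the same guarded ratio is needed for several buckets, the repeated if/else can be moved into a small helper. The following is only a sketch: safe_share() is a hypothetical name introduced here, not a dplyr function, and the data frame is the one from the question.

```r
library(dplyr)

# Hypothetical helper: share of rows with `hit` TRUE among rows where `keep`
# is TRUE; returns 0 when no row satisfies `keep` (the 0/0 case)
safe_share <- function(hit, keep) {
  n <- sum(keep)
  if (n == 0) 0 else sum(hit & keep) / n
}

df <- read.table(text = "group  age status
A   25  yes
A   32  no
A   58  yes
A   78  no
A   76  yes
B   21  no
B   71  yes
B   43  no
B   48  no
C   39  no
C   82  yes
C   87  no
C   91  yes", header = TRUE)

res <- df %>%
  group_by(group) %>%
  summarise(
    age_lt_50 = safe_share(status == "yes", age < 50),
    age_gt_50 = safe_share(status == "yes", age >= 50)
  )
res
```

On the question's data this should give the same ratios as the output shown above: 0.5 and 2/3 for A, 0 and 1 for B, 0 and 2/3 for C.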