summarise with dplyr and a condition

I have the following data frame:

df <- read.table(text = "group  age status
A   25  yes
A   32  no
A   58  yes
A   78  no
A   76  yes
B   21  no
B   71  yes
B   43  no
B   48  no
C   39  no
C   82  yes
C   87  no
C   91  yes", header = T)

I want to group by the group column and then, within each group, count the rows where age is less than 50 and status is "yes", and also count the total number of rows where age is less than 50.
So for ‘A’ in the data frame:

age_lt_50_yes = 1
age_lt_50 = 2

Similarly, for ages greater than 50, I want the count of rows with status "yes" and the total count of rows.
For ‘A’, this will be:

age_gt_50_yes = 2
age_gt_50 = 3

What I actually want are the ratios age_lt_50_yes/age_lt_50 and age_gt_50_yes/age_gt_50. For ‘A’ these are 1/2 and 2/3.

If a group ends up with 0/0, I just want 0 as the output.

How can I do this using dplyr?

Here is what I have tried:

df %>% 
  group_by(group) %>% 
  summarize(age_le50_prop = sum(age <= 50) / n(),
            age_gt50_prop = sum(age > 50) / n())

I still need to bring status into the summarize() call.

My output should be like this:

group   age_lt_50  age_gt_50
A   0.5 0.66
B   0   1
C   0   0.66

Solution:

You can use reframe() (available from dplyr 1.1.0, as is the .by argument) and simply define the two ratios you need. A 0/0 evaluates to NaN in R, and the final replace() step turns it into 0.

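library(dplyr)  # reframe() and .by need dplyr >= 1.1.0
reframe(
  df,
  # share of "yes" among ages below 50, then among ages 50 and over
  age_lt_50 = sum(age < 50 & status == "yes") / sum(age < 50),
  age_gt_50 = sum(age >= 50 & status == "yes") / sum(age >= 50),
  .by = group
) %>%
  replace(is.na(.), 0)  # 0/0 gives NaN, which is.na() catches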

Output:

  group age_lt_50 age_gt_50
1     A       0.5 0.6666667
2     B       0.0 1.0000000
3     C       0.0 0.6666667
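
With the data above, no group has an empty age bracket, so the replace() step never changes anything. As a minimal sketch of what it guards against, the snippet below builds a hypothetical one-row result for a made-up group D (not in the original data) whose under-50 bracket is empty. In R, 0/0 evaluates to NaN, and is.na() returns TRUE for NaN as well as NA. The example uses base R only, so it writes the replacement without the pipe.

```r
# Hypothetical result row: group D has nobody under 50, so its ratio is 0/0
raw <- data.frame(
  group     = "D",
  age_lt_50 = 0 / 0,  # NaN
  age_gt_50 = 1 / 2
)

is.nan(raw$age_lt_50)  # the empty bracket produced NaN
is.na(raw$age_lt_50)   # is.na() flags NaN as well as NA

# Same replacement as in the pipeline, written without the pipe:
# every NA/NaN cell becomes 0, all other cells are left alone
clean <- replace(raw, is.na(raw), 0)
clean
```

Note that this replaces every missing value in the result, not only those caused by an empty bracket, which is fine here because the ratios are the only columns that can be missing.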

Here is an alternative that uses group_by() and summarize() instead of reframe(). Independently of that choice, it also shows another way to handle a zero denominator: check for it before dividing.

df %>%
  group_by(group) %>%
  summarize(
    age_lt_50 = if (sum(age < 50) == 0) 0 else sum(age < 50 & status == "yes") / sum(age < 50),
    age_gt_50 = if (sum(age >= 50) == 0) 0 else sum(age >= 50 & status == "yes") / sum(age >= 50)
  )
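
The summarize() version computes each denominator more than once. One possible way to avoid that, sketched here with a hypothetical helper named safe_ratio() (it is not part of dplyr or of the original answer), is to keep the zero check in a single function:

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): share of TRUE values in `hit`
# among the rows selected by `in_bracket`, or 0 when the bracket is empty
safe_ratio <- function(hit, in_bracket) {
  n_bracket <- sum(in_bracket)
  if (n_bracket == 0) 0 else sum(hit & in_bracket) / n_bracket
}

# Same data as in the question
df <- read.table(text = "group age status
A 25 yes
A 32 no
A 58 yes
A 78 no
A 76 yes
B 21 no
B 71 yes
B 43 no
B 48 no
C 39 no
C 82 yes
C 87 no
C 91 yes", header = TRUE)

result <- df %>%
  group_by(group) %>%
  summarize(
    age_lt_50 = safe_ratio(status == "yes", age < 50),
    age_gt_50 = safe_ratio(status == "yes", age >= 50)
  )
result
```

The helper takes two logical vectors, so the same function covers any other bracket or condition without another copy of the if/else.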