Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

What is the simplest way to compute the average of one variable grouped by a second variable, iterating over all second variables dplyr?

I have a data frame with a large number of variables, one of them, the probability of death to be predicted by all others.
As a preliminary step I want to compute the PoD by counting the death rate in bins of each variable.

let’s say df <- (age = c(25, 57, 60), weight = (80, 92, 61), cigarettes_a_day = c(30, 2, 19), death_flag=c(1,0,1))

Then I can group by age (say under 50 and over 50) and compute the PoD as the death rate of one group as the count of death_flags divided by the number of people falling into the group, or simply the average death_flag. When grouping by weight(say below and above 80) I will obtain a different death rate and thus a different PoD, for each binned variable, which is what I want. My problem arises when trying to iterate through all variables.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

So far I’ve tried variants of the following piece of code, which however does not work:

for(n in names(df)) {

    df%>% group_by(n)%>%
      summarise(PoD_bin = mean(death_flag))
}

I haven’t figured out a way to run through all variables and perform the computation.

As a side note, the binning of variables I have done without dplyr by:

for(v in names(df[-1])){
    newVar <- paste(f, "bin", sep = "_")
    df[newVar] <- cut(as.matrix(df[v]), breaks = 100)
}

I am irritated, that I cannot refer to the variables in the first for loop for the grouping, while I can do so in the second to create new columns of the df.

Help is greatly appreciated!

>Solution :

Your loop doesn’t work because a character is parsed to group_by. You could modify your loop a little bit and get the desired result. I have added print() to see the output.

for (n in names(df)) {
  
  df |>
    group_by(!!sym(n)) |>
    summarise(PoD_bin = mean(death_flag)) |>
    print()
  
}

Output:

# A tibble: 3 × 2
    age PoD_bin
  <dbl>   <dbl>
1    25       1
2    57       0
3    60       1
# A tibble: 3 × 2
  weight PoD_bin
   <dbl>   <dbl>
1     61       1
2     80       1
3     92       0
# A tibble: 3 × 2
  cigarettes_a_day PoD_bin
             <dbl>   <dbl>
1                2       0
2               19       1
3               30       1
# A tibble: 2 × 2
  death_flag PoD_bin
       <dbl>   <dbl>
1          0       0
2          1       1

Data:

df <- tibble(age = c(25, 57, 60), weight = c(80, 92, 61), cigarettes_a_day = c(30, 2, 19), death_flag=c(1,0,1))
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading