Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Mean and sd per group for multiple variables when NAs present

I would like to create a table of mean and sd for multiple variables for grouped data. However, the data has NAs, so I need to include the na.rm =T command.

Using iris as a MWE, altered to include NAs:

irisalt = iris
irisalt[1,1] =NA
irisalt[52,2] =NA
irisalt[103,3]= NA

First attempt:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

irisalt%>%
  group_by(Species)%>%
  summarise(count = n(),
            across(contains("."), c("mean" = mean, "sd" = sd))
  )

 Species    count Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean Sepal.Width_sd Petal.Length_mean Petal.Length_sd
  <fct>      <int>             <dbl>           <dbl>            <dbl>          <dbl>             <dbl>           <dbl>
1 setosa        50             NA             NA                 3.43          0.379              1.46           0.174
2 versicolor    50              5.94           0.516            NA            NA                  4.26           0.470
3 virginica     50              6.59           0.636             2.97          0.322             NA             NA    

This is the table I need, but I want to calculate the means and sd by removing the NA.

Second attempt:

irisalt%>%
  group_by(Species)%>%
  drop_na()%>%
  summarise(count = n(),
            across(contains("."), c("mean" = mean, "sd" = sd))
  )

This removes the whole rows where there are NA, and so alters the means for variables where there is data.

3rd attempt:

irisalt%>%
  group_by(Species)%>%
  summarise(count = n(),
            across(contains("."), c("mean" = mean(., na.rm = T), "sd" = sd(., na.rm =T)))
  )

Error in `summarise()`:
i In argument: `across(...)`.
Caused by error in `is.data.frame()`:
! 'list' object cannot be coerced to type 'double'

Fourth attempt:

irisalt%>%
  group_by(Species)%>%
  summarise(count = n(),
            across(contains("."), ~c("mean" = mean(., na.rm = T), "sd" = sd(., na.rm =T)))
  )

Species    count Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>      <int>        <dbl>       <dbl>        <dbl>       <dbl>
1 setosa        50        5.00        3.43         1.46        0.246
2 setosa        50        0.356       0.379        0.174       0.105
3 versicolor    50        5.94        2.76         4.26        1.33 
4 versicolor    50        0.516       0.311        0.470       0.198
5 virginica     50        6.59        2.97         5.54        2.03 
6 virginica     50        0.636       0.322        0.555       0.275
Warning message:
Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
i Please use `reframe()` instead.
i When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns an ungrouped data frame and adjust
  accordingly.

These are the numbers I need, but I need a single row per group (Species), with a separate column for mean and sd for each variable, like in my first attempt

>Solution :

Looks like you are close, but your syntax is a bit off:

library(dplyr)

irisalt %>%
  group_by(Species) %>%
  summarise(count = n(),
            across(contains("."), list(mean = ~ mean(., na.rm = T), 
                                       sd = ~ sd(., na.rm =T)))
  )

From the documentation ?across

A named list of functions or lambdas, e.g. list(mean = mean, n_miss = ~ sum(is.na(.x)). Each function is applied to each column, and the output is named by combining the function name and the column name using the glue specification in .names.

Note: as of dplyr 1.1.0 summarize now has a .by argument that is experimental but allows for one-time group computations like you have here. So you do not need to pipe to group_by.

Output

 Species    count Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean Sepal.Width_sd Petal.Length_mean Petal.Length_sd Petal.Width_mean Petal.Width_sd
  <fct>      <int>             <dbl>           <dbl>            <dbl>          <dbl>             <dbl>           <dbl>            <dbl>          <dbl>
1 setosa        50              5.00           0.356             3.43          0.379              1.46           0.174            0.246          0.105
2 versicolor    50              5.94           0.516             2.76          0.311              4.26           0.470            1.33           0.198
3 virginica     50              6.59           0.636             2.97          0.322              5.54           0.555            2.03           0.275
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading