Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Using summarize_all() to get a count of values in multiple columns

I want to create a table that counts up the number of non-NA values in each column of a dataset. I’m using summarize_all(), but I’m struggling to get a count of non-NA values.

I referred to this StackOverflow thread to get some insight, but it didn’t seem to help me obtain a count: summarize_all with "n()" function.
I’ll either get a count of distinct values if I pass summarize_all(n_distinct) or I’ll get the number of rows if I pass summarize_all(list(n=~n())) or summarize_all(list(n="length") as the thread cites.

My desired output:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

ID Female  Male Non_Binary
 5      5     4          3

Where am I going wrong?

# Sample Code

test<-as_tibble(data.frame(`ID` = c("1","2","3","4","5"),
                           `Female` = c("Female","Female","Female","Female","Female"),
                           `Male` = c(NA,"Male","Male","Male","Male"),
                           `Non_Binary`=c("Non-Binary","Non-Binary","Non-Binary",NA,NA)))

## Attempt 1

summary<-test%>%
  summarize_all(list(n=~n()))

# A tibble: 1 × 4
   ID_n Female_n Male_n Non_Binary_n
  <int>    <int>  <int>        <int>
1     5        5      5            5

## Attempt 2

summary<-test%>%
  summarize_all(list(n="length"))

# A tibble: 1 × 4
   ID_n Female_n Male_n Non_Binary_n
  <int>    <int>  <int>        <int>
1     5        5      5            5

## Attempt 3

summary<-test%>%
  summarize_all(n_distinct)

# A tibble: 1 × 4
     ID Female  Male Non_Binary
  <int>  <int> <int>      <int>
1     5      1     2          2

### Desired Output

ID Female  Male Non_Binary
 5      5     4          3

>Solution :

n() and length() are poor choices for this – they don’t omit NA values from what they count. The classic way to count values that meet a conditions (such as "not NA") is to sum the condition.

summarize_all() has also been deprecated for a few years. Now it is preferred to use across().

test |>
  summarize(across(everything(), \(x) sum(!is.na(x))))

# # A tibble: 1 × 4
#      ID Female  Male Non_Binary
#   <int>  <int> <int>      <int>
# 1     5      5     4          3
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading