Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

R Group By and Sum to Ignore NA

I have a data frame like this:

library(dplyr)

name <- c("Bob", "Bob", "Bob", "Bob", "John", "John", "John")
count <- c(2, 3, 4, 5, 2, 3, 4)
score <- c(5, NA, NA, NA, 3, 4, 2)

my_df <- data.frame(cbind(name, count, score)) %>%
  mutate(count = as.numeric(count),
         score = as.numeric(score))
my_df

  name count score
1  Bob     2     5
2  Bob     3    NA
3  Bob     4    NA
4  Bob     5    NA
5 John     2     3
6 John     3     4
7 John     4     2

Then I create another column by taking the product between count and score:

my_df %>%
  mutate(product = count*score) 

  name count score product
1  Bob     2     5      10
2  Bob     3    NA      NA
3  Bob     4    NA      NA
4  Bob     5    NA      NA
5 John     2     3       6
6 John     3     4      12
7 John     4     2       8

I want to group by name and aggregate for the sum(product)/sum(count) but I want the sum of product column to ignore any NA values in the sum (I did this below) AND I want any associated count values to be ignored in the summation. This is my current solution, but it is not right. Bob’s result is calculated as 10/(2+3+4+5) = 0.71 but I want Bob’s result to be 10/2 = 5.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

my_df %>%
  mutate(product = count*score)
  group_by(name) %>%
  summarize(result = sum(product, na.rm = TRUE)/sum(count))

  name  result
  <chr>  <dbl>
1 Bob    0.714
2 John   2.89

>Solution :

We may need to subset the count by the non-NA values in ‘product’

library(dplyr)
my_df %>%
   mutate(product = count*score) %>% 
   group_by(name) %>% 
   summarise(result = sum(product, na.rm = TRUE)/sum(count[!is.na(product)]))

-output

# A tibble: 2 × 2
  name  result
  <chr>  <dbl>
1 Bob     5   
2 John    2.89

Or do a filter before the grouping

my_df %>% 
   filter(complete.cases(score)) %>% 
   group_by(name) %>% 
   summarise(result = sum(score * count)/sum(count))
# A tibble: 2 × 2
  name  result
  <chr>  <dbl>
1 Bob     5   
2 John    2.89
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading