R Group By and Sum to Ignore NA

September 6, 2022

I have a data frame like this:

library(dplyr)

name <- c("Bob", "Bob", "Bob", "Bob", "John", "John", "John")
count <- c(2, 3, 4, 5, 2, 3, 4)
score <- c(5, NA, NA, NA, 3, 4, 2)

my_df <- data.frame(cbind(name, count, score)) %>%
  mutate(count = as.numeric(count),
         score = as.numeric(score))
my_df

  name count score
1  Bob     2     5
2  Bob     3    NA
3  Bob     4    NA
4  Bob     5    NA
5 John     2     3
6 John     3     4
7 John     4     2

Then I create another column by taking the product between count and score:

my_df %>%
  mutate(product = count*score) 

  name count score product
1  Bob     2     5      10
2  Bob     3    NA      NA
3  Bob     4    NA      NA
4  Bob     5    NA      NA
5 John     2     3       6
6 John     3     4      12
7 John     4     2       8

I want to group by name and aggregate for the sum(product)/sum(count) but I want the sum of product column to ignore any NA values in the sum (I did this below) AND I want any associated count values to be ignored in the summation. This is my current solution, but it is not right. Bob’s result is calculated as 10/(2+3+4+5) = 0.71 but I want Bob’s result to be 10/2 = 5.

my_df %>%
  mutate(product = count*score)
  group_by(name) %>%
  summarize(result = sum(product, na.rm = TRUE)/sum(count))

  name  result
  <chr>  <dbl>
1 Bob    0.714
2 John   2.89

>Solution :

We may need to subset the count by the non-NA values in ‘product’

library(dplyr)
my_df %>%
   mutate(product = count*score) %>% 
   group_by(name) %>% 
   summarise(result = sum(product, na.rm = TRUE)/sum(count[!is.na(product)]))

-output

# A tibble: 2 × 2
  name  result
  <chr>  <dbl>
1 Bob     5   
2 John    2.89

Or do a filter before the grouping

my_df %>% 
   filter(complete.cases(score)) %>% 
   group_by(name) %>% 
   summarise(result = sum(score * count)/sum(count))
# A tibble: 2 × 2
  name  result
  <chr>  <dbl>
1 Bob     5   
2 John    2.89