Home How to count distinct values for multiple columns in R for a grouped data frame

Questions

How to count distinct values for multiple columns in R for a grouped data frame

March 17, 2023

I have a data frame like this:

v_1 <- c("1a", "1b","1c", "2a", "2b", "2c", "3a", "3b","3c", "4a", "4b", "4c")
v_2 <- c(1,1,1,2,2,2,3,3,3,4,4,4)
v_3 <- c("dog", "dog", "dog", "dog", "dog", "dog", "cat", "cat", "cat", "cat", "cat", "cat")
v_4 <- c(1:12)

df <- data.frame(v_1, v_2, v_3, v_4)
df
   v_1 v_2 v_3 v_4
1   1a   1 dog   1
2   1b   1 dog   2
3   1c   1 dog   3
4   2a   2 dog   4
5   2b   2 dog   5
6   2c   2 dog   6
7   3a   3 cat   7
8   3b   3 cat   8
9   3c   3 cat   9
10  4a   4 cat  10
11  4b   4 cat  11
12  4c   4 cat  12

I want to group this data frame and count the distinct values for v_1 and v_2. If I am just interested in the count in v_1 it is quite easy. I just do:

library(dplyr)

df_grouped <- df %>% 
  group_by(v_3) %>% 
  summarise(v_4_sum = sum(v_4), 
            v_1_count = n())

  v_3   v_4_sum v_1_count
  <chr>   <int>     <int>
1 cat        57         6
2 dog        21         6

If I want to also see te count of v_2 it seems like I have to use group_by two times like this:

df_grouped_v2 <- df %>% 
  group_by(v_2, v_3) %>% 
  summarise(v_4_sum = sum(v_4), 
            v_1_count = n())

df_grouped_v22 <- df_grouped_v2 %>% 
  group_by(v_3) %>% 
  summarise(v_4_sum = sum(v_4_sum), 
            v_1_count = sum(v_1_count), 
            v_2_count = n())

df_grouped_v22

     v_3   v_4_sum v_1_count v_2_count
      <chr>   <int>     <int>     <int>
    1 cat        57         6         2
    2 dog        21         6         2

This is the result I want but it seems not straight forward. Especially if I have a huge data frame the group_by operation is time intensive and I would prefer to use it just one time.

>Solution :

For distinct values, you can use n_distinct() rather than n():

library(dplyr)

df |>
  summarise(v_4_sum = sum(v_4),
            across(c(v_1, v_2), n_distinct, .names = "{.col}_count"),
            .by = v_3)

  v_3 v_4_sum v_1_count v_2_count
1 dog      21         6         2
2 cat      57         6         2