Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to count distinct values for multiple columns in R for a grouped data frame

I have a data frame like this:

v_1 <- c("1a", "1b","1c", "2a", "2b", "2c", "3a", "3b","3c", "4a", "4b", "4c")
v_2 <- c(1,1,1,2,2,2,3,3,3,4,4,4)
v_3 <- c("dog", "dog", "dog", "dog", "dog", "dog", "cat", "cat", "cat", "cat", "cat", "cat")
v_4 <- c(1:12)

df <- data.frame(v_1, v_2, v_3, v_4)
df
   v_1 v_2 v_3 v_4
1   1a   1 dog   1
2   1b   1 dog   2
3   1c   1 dog   3
4   2a   2 dog   4
5   2b   2 dog   5
6   2c   2 dog   6
7   3a   3 cat   7
8   3b   3 cat   8
9   3c   3 cat   9
10  4a   4 cat  10
11  4b   4 cat  11
12  4c   4 cat  12

I want to group this data frame and count the distinct values for v_1 and v_2. If I am just interested in the count in v_1 it is quite easy. I just do:

library(dplyr)

df_grouped <- df %>% 
  group_by(v_3) %>% 
  summarise(v_4_sum = sum(v_4), 
            v_1_count = n())

  v_3   v_4_sum v_1_count
  <chr>   <int>     <int>
1 cat        57         6
2 dog        21         6

If I want to also see te count of v_2 it seems like I have to use group_by two times like this:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df_grouped_v2 <- df %>% 
  group_by(v_2, v_3) %>% 
  summarise(v_4_sum = sum(v_4), 
            v_1_count = n())

df_grouped_v22 <- df_grouped_v2 %>% 
  group_by(v_3) %>% 
  summarise(v_4_sum = sum(v_4_sum), 
            v_1_count = sum(v_1_count), 
            v_2_count = n())

df_grouped_v22

     v_3   v_4_sum v_1_count v_2_count
      <chr>   <int>     <int>     <int>
    1 cat        57         6         2
    2 dog        21         6         2

This is the result I want but it seems not straight forward. Especially if I have a huge data frame the group_by operation is time intensive and I would prefer to use it just one time.

>Solution :

For distinct values, you can use n_distinct() rather than n():

library(dplyr)

df |>
  summarise(v_4_sum = sum(v_4),
            across(c(v_1, v_2), n_distinct, .names = "{.col}_count"),
            .by = v_3)

  v_3 v_4_sum v_1_count v_2_count
1 dog      21         6         2
2 cat      57         6         2
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading