Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to aggregate a data frame based on the max value of the group in R

I have a large data with many groups that looks like this.
I want in each group to use the fruit with the most counts as the central fruit,
and aggregate the other fruits based on it!

library(tidyverse)

df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
             col2 = c("pple","app","app", "bananna", "banan", "banan"), 
             counts_col1 = c(100,100,2,200,200,2),
             counts_col2 = c(2,50,50,2,20,20),
             id=c(1,1,1,2,2,2))

df
#> # A tibble: 6 × 5
#>   col1    col2    counts_col1 counts_col2    id
#>   <chr>   <chr>         <dbl>       <dbl> <dbl>
#> 1 apple   pple            100           2     1
#> 2 apple   app             100          50     1
#> 3 pple    app               2          50     1
#> 4 banana  bananna         200           2     2
#> 5 banana  banan           200          20     2
#> 6 bananna banan             2          20     2

Created on 2022-03-16 by the reprex package (v2.0.1)

I want my data frame to look like this

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

id  central_fruit   fruits                 counts     sum_counts
 1     apple        apple,pple,app         100,50,2        152
 2    banana        banana,bananna,banan   200,20,2        222

The format of the output it does not have to be like this. This is just an example. It can be a list of characters or just characters.
Any help or guidance is appreciated

>Solution :

We may do this by first reshaping to ‘long’ format (pivot_longer), grouped by ‘id’, ‘grp’, create a frequency count (add_count), then summarise the ‘central_fruit’ which had the max frequency by ‘id’, and similarly paste (toString) the unique fruit, and unique count along with sum of unique count

library(dplyr)
library(stringr)
library(tidyr)
df %>%
   rename_with(~ str_c("fruit_", .x), starts_with('col')) %>% 
   pivot_longer(cols = -id, names_to = c(".value", "grp"), 
     names_pattern = "(.*)_(col\\d+)") %>% 
   group_by(id, grp) %>%
   add_count(fruit) %>%
   group_by(id) %>% 
   summarise(central_fruit = fruit[which.max(n)], 
      fruits = toString(unique(fruit)), 
      sum_counts = sum(unique(counts)),
      counts = toString(sort(unique(counts), decreasing = TRUE)),
        .groups = 'drop' ) %>%
     relocate(counts, .before = 'sum_counts')

-output

# A tibble: 2 × 5
     id central_fruit fruits                 counts     sum_counts
  <dbl> <chr>         <chr>                  <chr>           <dbl>
1     1 apple         apple, pple, app       100, 50, 2        152
2     2 banana        banana, bananna, banan 200, 20, 2        222

NOTE: It may be better to wrap the values of ‘counts’ in a list instead of pasteing. i.e. instead of counts = toString(sort(unique(counts), decreasing = TRUE)), it would be
counts = list(sort(unique(counts), decreasing = TRUE))

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading