How to aggregate a data frame based on the max value of the group in R

March 16, 2022

I have a large data with many groups that looks like this.
I want in each group to use the fruit with the most counts as the central fruit,
and aggregate the other fruits based on it!

library(tidyverse)

df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
             col2 = c("pple","app","app", "bananna", "banan", "banan"), 
             counts_col1 = c(100,100,2,200,200,2),
             counts_col2 = c(2,50,50,2,20,20),
             id=c(1,1,1,2,2,2))

df
#> # A tibble: 6 × 5
#>   col1    col2    counts_col1 counts_col2    id
#>   <chr>   <chr>         <dbl>       <dbl> <dbl>
#> 1 apple   pple            100           2     1
#> 2 apple   app             100          50     1
#> 3 pple    app               2          50     1
#> 4 banana  bananna         200           2     2
#> 5 banana  banan           200          20     2
#> 6 bananna banan             2          20     2

^{Created on 2022-03-16 by the reprex package (v2.0.1)}

I want my data frame to look like this

id  central_fruit   fruits                 counts     sum_counts
 1     apple        apple,pple,app         100,50,2        152
 2    banana        banana,bananna,banan   200,20,2        222

The format of the output it does not have to be like this. This is just an example. It can be a list of characters or just characters.
Any help or guidance is appreciated

>Solution :

We may do this by first reshaping to ‘long’ format (pivot_longer), grouped by ‘id’, ‘grp’, create a frequency count (add_count), then summarise the ‘central_fruit’ which had the max frequency by ‘id’, and similarly paste (toString) the unique fruit, and unique count along with sum of unique count

library(dplyr)
library(stringr)
library(tidyr)
df %>%
   rename_with(~ str_c("fruit_", .x), starts_with('col')) %>% 
   pivot_longer(cols = -id, names_to = c(".value", "grp"), 
     names_pattern = "(.*)_(col\\d+)") %>% 
   group_by(id, grp) %>%
   add_count(fruit) %>%
   group_by(id) %>% 
   summarise(central_fruit = fruit[which.max(n)], 
      fruits = toString(unique(fruit)), 
      sum_counts = sum(unique(counts)),
      counts = toString(sort(unique(counts), decreasing = TRUE)),
        .groups = 'drop' ) %>%
     relocate(counts, .before = 'sum_counts')

-output

# A tibble: 2 × 5
     id central_fruit fruits                 counts     sum_counts
  <dbl> <chr>         <chr>                  <chr>           <dbl>
1     1 apple         apple, pple, app       100, 50, 2        152
2     2 banana        banana, bananna, banan 200, 20, 2        222

NOTE: It may be better to wrap the values of ‘counts’ in a list instead of pasteing. i.e. instead of counts = toString(sort(unique(counts), decreasing = TRUE)), it would be
counts = list(sort(unique(counts), decreasing = TRUE))