I have a large data with many groups that looks like this.
I want in each group to use the fruit with the most counts as the central fruit,
and aggregate the other fruits based on it!
library(tidyverse)
df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
col2 = c("pple","app","app", "bananna", "banan", "banan"),
counts_col1 = c(100,100,2,200,200,2),
counts_col2 = c(2,50,50,2,20,20),
id=c(1,1,1,2,2,2))
df
#> # A tibble: 6 × 5
#> col1 col2 counts_col1 counts_col2 id
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 apple pple 100 2 1
#> 2 apple app 100 50 1
#> 3 pple app 2 50 1
#> 4 banana bananna 200 2 2
#> 5 banana banan 200 20 2
#> 6 bananna banan 2 20 2
Created on 2022-03-16 by the reprex package (v2.0.1)
I want my data frame to look like this
id central_fruit fruits counts sum_counts
1 apple apple,pple,app 100,50,2 152
2 banana banana,bananna,banan 200,20,2 222
The format of the output it does not have to be like this. This is just an example. It can be a list of characters or just characters.
Any help or guidance is appreciated
>Solution :
We may do this by first reshaping to ‘long’ format (pivot_longer), grouped by ‘id’, ‘grp’, create a frequency count (add_count), then summarise the ‘central_fruit’ which had the max frequency by ‘id’, and similarly paste (toString) the unique fruit, and unique count along with sum of unique count
library(dplyr)
library(stringr)
library(tidyr)
df %>%
rename_with(~ str_c("fruit_", .x), starts_with('col')) %>%
pivot_longer(cols = -id, names_to = c(".value", "grp"),
names_pattern = "(.*)_(col\\d+)") %>%
group_by(id, grp) %>%
add_count(fruit) %>%
group_by(id) %>%
summarise(central_fruit = fruit[which.max(n)],
fruits = toString(unique(fruit)),
sum_counts = sum(unique(counts)),
counts = toString(sort(unique(counts), decreasing = TRUE)),
.groups = 'drop' ) %>%
relocate(counts, .before = 'sum_counts')
-output
# A tibble: 2 × 5
id central_fruit fruits counts sum_counts
<dbl> <chr> <chr> <chr> <dbl>
1 1 apple apple, pple, app 100, 50, 2 152
2 2 banana banana, bananna, banan 200, 20, 2 222
NOTE: It may be better to wrap the values of ‘counts’ in a list instead of pasteing. i.e. instead of counts = toString(sort(unique(counts), decreasing = TRUE)), it would be
counts = list(sort(unique(counts), decreasing = TRUE))