I would like to use the ggupset package to create an upset plot but I am struggling to format my data correctly. My data is currently in a tibble similar to the one below.
> tibble
# A tibble: 13 × 3
locus pathway fold_change
<chr> <chr> <dbl>
1 0001 A 1
2 0001 B 1
3 0001 C 1
4 0001 D 1
5 0002 B -2
6 0002 D -2
7 0003 C 1
8 0004 C 3
9 0004 E 3
10 0004 F 3
11 0004 G 3
12 0004 H 3
13 0005 D 2.5
ggupset requires the pathway column to be a list-column, with one list of pathways per locus, as in the fake tibble below (the tidy_movies dataset that ships with ggupset uses the same format).
> fake_tibble
# A tibble: 5 x 3
locus pathways fold_change
<chr> <list> <dbl>
1 0001 "A" "B" "C" "D" 1
2 0002 "B" "D" -2
3 0003 "C" 1
4 0004 "C" "E" "F" "G" "H" 3
5 0005 "D" 2.5
The real dataset is too large to build the list for each locus by hand, so any help wrangling this data would be appreciated.
>Solution :
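Use summarise() with list() to collapse the pathways of each locus into a list-column. Grouping by fold_change as well keeps that column in the result; this assumes fold_change is constant within a locus, as it is in the example data.
library(dplyr)
df %>%
  group_by(locus, fold_change) %>%
  # list() stores all pathways of a group in a single list-column cell
  summarise(pathway = list(pathway), .groups = "drop")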
locus fold_change pathway
<int> <dbl> <list>
1 1 1 <chr [4]>
2 2 -2 <chr [2]>
3 3 1 <chr [1]>
4 4 3 <chr [5]>
5 5 2.5 <chr [1]>
data
df <- structure(list(locus = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L,
4L, 4L, 4L, 5L), pathway = c("A", "B", "C", "D", "B", "D", "C",
"C", "E", "F", "G", "H", "D"), fold_change = c(1, 1, 1, 1, -2,
-2, 1, 3, 3, 3, 3, 3, 2.5)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13"
))
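To check that the list-column is in a shape ggupset accepts, here is a minimal sketch, assuming dplyr, ggplot2 and ggupset are installed. The names df_list and p are only illustrative, and the example data is rebuilt so the snippet runs on its own.

```r
library(dplyr)
library(ggplot2)
library(ggupset)

# Same example data as above, rebuilt here so the snippet is self-contained
df <- data.frame(
  locus = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 4L, 4L, 4L, 5L),
  pathway = c("A", "B", "C", "D", "B", "D", "C", "C", "E", "F", "G", "H", "D"),
  fold_change = c(1, 1, 1, 1, -2, -2, 1, 3, 3, 3, 3, 3, 2.5)
)

# Collapse to one row per locus, with the pathways held in a list-column
df_list <- df %>%
  group_by(locus, fold_change) %>%
  summarise(pathway = list(pathway), .groups = "drop")

# Each cell of the list-column is a character vector of pathways
df_list$pathway[[2]]

# Count loci per pathway combination; scale_x_upset() replaces the
# x axis with the combination matrix
p <- ggplot(df_list, aes(x = pathway)) +
  geom_bar() +
  scale_x_upset()
p
```

With only five loci every bar has height one, so the plot is mostly a sanity check; on the real dataset the bars would show how many loci share each pathway combination.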