Say I have a data frame of plant families with the read counts for each sample. I want to return another data frame with the sample in column 1 and, in column 2, the family that contributes the most reads to that sample.
Sample dataframe:
structure(list(Family = c("Asteraceae", "Fabaceae", "Plantaginaceae",
"Hypericaceae", "Paulowniaceae", "Lamiaceae", "Apiaceae", "Rosaceae",
"Cymodoceaceae"), MKC09 = c(651L, 136298L, 127L, 34L, 0L, 0L,
0L, 0L, 0L), MKC100 = c(186371L, 0L, 61L, 0L, 53249L, 0L, 0L,
0L, 0L), MKC103 = c(246L, 0L, 234794L, 91L, 0L, 0L, 0L, 0L, 0L
), MKC104 = c(165L, 33L, 284329L, 0L, 0L, 0L, 0L, 0L, 0L), MKC105 = c(295L,
185706L, 111L, 37L, 30L, 0L, 0L, 0L, 0L), MKC106 = c(148433L,
66L, 13326L, 49L, 31655L, 0L, 0L, 0L, 0L), MKC109 = c(200921L,
65L, 34L, 32L, 54564L, 0L, 0L, 0L, 26L), MKC110 = c(161839L,
141L, 57948L, 40L, 797L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-9L))
          Family  MKC09 MKC100 MKC103 MKC104 MKC105 MKC106 MKC109 MKC110
1     Asteraceae    651 186371    246    165    295 148433 200921 161839
2       Fabaceae 136298      0      0     33 185706     66     65    141
3 Plantaginaceae    127     61 234794 284329    111  13326     34  57948
4   Hypericaceae     34      0     91      0     37     49     32     40
5  Paulowniaceae      0  53249      0      0     30  31655  54564    797
6      Lamiaceae      0      0      0      0      0      0      0      0
7       Apiaceae      0      0      0      0      0      0      0      0
8       Rosaceae      0      0      0      0      0      0      0      0
9  Cymodoceaceae      0      0      0      0      0      0     26      0
I want a simple new data frame, something like this, that tells me which plant family is likely the main one in each sample:
Sample Family
MKC09 Fabaceae
MKC100 Asteraceae
MKC103 Plantaginaceae
MKC104 Plantaginaceae
MKC105 Fabaceae
MKC106 Asteraceae
MKC109 Asteraceae
MKC110 Asteraceae
I can get the top value using this code, but I don’t know how to then match it back to the first column to return the name instead:
colMax <- function(data) sapply(data, max, na.rm = TRUE)
newdf <- as.data.frame(colMax(df))
colMax(df)
Family Rosaceae
MKC09 136298
MKC100 186371
MKC103 234794
MKC104 284329
MKC105 185706
MKC106 148433
MKC109 200921
MKC110 161839
I don’t want the first "Rosaceae" to show up, rather, I want the family names to show up that give that max value, like in my expected solution above.
>Solution :
A tidyverse approach:
library(dplyr)
library(tidyr)
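df %>%
  pivot_longer(-Family) %>%  # from wide to long format: one row per Family/sample pair
  group_by(name) %>%         # one group per sample (the sample IDs end up in 'name')
  slice_max(value, n = 1)    # keep the row(s) with the max value per group (ties are kept by default)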
# A tibble: 8 × 3
# Groups:   name [8]
  Family         name    value
  <chr>          <chr>   <int>
1 Fabaceae       MKC09  136298
2 Asteraceae     MKC100 186371
3 Plantaginaceae MKC103 234794
4 Plantaginaceae MKC104 284329
5 Fabaceae       MKC105 185706
6 Asteraceae     MKC106 148433
7 Asteraceae     MKC109 200921
8 Asteraceae     MKC110 161839
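If you would rather stay in base R, a minimal sketch along the lines of your colMax() attempt is to ask for the position of the maximum instead of its value, then use that position to index the Family column. The toy df below is an assumption for illustration (three rows and two samples copied from the question), and the names counts, top_row and result are made up here:

```r
# Toy data with the same layout as the question (subset of its values)
df <- data.frame(
  Family = c("Asteraceae", "Fabaceae", "Plantaginaceae"),
  MKC09  = c(651L, 136298L, 127L),
  MKC100 = c(186371L, 0L, 61L)
)

# Drop the Family column so only the numeric sample columns remain
counts <- df[-1]

# which.max() gives the row index of the first maximum in each column
top_row <- sapply(counts, which.max)

# Map those row indices back to the family names
result <- data.frame(
  Sample = names(counts),
  Family = df$Family[top_row],
  row.names = NULL
)
result
```

Note that which.max() only returns the first maximum, so a tie (or a sample whose counts are all zero) silently resolves to the earliest row. The tidyverse pipeline above can be brought to the same two-column shape by appending ungroup() %>% select(Sample = name, Family).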