Get the categorical value for each column based on sum

March 13, 2024

Say I have a data frame of plant families with the read counts for each sample. I want to build another data frame that lists each sample in column 1 and, next to it, the family with the highest read count in that sample, i.e. the likely biggest contributor to that sample's total.

Sample dataframe:

structure(list(Family = c("Asteraceae", "Fabaceae", "Plantaginaceae", 
"Hypericaceae", "Paulowniaceae", "Lamiaceae", "Apiaceae", "Rosaceae", 
"Cymodoceaceae"), MKC09 = c(651L, 136298L, 127L, 34L, 0L, 0L, 
0L, 0L, 0L), MKC100 = c(186371L, 0L, 61L, 0L, 53249L, 0L, 0L, 
0L, 0L), MKC103 = c(246L, 0L, 234794L, 91L, 0L, 0L, 0L, 0L, 0L
), MKC104 = c(165L, 33L, 284329L, 0L, 0L, 0L, 0L, 0L, 0L), MKC105 = c(295L, 
185706L, 111L, 37L, 30L, 0L, 0L, 0L, 0L), MKC106 = c(148433L, 
66L, 13326L, 49L, 31655L, 0L, 0L, 0L, 0L), MKC109 = c(200921L, 
65L, 34L, 32L, 54564L, 0L, 0L, 0L, 26L), MKC110 = c(161839L, 
141L, 57948L, 40L, 797L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, 
-9L))

          Family  MKC09 MKC100 MKC103 MKC104 MKC105 MKC106 MKC109 MKC110
1     Asteraceae    651 186371    246    165    295 148433 200921 161839
2       Fabaceae 136298      0      0     33 185706     66     65    141
3 Plantaginaceae    127     61 234794 284329    111  13326     34  57948
4   Hypericaceae     34      0     91      0     37     49     32     40
5  Paulowniaceae      0  53249      0      0     30  31655  54564    797
6      Lamiaceae      0      0      0      0      0      0      0      0
7       Apiaceae      0      0      0      0      0      0      0      0
8       Rosaceae      0      0      0      0      0      0      0      0
9  Cymodoceaceae      0      0      0      0      0      0     26      0

I want a simple new data frame, something like this, that tells me which plant family is likely the main one for each sample:

Sample    Family
MKC09     Fabaceae
MKC100    Asteraceae
MKC103    Plantaginaceae
MKC104    Plantaginaceae
MKC105    Fabaceae
MKC106    Asteraceae
MKC109    Asteraceae
MKC110    Asteraceae

I can get the top value using this code, but I don’t know how to then match it back to the first column to return the name instead:

colMax <- function(data) sapply(data, max, na.rm=TRUE)
newdf <- as.data.frame(colMax(dat))

       colMax(dat)
Family    Rosaceae
MKC09       136298
MKC100      186371
MKC103      234794
MKC104      284329
MKC105      185706
MKC106      148433
MKC109      200921
MKC110      161839

I don’t want that first "Rosaceae" row to show up. Instead, I want the family name that produces each column’s max value, like in my expected output above.

>Solution :

A tidyverse approach:

library(dplyr)
library(tidyr)
 

# df is the sample data frame from the question
df %>% 
  pivot_longer(-Family) %>%  # from wide to long format
  group_by(name) %>%         # one group per sample column
  slice_max(value, n = 1)    # row(s) with the max value per group; ties are all kept

# A tibble: 8 × 3
# Groups:   name [8]
  Family         name    value
  <chr>          <chr>   <int>
1 Fabaceae       MKC09  136298
2 Asteraceae     MKC100 186371
3 Plantaginaceae MKC103 234794
4 Plantaginaceae MKC104 284329
5 Fabaceae       MKC105 185706
6 Asteraceae     MKC106 148433
7 Asteraceae     MKC109 200921
8 Asteraceae     MKC110 161839
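
If you would rather stay in base R and get exactly the two-column Sample/Family layout from the question, here is a minimal sketch. It assumes the sample data is stored in a data frame called dat (the name used in the question's output) and that no sample column is entirely NA. The names sample_cols, top_idx and result are just illustrative.

```r
# Sample data from the question, assigned to `dat` (assumed name).
dat <- data.frame(
  Family = c("Asteraceae", "Fabaceae", "Plantaginaceae", "Hypericaceae",
             "Paulowniaceae", "Lamiaceae", "Apiaceae", "Rosaceae",
             "Cymodoceaceae"),
  MKC09  = c(651L, 136298L, 127L, 34L, 0L, 0L, 0L, 0L, 0L),
  MKC100 = c(186371L, 0L, 61L, 0L, 53249L, 0L, 0L, 0L, 0L),
  MKC103 = c(246L, 0L, 234794L, 91L, 0L, 0L, 0L, 0L, 0L),
  MKC104 = c(165L, 33L, 284329L, 0L, 0L, 0L, 0L, 0L, 0L),
  MKC105 = c(295L, 185706L, 111L, 37L, 30L, 0L, 0L, 0L, 0L),
  MKC106 = c(148433L, 66L, 13326L, 49L, 31655L, 0L, 0L, 0L, 0L),
  MKC109 = c(200921L, 65L, 34L, 32L, 54564L, 0L, 0L, 0L, 26L),
  MKC110 = c(161839L, 141L, 57948L, 40L, 797L, 0L, 0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Every column except Family is treated as a sample column.
sample_cols <- setdiff(names(dat), "Family")

# For each sample column, find the row index of its largest value.
# which.max() returns the first index when there are ties.
top_idx <- vapply(dat[sample_cols], which.max, integer(1))

# Use those row indices to look up the matching family names.
result <- data.frame(
  Sample = sample_cols,
  Family = dat$Family[top_idx],
  stringsAsFactors = FALSE
)

print(result)
```

This gives the same sample-to-family pairing as the expected table in the question. Note the difference in tie handling: which.max() keeps only the first maximum per column, whereas slice_max() keeps all tied rows unless you pass with_ties = FALSE.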