How to perform majority voting from a data frame with ranking criteria

I have the following data frame:

dat <- structure(list(model_name = c("Random Forest", "XGBoost", "XGBoost-reg", 
"Null model", "Plain LM", "Elastic LM", "LM-pep.charge", "LM-rf.10vip"
), RMSE = c(0.853, 0.886, 0.719, 2.41, 16.6, 0.731, 1.16, 1.03
), MAE = c(0.545, 0.708, 0.589, 1.98, 8.6, 0.588, 0.874, 0.729
), `R^2` = c(0.806, 0.865, 0.915, NA, 0.0645, 0.927, 0.8, 0.822
), ccc = c(0.89, 0.928, 0.951, 0, 0.0685, 0.945, 0.847, 0.901
)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
))


It looks like this:

  model_name      RMSE   MAE   `R^2`    ccc
  <chr>          <dbl> <dbl>   <dbl>  <dbl>
1 Random Forest  0.853 0.545  0.806  0.89  
2 XGBoost        0.886 0.708  0.865  0.928 
3 XGBoost-reg    0.719 0.589  0.915  0.951 
4 Null model     2.41  1.98  NA      0     
5 Plain LM      16.6   8.6    0.0645 0.0685
6 Elastic LM     0.731 0.588  0.927  0.945 
7 LM-pep.charge  1.16  0.874  0.8    0.847 
8 LM-rf.10vip    1.03  0.729  0.822  0.901 

It stores the evaluation metrics for 8 prediction models.
My goal is to select the top-performing model that consistently excels in the majority of evaluations.

By manually evaluating the metrics, I determined the top performing model this way:

Metrics -> Top 1
-----------------
RMSE -> XGBoost-reg 
MAE -> Random Forest
R^2 -> Elastic LM 
CCC -> XGBoost-reg 

# Therefore, the winner is XGBoost-reg

It’s worth noting that RMSE and MAE are error measures, with lower values indicating better performance, while R^2 and CCC are correlation measures, with higher values indicating better performance.

How can I do this with R?

Solution:

One option is to reshape the data to 'long' format, flip the sign of R^2 and ccc (multiply by -1) so that lower is better for every metric, group by metric ('name') and keep the row with the lowest 'value1' in each group, then count how often each model wins and take the first row:

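library(dplyr)
library(tidyr)

dat %>% 
  # one row per model/metric pair; drops the NA R^2 of the null model
  pivot_longer(cols = -model_name, values_drop_na = TRUE) %>% 
  # negate R^2 and ccc so that smaller is better for every metric
  mutate(value1 = case_when(name %in% c("R^2", "ccc") ~ -value, 
                            TRUE ~ value)) %>% 
  group_by(name) %>% 
  slice_min(value1, n = 1) %>%          # best model per metric
  ungroup() %>%
  count(model_name, sort = TRUE) %>%    # number of wins per model
  slice_head(n = 1)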

Output:

# A tibble: 1 × 2
  model_name      n
  <chr>       <int>
1 XGBoost-reg     2
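
Note that `slice_head(n = 1)` returns a single row even when two models share the highest number of wins. Below is a minimal sketch of a tie-aware variant, assuming that all tied models should be reported. The column names `metric`, `score` and `wins` are illustrative choices and not part of the original answer, and the data is rebuilt as a plain data frame so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Data from the question, rebuilt so the example is self-contained
dat <- data.frame(
  model_name = c("Random Forest", "XGBoost", "XGBoost-reg", "Null model",
                 "Plain LM", "Elastic LM", "LM-pep.charge", "LM-rf.10vip"),
  RMSE  = c(0.853, 0.886, 0.719, 2.41, 16.6, 0.731, 1.16, 1.03),
  MAE   = c(0.545, 0.708, 0.589, 1.98, 8.6, 0.588, 0.874, 0.729),
  `R^2` = c(0.806, 0.865, 0.915, NA, 0.0645, 0.927, 0.8, 0.822),
  ccc   = c(0.89, 0.928, 0.951, 0, 0.0685, 0.945, 0.847, 0.901),
  check.names = FALSE  # keep the column name "R^2" as it is
)

# Best model per metric: orient every metric so that higher is better
winners <- dat %>%
  pivot_longer(cols = -model_name, names_to = "metric",
               values_drop_na = TRUE) %>%
  mutate(score = if_else(metric %in% c("R^2", "ccc"), value, -value)) %>%
  group_by(metric) %>%
  slice_max(score, n = 1) %>%
  ungroup()

# Count wins per model; slice_max() keeps ties by default (with_ties = TRUE),
# so every model sharing the top count is returned
top_models <- winners %>%
  count(model_name, name = "wins") %>%
  slice_max(wins, n = 1)

print(top_models)
```

With the data above there is no tie, so `top_models` should contain only XGBoost-reg with two wins; `winners` holds the per-metric winners if you want to inspect the intermediate step.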

Alternatively, use summarise() with across() to pick, for each numeric column, the model_name at the index of the maximum (R^2, ccc) or the minimum (RMSE, MAE), then reshape to 'long' format and count:

dat %>% 
  summarise(across(where(is.numeric), 
    # higher is better for R^2 and ccc, lower is better for RMSE and MAE
    ~ if (cur_column() %in% c("R^2", "ccc")) 
        model_name[which.max(.x)] else model_name[which.min(.x)])) %>% 
  pivot_longer(cols = everything(), names_to = NULL) %>% 
  count(value, sort = TRUE) %>%
  slice_head(n = 1)

Output:

# A tibble: 1 × 2
  value           n
  <chr>       <int>
1 XGBoost-reg     2

Or with base R

names(which.max(table(dat$model_name[max.col(t(replace(dat[-1], 
   is.na(dat[-1]), -Inf) * list(-1, -1, 1, 1)), 'first')])))
[1] "XGBoost-reg"
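
The base R one-liner is dense, so here is a step-by-step sketch of the same idea: negate the error metrics, treat NA as the worst possible score, find the best model per metric, then tally the votes. The names `sign_by_metric`, `scores`, `winners` and `votes` are introduced here for illustration only, and `sweep()` stands in for the data-frame-times-list multiplication used above.

```r
# Data from the question, rebuilt so the example is self-contained
dat <- data.frame(
  model_name = c("Random Forest", "XGBoost", "XGBoost-reg", "Null model",
                 "Plain LM", "Elastic LM", "LM-pep.charge", "LM-rf.10vip"),
  RMSE  = c(0.853, 0.886, 0.719, 2.41, 16.6, 0.731, 1.16, 1.03),
  MAE   = c(0.545, 0.708, 0.589, 1.98, 8.6, 0.588, 0.874, 0.729),
  `R^2` = c(0.806, 0.865, 0.915, NA, 0.0645, 0.927, 0.8, 0.822),
  ccc   = c(0.89, 0.928, 0.951, 0, 0.0685, 0.945, 0.847, 0.901),
  check.names = FALSE  # keep the column name "R^2" as it is
)

# Numeric matrix of metrics: one row per model, one column per metric
metrics <- as.matrix(dat[-1])

# -1 for error measures (lower is better), +1 for correlation measures
sign_by_metric <- c(RMSE = -1, MAE = -1, `R^2` = 1, ccc = 1)

# Multiply each column by its sign so that higher is always better
scores <- sweep(metrics, 2, sign_by_metric[colnames(metrics)], `*`)

# A missing value can never win
scores[is.na(scores)] <- -Inf

# max.col() works row-wise, so transpose to get one row per metric
winner_idx <- max.col(t(scores), ties.method = "first")
winners <- setNames(dat$model_name[winner_idx], colnames(scores))

# Majority vote over the per-metric winners
votes <- table(winners)
top_model <- names(which.max(votes))

print(winners)
print(top_model)
```

Keeping `winners` as a named vector makes it easy to check the per-metric picks against the manual table in the question before trusting the final vote.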