Selecting top and bottom 10% to be plotted in ggplot2

I would like an easy way to plot, in ggplot2, only the groups whose means fall in the top and bottom 10% for a given variable. I have a larger data set covering 2 years in which "Treatments" have been repeated. I would like to compute the mean of each Treatment over the 2 years and plot only the Treatments whose means are in the top and bottom 10% of all Treatments.

Currently I create the plot with all Treatments for a given variable, find the top and bottom 10% by hand, and then use subset() to keep only those Treatments in the final graph. This is too time-consuming and cannot easily be transferred to another variable.

I’ve replicated this using the starwars data set:

library(dplyr)    # provides the starwars data set
library(ggplot2)
# Homeworlds picked by hand after inspecting the full plot
ggplot(subset(starwars, homeworld %in% c("Quermia","Kashyyyk","Kalee","Kamino","Aleen Minor","Endor","Vulpter","Malastare")), aes(x=homeworld, y=height, fill = homeworld)) +
  geom_boxplot(outlier.shape = NA) +
  # fun replaces fun.y, which is deprecated since ggplot2 3.3.0
  stat_summary(fun=mean, geom="point", shape=20, size=5, color="red", fill="red") +
  theme(legend.position = 'none') +
  theme(axis.text.x = element_text(angle = 40,hjust = 1, vjust = 1,face = "bold",
                                   colour = "black", size = rel(0.8)))

Ideally, I would have a line of code that could be copied and used for, in the starwars example, mass instead of height. Using my current approach, I would have to plot all homeworlds and then select the ones I would like to add in the final plot.

>Solution :

One way to do this is to write a function that does the subsetting for you, so you don’t have to do it by hand. That’s what the best_worst() function below does. It takes your data, a grouping variable, and the variable whose mean you want to calculate. It returns the rows of the data that belong to the floor(prop * n) groups with the highest means and the floor(prop * n) groups with the lowest means, where n is the number of groups. You can then use this data in the plot.

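library(dplyr)
library(ggplot2)

# Keep only the rows of .data whose .group has one of the lowest or highest
# means of .vbl; prop is the share of groups kept at each end.
best_worst <- function(.data, .group, .vbl, prop = .1){
  # One row per group with its mean, sorted from lowest to highest
  sum_data <- .data %>% 
    group_by({{.group}}) %>% 
    filter(!is.na({{.vbl}})) %>% 
    summarise(x = mean({{.vbl}}, na.rm=TRUE)) %>% 
    arrange(x)
  n <- nrow(sum_data)
  # floor() rounds down, so with fewer than 1/prop groups nothing is kept
  n_keep <- floor(n*prop)
  # Sorted ascending: the head holds the lowest means, the tail the highest
  bottom <- sum_data %>% 
    slice_head(n=n_keep) %>% 
    pull({{.group}})
  top <- sum_data %>% 
    slice_tail(n=n_keep) %>% 
    pull({{.group}})
  .data %>% filter({{.group}} %in% c(top, bottom))
}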
ggplot(best_worst(starwars, homeworld, height), aes(x=`homeworld`, y=`height`, fill = homeworld)) +
  geom_boxplot(outlier.shape = NA) +
  stat_summary(fun=mean, geom="point", shape=20, size=5, color="red", fill="red") +
  theme(legend.position = 'none') +
  theme(axis.text.x = element_text(angle = 40,hjust = 1, vjust = 1,face = "bold",
                                   colour = "black", size = rel(0.8)))

Created on 2023-09-26 with reprex v2.0.2
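
The question asks for something that can be reused for mass instead of height. With a helper like the one above, only the variable argument and the y aesthetic need to change. The sketch below shows that in a self-contained form. It uses a compact variant of the helper built on slice_min() and slice_max(); the names extreme_groups(), mass_extremes and p_mass are illustrative assumptions and not part of the original answer.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper (the name is an assumption): keep the rows of .data whose
# group mean of .vbl is among the lowest or highest `prop` share of group means.
extreme_groups <- function(.data, .group, .vbl, prop = 0.1) {
  # One row per group with its mean, ignoring missing values of .vbl
  group_means <- .data %>%
    filter(!is.na({{ .vbl }})) %>%
    group_by({{ .group }}) %>%
    summarise(avg = mean({{ .vbl }}), .groups = "drop")

  # Rounded down, so fewer than 1 / prop groups keeps nothing
  n_keep <- floor(nrow(group_means) * prop)

  # with_ties = FALSE returns exactly n_keep groups at each end
  lowest <- group_means %>%
    slice_min(avg, n = n_keep, with_ties = FALSE) %>%
    pull({{ .group }})
  highest <- group_means %>%
    slice_max(avg, n = n_keep, with_ties = FALSE) %>%
    pull({{ .group }})

  .data %>% filter({{ .group }} %in% c(lowest, highest))
}

# Same plot as before, with mass swapped in for height
mass_extremes <- extreme_groups(starwars, homeworld, mass)

p_mass <- ggplot(mass_extremes, aes(x = homeworld, y = mass, fill = homeworld)) +
  geom_boxplot(outlier.shape = NA) +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 5, color = "red") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 40, hjust = 1, vjust = 1))
```

Printing p_mass draws the plot. Because the number of groups kept at each end is rounded down, a data set with fewer than 1 / prop groups returns no rows, so a larger prop may be needed for small data sets.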
