Selecting top and bottom 10% to be plotted in ggplot2

September 26, 2023

I would like to easily select the top and bottom 10% of the mean of a variable to be plotted in a ggplot. I have a larger data set over a period of 2 years where "Treatments" have been repeated. I would like to find the mean over the 2 years and only plot the Treatments whose means are the top and bottom 10% of all treatments.

Currently I create the plot with all Treatments for a given variable, find the top and bottom 10% by hand, and then restrict the final graph to those Treatments using subset(). This is too time-consuming and cannot easily be reused for another variable.

I’ve replicated this using the starwars data set:

ggplot(subset(starwars,homeworld %in% c("Quermia","Kashyyyk","Kalee","Kamino","Aleen Minor","Endor","Vulpter","Malastare")), aes(x=`homeworld`, y=`height`, fill = homeworld)) +
  geom_boxplot(outlier.shape = NA) +
  stat_summary(fun=mean, geom="point", shape=20, size=5, color="red", fill="red") +
  theme(legend.position = 'none') +
  theme(axis.text.x = element_text(angle = 40,hjust = 1, vjust = 1,face = "bold",
                                   colour = "black", size = rel(0.8)))

Ideally, I would have a line of code that could be copied and used for, in the starwars example, mass instead of height. Using my current approach, I would have to plot all homeworlds and then select the ones I would like to add in the final plot.

>Solution :

One way to do this would be to create a function that does the subsetting for you, so you don’t have to pick the groups by hand. That’s what the best_worst() function below does. It takes your data, a grouping variable, and the variable whose mean you want to calculate, and returns the rows belonging to the floor(prop*n) groups with the highest means and the floor(prop*n) groups with the lowest means, where n is the number of groups. You can then use this data in the plot.

library(dplyr)
library(ggplot2)
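# Keep only the rows of .data whose group mean of .vbl is among the
# lowest or highest `prop` share of all group means.
best_worst <- function(.data, .group, .vbl, prop = .1, ...){
  # One row per group: mean of .vbl, sorted from lowest to highest
  sum_data <- .data %>% 
    filter(!is.na({{.vbl}})) %>% 
    group_by({{.group}}) %>% 
    summarise(x = mean({{.vbl}})) %>% 
    arrange(x)
  # Number of groups to keep at each end (can be 0 when there are few groups)
  n_keep <- floor(nrow(sum_data) * prop)
  # arrange() sorts ascending, so the head holds the lowest means
  bottom <- sum_data %>% 
    slice_head(n = n_keep) %>% 
    pull({{.group}})
  top <- sum_data %>% 
    slice_tail(n = n_keep) %>% 
    pull({{.group}})
  .data %>% filter({{.group}} %in% c(top, bottom))
}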
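
To see what the function selects, it can help to run its intermediate steps by hand. The snippet below is only a sketch of that check for height in starwars; the names group_means, lowest and highest are mine and are not used by the function itself.

```r
library(dplyr)

# Mean height per homeworld, sorted from lowest to highest
# (rows with a missing homeworld form their own NA group)
group_means <- starwars %>%
  filter(!is.na(height)) %>%
  group_by(homeworld) %>%
  summarise(x = mean(height)) %>%
  arrange(x)

# Number of groups kept at each end for prop = 0.1
n_keep <- floor(nrow(group_means) * 0.1)

lowest  <- slice_head(group_means, n = n_keep)  # smallest means
highest <- slice_tail(group_means, n = n_keep)  # largest means

lowest
highest

# With few groups the count rounds down to zero, e.g. 5 groups at 10%:
floor(5 * 0.1)  # 0
```

Because of the floor(), a data set with fewer than ten groups gives an empty result at prop = 0.1, so a larger prop may be needed there.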

ggplot(best_worst(starwars, homeworld, height), aes(x=`homeworld`, y=`height`, fill = homeworld)) +
  geom_boxplot(outlier.shape = NA) +
  stat_summary(fun=mean, geom="point", shape=20, size=5, color="red", fill="red") +
  theme(legend.position = 'none') +
  theme(axis.text.x = element_text(angle = 40,hjust = 1, vjust = 1,face = "bold",
                                   colour = "black", size = rel(0.8)))
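
Switching to another variable, such as mass, only requires changing the variable name in the function call and in aes(). As a self-contained illustration, here is a rough sketch of a variant that picks the groups with quantile() cutoffs instead of counting rows. extreme_groups() is a hypothetical helper of my own, not the function above, and it can select slightly different groups than best_worst() because the quantile cutoffs are interpolated and ties are all kept.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: keep groups whose mean of .vbl is at or below the
# `prop` quantile, or at or above the `1 - prop` quantile, of all group means.
extreme_groups <- function(.data, .group, .vbl, prop = 0.1) {
  means <- .data %>%
    filter(!is.na({{ .vbl }}), !is.na({{ .group }})) %>%
    group_by({{ .group }}) %>%
    summarise(m = mean({{ .vbl }}), .groups = "drop")
  cutoffs <- quantile(means$m, probs = c(prop, 1 - prop))
  keep <- means %>%
    filter(m <= cutoffs[[1]] | m >= cutoffs[[2]]) %>%
    pull({{ .group }})
  .data %>% filter({{ .group }} %in% keep)
}

# Same plot as before, but for mass instead of height
mass_extremes <- extreme_groups(starwars, homeworld, mass)

p <- ggplot(mass_extremes, aes(x = homeworld, y = mass, fill = homeworld)) +
  geom_boxplot(outlier.shape = NA) +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 5, color = "red") +
  theme(legend.position = "none")
p
```

The same call pattern works with the original function: best_worst(starwars, homeworld, mass) with y = mass in aes().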