Create a temporary group in dplyr group_by

August 3, 2022

I would like to group all members of the same genus together for some summary statistics, but keep their full names in the original dataframe. I know that I could change their names or create a new column in the original dataframe, but I am looking for a more elegant solution. I would like to implement this in R with the dplyr package.

Example data here https://knb.ecoinformatics.org/knb/d1/mn/v2/object/urn%3Auuid%3Aeb176981-1909-4d6d-ac07-3406e4efc43f

I would like to group all clams of the genus Macoma as one group, "Macoma sp.", ideally creating this grouping within the pipeline below, perhaps before the group_by(site_code, species_scientific) step:

summary <- data %>% 
  group_by(site_code, species_scientific) %>% 
  summarize(mean_size = mean(width_mm))

Note that there are multiple Macoma xxx species and multiple other species that I want to group as is.

>Solution :

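We can overwrite species_scientific inside the pipeline, replacing every element that contains the substring ‘Macoma’ (detected with str_detect) with ‘Macoma’, then use that column for grouping and take the mean. Because the result of mutate() is not assigned back to data, the original data frame keeps the full species names.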

library(dplyr)
library(stringr)

data %>%
  # relabel every Macoma species as "Macoma" for this pipeline only
  mutate(species_scientific = replace(species_scientific,
                                      str_detect(species_scientific, "Macoma"),
                                      "Macoma")) %>%
  group_by(site_code, species_scientific) %>%
  summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')

-output

# A tibble: 97 × 3
   site_code species_scientific     mean_size
   <chr>     <chr>                      <dbl>
 1 H_01_a    Clinocardium nuttallii      33.9
 2 H_01_a    Macoma                      41.0
 3 H_01_a    Protothaca staminea         37.3
 4 H_01_a    Saxidomus gigantea          56.0
 5 H_01_a    Tresus nuttallii           100. 
 6 H_02_a    Clinocardium nuttallii      35.1
 7 H_02_a    Macoma                      41.3
 8 H_02_a    Protothaca staminea         38.0
 9 H_02_a    Saxidomus gigantea          54.7
10 H_02_a    Tresus nuttallii            50.5
# … with 87 more rows
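
To check that the regrouping really is temporary, here is a minimal sketch on a small made-up data frame. The values in `toy` are invented for illustration and are not taken from the linked dataset, and `species_group` is a hypothetical column name. It builds the grouping directly inside group_by(), uses the "Macoma sp." label from the question, and leaves species_scientific untouched.

```r
library(dplyr)
library(stringr)

# Toy data, invented for illustration only (not from the linked dataset)
toy <- tibble(
  site_code = c("A", "A", "A", "B", "B"),
  species_scientific = c("Macoma nasuta", "Macoma inquinata",
                         "Saxidomus gigantea", "Macoma nasuta",
                         "Saxidomus gigantea"),
  width_mm = c(40, 42, 56, 38, 54)
)

toy_summary <- toy %>%
  group_by(
    site_code,
    # species_group is a hypothetical name; it exists only in the summary
    species_group = if_else(str_detect(species_scientific, "^Macoma"),
                            "Macoma sp.", species_scientific)
  ) %>%
  summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = "drop")

toy_summary

# The original column still holds the full species names
unique(toy$species_scientific)
```

Anchoring the pattern with "^" is a design choice assumed here: it restricts the match to names that start with the genus, so a name that merely contains the string elsewhere would not be regrouped.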

If the intention is instead to keep only the first word of ‘species_scientific’ (the genus) for every species, the genus can be computed directly inside group_by:

data %>%
  group_by(genus = str_remove(species_scientific, "\\s+.*"), site_code) %>%
  summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')

-output

# A tibble: 97 × 3
   genus        site_code mean_size
   <chr>        <chr>         <dbl>
 1 Clinocardium H_01_a         33.9
 2 Clinocardium H_02_a         35.1
 3 Clinocardium H_03_a         37.5
 4 Clinocardium H_04_a         48.2
 5 Clinocardium H_05_a         37.6
 6 Clinocardium H_06_a         38.7
 7 Clinocardium H_07_a         40.2
 8 Clinocardium L_01_a         44.4
 9 Clinocardium L_02_a         54.8
10 Clinocardium L_03_a         61.1
# … with 87 more rows
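
As a quick illustration of what the regular expression does, the sketch below strips everything from the first whitespace onward and compares the result with stringr's word(), which may read more clearly as an alternative. The `species` vector is a small hand-written sample, assumed for illustration rather than drawn from the dataset.

```r
library(stringr)

# Small hand-written sample of names, for illustration only
species <- c("Macoma nasuta", "Saxidomus gigantea", "Tresus nuttallii")

# Remove the first run of whitespace and everything after it
genus_regex <- str_remove(species, "\\s+.*")

# Alternative: take the first space-separated word
genus_word <- word(species, 1)

genus_regex
identical(genus_regex, genus_word)
```

Note that word() splits on a single space by default, so the two approaches can differ if a name contains tabs or repeated spaces; the regular expression handles any whitespace.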