Create a temporary group in dplyr group_by

I would like to group all members of the same genus together for some summary statistics, while keeping their full names in the original data frame. I know that I could change their names or create a new column in the original data frame, but I am looking for a more elegant solution. I would like to do this in R with the dplyr package.

Example data here https://knb.ecoinformatics.org/knb/d1/mn/v2/object/urn%3Auuid%3Aeb176981-1909-4d6d-ac07-3406e4efc43f

I would like to group all clams of the genus Macoma as one group, "Macoma sp.", ideally creating this grouping within the following pipeline, perhaps just before the group_by(site_code, species_scientific) step:

summary <- data %>% 
  group_by(site_code, species_scientific) %>% 
  summarize(mean_size = mean(width_mm))

Note that there are multiple Macoma xxx species and multiple other species that I want to group as is.

>Solution :

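We can replace the elements of species_scientific that contain the substring ‘Macoma’ (detected with str_detect) with ‘Macoma’, use the result as the grouping column, and get the mean. Because the result of mutate() is not assigned back to data, the original data frame keeps the full names.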

library(dplyr)
library(stringr)
data %>%
  mutate(species_scientific = replace(species_scientific,
    str_detect(species_scientific, "Macoma"), "Macoma")) %>%
  group_by(site_code, species_scientific) %>%
  summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')

-output

# A tibble: 97 × 3
   site_code species_scientific     mean_size
   <chr>     <chr>                      <dbl>
 1 H_01_a    Clinocardium nuttallii      33.9
 2 H_01_a    Macoma                      41.0
 3 H_01_a    Protothaca staminea         37.3
 4 H_01_a    Saxidomus gigantea          56.0
 5 H_01_a    Tresus nuttallii           100. 
 6 H_02_a    Clinocardium nuttallii      35.1
 7 H_02_a    Macoma                      41.3
 8 H_02_a    Protothaca staminea         38.0
 9 H_02_a    Saxidomus gigantea          54.7
10 H_02_a    Tresus nuttallii            50.5
# … with 87 more rows
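
If the label should read "Macoma sp." as in the question, and no column of the original data should be overwritten even inside the pipeline, the grouping variable can also be computed directly inside group_by(). This is a minimal sketch, not the only way to do it. It assumes a small made-up data frame (toy) that reuses the question's column names; the values and the name species_group are illustrative and not taken from the real dataset.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data using the question's column names
toy <- tibble(
  site_code = c("A", "A", "A", "B", "B"),
  species_scientific = c("Macoma nasuta", "Macoma inquinata",
                         "Saxidomus gigantea", "Macoma nasuta",
                         "Saxidomus gigantea"),
  width_mm = c(40, 42, 56, 38, 54)
)

# species_group is created only for the summary; toy itself is not modified
toy_summary <- toy %>%
  group_by(site_code,
           species_group = if_else(str_detect(species_scientific, "^Macoma"),
                                   "Macoma sp.", species_scientific)) %>%
  summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = "drop")

toy_summary
```

Anchoring the pattern with "^" restricts the match to names that start with Macoma, which may be safer than a plain substring match if other names could contain that text.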

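If the intention is to keep only the first word of ‘species_scientific’ (the genus) for every species, remove everything from the first whitespace onward and group by the result: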

data %>% 
  group_by(genus = str_remove(species_scientific, "\\s+.*"), site_code) %>%
  summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')

-output

# A tibble: 97 × 3
   genus        site_code mean_size
   <chr>        <chr>         <dbl>
 1 Clinocardium H_01_a         33.9
 2 Clinocardium H_02_a         35.1
 3 Clinocardium H_03_a         37.5
 4 Clinocardium H_04_a         48.2
 5 Clinocardium H_05_a         37.6
 6 Clinocardium H_06_a         38.7
 7 Clinocardium H_07_a         40.2
 8 Clinocardium L_01_a         44.4
 9 Clinocardium L_02_a         54.8
10 Clinocardium L_03_a         61.1
# … with 87 more rows
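
Before relying on a genus-level summary, it can help to check which full names collapse into each genus. The sketch below applies the same str_remove() pattern inside distinct() to build a small lookup table. The data frame toy and the name genus_lookup are made up for illustration and are not part of the original dataset.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data; only the column name comes from the question
toy <- tibble(
  species_scientific = c("Macoma nasuta", "Macoma inquinata",
                         "Macoma nasuta", "Saxidomus gigantea")
)

# One row per unique (genus, full name) pair
genus_lookup <- toy %>%
  distinct(genus = str_remove(species_scientific, "\\s+.*"),
           species_scientific)

genus_lookup
```

Here the two distinct Macoma names end up under the same genus value, which is what the grouped summary relies on.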