Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Summarise values that contains a particular word

I’m trying to summarise values within my dataset that contain the word Paris, for example:

Data looks like this:

# A tibble: 21 x 2
   lc_names                    GHI
   <chr>                     <dbl>
 1 Oloron-Sainte-Marie       1252.
 2 Orléans                   1189.
 3 Palaiseau                 1162.
 4 Pamiers                   1388.
 5 Paris, 11e arrondissement 1149.
 6 Paris, 12e arrondissement 1148.
 7 Paris, 12e arronissement  1148.
 8 Paris, 13e arrondissement 1151.
 9 Paris, 14e arrondissement 1154.
10 Paris, 15e arrondissement 1154.
# … with 11 more rows

Example code:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

agg_france %>% group_by(lc_names) %>% summarise(across(contains('Paris'), mean))

However, this does not produce the output expected.

Expected output:

 1 Oloron-Sainte-Marie       1252
 2 Orléans                   1189
 3 Palaiseau                 1162
 4 Pamiers                   1388
 5 Paris 1149                1149

Reproducible code:

structure(list(lc_names = c("Oloron-Sainte-Marie", "Orléans", 
"Palaiseau", "Pamiers", "Paris, 11e arrondissement", "Paris, 12e arrondissement", 
"Paris, 12e arronissement", "Paris, 13e arrondissement", "Paris, 14e arrondissement", 
"Paris, 15e arrondissement", "Paris, 16e arrondissement", "Paris, 17e arrondissement", 
"Paris, 18e arrondissement", "Paris, 19e arrondissement", "Paris, 1er arrondissement", 
"Paris, 20e arrondissement", "Paris, 2e arrondissement", "Paris, 3e arrondissement", 
"Paris, 4e arrondissement", "Paris, 5e arrondissement", "Paris, 6e arrondissement"
), GHI = c(1251.96996613576, 1189.28457903792, 1162.44717515565, 
1387.88195190278, 1148.93280509349, 1147.88753688082, 1147.68375976562, 
1151.08216545596, 1153.95916748047, 1153.74241428749, 1151.25635477972, 
1146.90766517376, 1145.5153276716, 1144.86586292338, 1149.09527587891, 
1146.63027231317, 1148.66392074093, 1148.57146857767, 1148.67533126532, 
1149.40849720348, 1151.19995944783)), row.names = c(NA, -21L), class = c("tbl_df", 
"tbl", "data.frame"))

>Solution :

Here’s one way to accomplish this. The idea is to first str_match all strings containing Paris, and then create a new location variable that is either Paris or something else.

library(tidyverse)
library(stringr) 

data %>% mutate(loc = str_match(lc_names, 'Paris')) %>% 
         mutate(loc = ifelse(is.na(loc), lc_names, loc)) %>% 
         mutate(loc = as.vector(loc)) %>% # str_match returns a matrix
         group_by(loc) %>%
         summarise(avg_GHI = mean(GHI))

Output:

# A tibble: 5 x 2
  loc                 avg_GHI
  <chr>                 <dbl>
1 Oloron-Sainte-Marie   1252.
2 Orléans               1189.
3 Palaiseau             1162.
4 Pamiers               1388.
5 Paris                 1149.
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading