Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Unifying variable level names based on manual entry text variations in R

The data I am using are based on self completed, mannually entered text responses to a questionnaire.

The problem is, especially with regards to fish species, people abbreviate, call them by different names, spell things wrong, etc.

How do I take all of the related names in a series of three columns and make them into one unified name so that I can perform analysis on them?

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

structure(list(species_1_target = c("Blacktail", "Craker", "Black tail", 
"Musselcracker", "Musselcracker", "Garick", "White musselcracker", 
"Blacktails", "Kob", "Any", "White cracker", "Musslecrack", "Galjoen", 
"Cracker", "Blacktail", "White Musselcracker", "Blacktail", "Anything", 
"poensie", "cracker", "cracker", "cracker", "glajoen", "blacktail", 
"steenie", "Musselcracker white", "Steenbras"), species_2_target = c("Steenbras", 
"Steambrass", NA, "Elf", "Galjoen", "Elf", "Black musselcracker", 
"Stumpnose", "Bluefish", NA, "Blacktail", "Steenie", "Kob", "Poensie", 
NA, NA, "Steenies", NA, "cracker", "galjoen", NA, NA, NA, "cracker", 
"blacktail", "Black musselcracker", "Galjoen"), species_3_target = c("Octopus", 
"Black tail", NA, "Steenbrass", NA, "Kob", "Kob", "Sandshark", 
"Steen ras", NA, NA, "Kob", NA, "Kob", NA, NA, "Sandsharks", 
NA, "gully shark", "gully shark", NA, NA, NA, NA, NA, "Kob", 
NA)), class = "data.frame", row.names = 3:29)

For example: "black tail", "blacktail", "Black Tail", "Blacktail" etc. could all be renamed "blacktail"
"Steenies", "steenbras", "steenie", "steambras" -> "Steenbras"

Is there a way to do this using partial string matching to save time/effort?

I know how to manually rename each one, however, there are over 150 rows in the full data and having to find and rename each unique variation can be tedious to say the least.
Additionally, data entries are ongoing so the list of inaccurately entered species names will likely continue to grow.

>Solution :

Probably not optimal, but it works.

library(tidyverse)
df <- structure(list(species_1_target = c("Blacktail", "Craker", "Black tail", 
                                          "Musselcracker", "Musselcracker", "Garick", "White musselcracker", 
                                          "Blacktails", "Kob", "Any", "White cracker", "Musslecrack", "Galjoen", 
                                          "Cracker", "Blacktail", "White Musselcracker", "Blacktail", "Anything", 
                                          "poensie", "cracker", "cracker", "cracker", "glajoen", "blacktail", 
                                          "steenie", "Musselcracker white", "Steenbras"), species_2_target = c("Steenbras", 
                                                                                                               "Steambrass", NA, "Elf", "Galjoen", "Elf", "Black musselcracker", 
                                                                                                               "Stumpnose", "Bluefish", NA, "Blacktail", "Steenie", "Kob", "Poensie", 
                                                                                                               NA, NA, "Steenies", NA, "cracker", "galjoen", NA, NA, NA, "cracker", 
                                                                                                               "blacktail", "Black musselcracker", "Galjoen"), species_3_target = c("Octopus", 
                                                                                                                                                                                    "Black tail", NA, "Steenbrass", NA, "Kob", "Kob", "Sandshark", 
                                                                                                                                                                                    "Steen ras", NA, NA, "Kob", NA, "Kob", NA, NA, "Sandsharks", 
                                                                                                                                                                                    NA, "gully shark", "gully shark", NA, NA, NA, NA, NA, "Kob", 
                                                                                                                                                                                    NA)), class = "data.frame", row.names = 3:29) 

df %>%  
  as_tibble() %>%  
  mutate(across(everything(), ~ str_to_lower(.x)),
         across(everything(), 
                ~ case_when(str_detect(.x, "black") ~ "blacktail",
                            str_detect(.x, "steen") ~ "steenbras",
                            TRUE ~ .x) ),
         across(everything(), ~ str_to_title(.x)))
#> # A tibble: 27 x 3
#>    species_1_target    species_2_target species_3_target
#>    <chr>               <chr>            <chr>           
#>  1 Blacktail           Steenbras        Octopus         
#>  2 Craker              Steambrass       Blacktail       
#>  3 Blacktail           <NA>             <NA>            
#>  4 Musselcracker       Elf              Steenbras       
#>  5 Musselcracker       Galjoen          <NA>            
#>  6 Garick              Elf              Kob             
#>  7 White Musselcracker Blacktail        Kob             
#>  8 Blacktail           Stumpnose        Sandshark       
#>  9 Kob                 Bluefish         Steenbras       
#> 10 Any                 <NA>             <NA>            
#> # ... with 17 more rows

Created on 2022-06-16 by the reprex package (v2.0.1)

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading