Unifying variable level names based on manual entry text variations in R

June 16, 2022

The data I am using are based on self completed, mannually entered text responses to a questionnaire.

The problem is, especially with regards to fish species, people abbreviate, call them by different names, spell things wrong, etc.

How do I take all of the related names in a series of three columns and make them into one unified name so that I can perform analysis on them?

structure(list(species_1_target = c("Blacktail", "Craker", "Black tail", 
"Musselcracker", "Musselcracker", "Garick", "White musselcracker", 
"Blacktails", "Kob", "Any", "White cracker", "Musslecrack", "Galjoen", 
"Cracker", "Blacktail", "White Musselcracker", "Blacktail", "Anything", 
"poensie", "cracker", "cracker", "cracker", "glajoen", "blacktail", 
"steenie", "Musselcracker white", "Steenbras"), species_2_target = c("Steenbras", 
"Steambrass", NA, "Elf", "Galjoen", "Elf", "Black musselcracker", 
"Stumpnose", "Bluefish", NA, "Blacktail", "Steenie", "Kob", "Poensie", 
NA, NA, "Steenies", NA, "cracker", "galjoen", NA, NA, NA, "cracker", 
"blacktail", "Black musselcracker", "Galjoen"), species_3_target = c("Octopus", 
"Black tail", NA, "Steenbrass", NA, "Kob", "Kob", "Sandshark", 
"Steen ras", NA, NA, "Kob", NA, "Kob", NA, NA, "Sandsharks", 
NA, "gully shark", "gully shark", NA, NA, NA, NA, NA, "Kob", 
NA)), class = "data.frame", row.names = 3:29)

For example: "black tail", "blacktail", "Black Tail", "Blacktail" etc. could all be renamed "blacktail"
"Steenies", "steenbras", "steenie", "steambras" -> "Steenbras"

Is there a way to do this using partial string matching to save time/effort?

I know how to manually rename each one, however, there are over 150 rows in the full data and having to find and rename each unique variation can be tedious to say the least.
Additionally, data entries are ongoing so the list of inaccurately entered species names will likely continue to grow.

>Solution :

Probably not optimal, but it works.

library(tidyverse)
df <- structure(list(species_1_target = c("Blacktail", "Craker", "Black tail", 
                                          "Musselcracker", "Musselcracker", "Garick", "White musselcracker", 
                                          "Blacktails", "Kob", "Any", "White cracker", "Musslecrack", "Galjoen", 
                                          "Cracker", "Blacktail", "White Musselcracker", "Blacktail", "Anything", 
                                          "poensie", "cracker", "cracker", "cracker", "glajoen", "blacktail", 
                                          "steenie", "Musselcracker white", "Steenbras"), species_2_target = c("Steenbras", 
                                                                                                               "Steambrass", NA, "Elf", "Galjoen", "Elf", "Black musselcracker", 
                                                                                                               "Stumpnose", "Bluefish", NA, "Blacktail", "Steenie", "Kob", "Poensie", 
                                                                                                               NA, NA, "Steenies", NA, "cracker", "galjoen", NA, NA, NA, "cracker", 
                                                                                                               "blacktail", "Black musselcracker", "Galjoen"), species_3_target = c("Octopus", 
                                                                                                                                                                                    "Black tail", NA, "Steenbrass", NA, "Kob", "Kob", "Sandshark", 
                                                                                                                                                                                    "Steen ras", NA, NA, "Kob", NA, "Kob", NA, NA, "Sandsharks", 
                                                                                                                                                                                    NA, "gully shark", "gully shark", NA, NA, NA, NA, NA, "Kob", 
                                                                                                                                                                                    NA)), class = "data.frame", row.names = 3:29) 

df %>%  
  as_tibble() %>%  
  mutate(across(everything(), ~ str_to_lower(.x)),
         across(everything(), 
                ~ case_when(str_detect(.x, "black") ~ "blacktail",
                            str_detect(.x, "steen") ~ "steenbras",
                            TRUE ~ .x) ),
         across(everything(), ~ str_to_title(.x)))
#> # A tibble: 27 x 3
#>    species_1_target    species_2_target species_3_target
#>    <chr>               <chr>            <chr>           
#>  1 Blacktail           Steenbras        Octopus         
#>  2 Craker              Steambrass       Blacktail       
#>  3 Blacktail           <NA>             <NA>            
#>  4 Musselcracker       Elf              Steenbras       
#>  5 Musselcracker       Galjoen          <NA>            
#>  6 Garick              Elf              Kob             
#>  7 White Musselcracker Blacktail        Kob             
#>  8 Blacktail           Stumpnose        Sandshark       
#>  9 Kob                 Bluefish         Steenbras       
#> 10 Any                 <NA>             <NA>            
#> # ... with 17 more rows

^{Created on 2022-06-16 by the reprex package (v2.0.1)}