The data I am using are based on self completed, mannually entered text responses to a questionnaire.
The problem is, especially with regards to fish species, people abbreviate, call them by different names, spell things wrong, etc.
How do I take all of the related names in a series of three columns and make them into one unified name so that I can perform analysis on them?
structure(list(species_1_target = c("Blacktail", "Craker", "Black tail",
"Musselcracker", "Musselcracker", "Garick", "White musselcracker",
"Blacktails", "Kob", "Any", "White cracker", "Musslecrack", "Galjoen",
"Cracker", "Blacktail", "White Musselcracker", "Blacktail", "Anything",
"poensie", "cracker", "cracker", "cracker", "glajoen", "blacktail",
"steenie", "Musselcracker white", "Steenbras"), species_2_target = c("Steenbras",
"Steambrass", NA, "Elf", "Galjoen", "Elf", "Black musselcracker",
"Stumpnose", "Bluefish", NA, "Blacktail", "Steenie", "Kob", "Poensie",
NA, NA, "Steenies", NA, "cracker", "galjoen", NA, NA, NA, "cracker",
"blacktail", "Black musselcracker", "Galjoen"), species_3_target = c("Octopus",
"Black tail", NA, "Steenbrass", NA, "Kob", "Kob", "Sandshark",
"Steen ras", NA, NA, "Kob", NA, "Kob", NA, NA, "Sandsharks",
NA, "gully shark", "gully shark", NA, NA, NA, NA, NA, "Kob",
NA)), class = "data.frame", row.names = 3:29)
For example: "black tail", "blacktail", "Black Tail", "Blacktail" etc. could all be renamed "blacktail"
"Steenies", "steenbras", "steenie", "steambras" -> "Steenbras"
Is there a way to do this using partial string matching to save time/effort?
I know how to manually rename each one, however, there are over 150 rows in the full data and having to find and rename each unique variation can be tedious to say the least.
Additionally, data entries are ongoing so the list of inaccurately entered species names will likely continue to grow.
>Solution :
Probably not optimal, but it works.
library(tidyverse)
df <- structure(list(species_1_target = c("Blacktail", "Craker", "Black tail",
"Musselcracker", "Musselcracker", "Garick", "White musselcracker",
"Blacktails", "Kob", "Any", "White cracker", "Musslecrack", "Galjoen",
"Cracker", "Blacktail", "White Musselcracker", "Blacktail", "Anything",
"poensie", "cracker", "cracker", "cracker", "glajoen", "blacktail",
"steenie", "Musselcracker white", "Steenbras"), species_2_target = c("Steenbras",
"Steambrass", NA, "Elf", "Galjoen", "Elf", "Black musselcracker",
"Stumpnose", "Bluefish", NA, "Blacktail", "Steenie", "Kob", "Poensie",
NA, NA, "Steenies", NA, "cracker", "galjoen", NA, NA, NA, "cracker",
"blacktail", "Black musselcracker", "Galjoen"), species_3_target = c("Octopus",
"Black tail", NA, "Steenbrass", NA, "Kob", "Kob", "Sandshark",
"Steen ras", NA, NA, "Kob", NA, "Kob", NA, NA, "Sandsharks",
NA, "gully shark", "gully shark", NA, NA, NA, NA, NA, "Kob",
NA)), class = "data.frame", row.names = 3:29)
df %>%
as_tibble() %>%
mutate(across(everything(), ~ str_to_lower(.x)),
across(everything(),
~ case_when(str_detect(.x, "black") ~ "blacktail",
str_detect(.x, "steen") ~ "steenbras",
TRUE ~ .x) ),
across(everything(), ~ str_to_title(.x)))
#> # A tibble: 27 x 3
#> species_1_target species_2_target species_3_target
#> <chr> <chr> <chr>
#> 1 Blacktail Steenbras Octopus
#> 2 Craker Steambrass Blacktail
#> 3 Blacktail <NA> <NA>
#> 4 Musselcracker Elf Steenbras
#> 5 Musselcracker Galjoen <NA>
#> 6 Garick Elf Kob
#> 7 White Musselcracker Blacktail Kob
#> 8 Blacktail Stumpnose Sandshark
#> 9 Kob Bluefish Steenbras
#> 10 Any <NA> <NA>
#> # ... with 17 more rows
Created on 2022-06-16 by the reprex package (v2.0.1)