I have a dataframe such as the following:
ID country region
1 32 32001
2 32 1001
3 68 68001
4 214 19017
5 214 214017
All variables are character, despite being integers.
I am cleaning the data and need all of the region
variable to be the same. However, due to a clerical error in the dataset, some observations randomly switch from beginning the region
code with the country
code as the first two digits to using ascending numbers 1-19 (Countries 1-19). I therefore want to replace the first digit of region
for all these incorrectly-coded cases (where region
> 1000 & < 10000) with that row’s country code, for regions starting with numbers 1-9. Then, for all incorrectly-coded regions (where region
> 10000 & < 20000) starting 10-19, I want to replace the first two digits of all these observations with that row’s country code. All other observations would be left the same. The desired output would be the following:
ID Country Region
1 32 32001
2 32 32001
3 68 68001
4 214 214017
5 214 214017
The maximum value of these incorrectly coded regions is 19027 and the minimum value of the correctly coded regions is 32001, so there would be no overlap to do this algorithmically. The only point would be to make sure that regions with a value greater than 9999 have their first two digits replaced, not just the first digit. Is there a way to do this succinctly, rather than manually recoding these all?
>Solution :
This is a one-liner in base R:
transform(df, region = paste0(country, substr(region, nchar(region) - 2, 9)))
#> ID country region
#> 1 1 32 32001
#> 2 2 32 32001
#> 3 3 68 68001
#> 4 4 214 214017
#> 5 5 214 214017
Reproducible data
df <- structure(list(ID = c("1", "2", "3", "4", "5"), country = c("32",
"32", "68", "214", "214"), region = c("32001", "1001", "68001",
"19017", "214017")), row.names = c(NA, -5L), class = "data.frame")
df
#> ID country region
#> 1 1 32 32001
#> 2 2 32 1001
#> 3 3 68 68001
#> 4 4 214 19017
#> 5 5 214 214017
Created on 2023-03-18 with reprex v2.0.2