Conditionally replace first character (or first two) of a column with value from that row's other column (R)

Advertisements

I have a dataframe such as the following:

ID country region

1 32         32001
2 32          1001
3 68         68001
4 214        19017
5 214       214017

All variables are character, despite being integers.

I am cleaning the data and need all of the region variable to be the same. However, due to a clerical error in the dataset, some observations randomly switch from beginning the region code with the country code as the first two digits to using ascending numbers 1-19 (Countries 1-19). I therefore want to replace the first digit of region for all these incorrectly-coded cases (where region > 1000 & < 10000) with that row’s country code, for regions starting with numbers 1-9. Then, for all incorrectly-coded regions (where region > 10000 & < 20000) starting 10-19, I want to replace the first two digits of all these observations with that row’s country code. All other observations would be left the same. The desired output would be the following:

ID Country Region

1 32         32001
2 32         32001
3 68         68001
4 214       214017
5 214       214017

The maximum value of these incorrectly coded regions is 19027 and the minimum value of the correctly coded regions is 32001, so there would be no overlap to do this algorithmically. The only point would be to make sure that regions with a value greater than 9999 have their first two digits replaced, not just the first digit. Is there a way to do this succinctly, rather than manually recoding these all?

>Solution :

This is a one-liner in base R:

transform(df, region = paste0(country, substr(region, nchar(region) - 2, 9)))
#>   ID country region
#> 1  1      32  32001
#> 2  2      32  32001
#> 3  3      68  68001
#> 4  4     214 214017
#> 5  5     214 214017

Reproducible data

df <- structure(list(ID = c("1", "2", "3", "4", "5"), country = c("32", 
"32", "68", "214", "214"), region = c("32001", "1001", "68001", 
"19017", "214017")), row.names = c(NA, -5L), class = "data.frame")

df
#>   ID country region
#> 1  1      32  32001
#> 2  2      32   1001
#> 3  3      68  68001
#> 4  4     214  19017
#> 5  5     214 214017

Created on 2023-03-18 with reprex v2.0.2

Leave a Reply Cancel reply