Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Conditionally replace first character (or first two) of a column with value from that row's other column (R)

I have a dataframe such as the following:

ID country region

1 32         32001
2 32          1001
3 68         68001
4 214        19017
5 214       214017

All variables are character, despite being integers.

I am cleaning the data and need all of the region variable to be the same. However, due to a clerical error in the dataset, some observations randomly switch from beginning the region code with the country code as the first two digits to using ascending numbers 1-19 (Countries 1-19). I therefore want to replace the first digit of region for all these incorrectly-coded cases (where region > 1000 & < 10000) with that row’s country code, for regions starting with numbers 1-9. Then, for all incorrectly-coded regions (where region > 10000 & < 20000) starting 10-19, I want to replace the first two digits of all these observations with that row’s country code. All other observations would be left the same. The desired output would be the following:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

ID Country Region

1 32         32001
2 32         32001
3 68         68001
4 214       214017
5 214       214017

The maximum value of these incorrectly coded regions is 19027 and the minimum value of the correctly coded regions is 32001, so there would be no overlap to do this algorithmically. The only point would be to make sure that regions with a value greater than 9999 have their first two digits replaced, not just the first digit. Is there a way to do this succinctly, rather than manually recoding these all?

>Solution :

This is a one-liner in base R:

transform(df, region = paste0(country, substr(region, nchar(region) - 2, 9)))
#>   ID country region
#> 1  1      32  32001
#> 2  2      32  32001
#> 3  3      68  68001
#> 4  4     214 214017
#> 5  5     214 214017

Reproducible data

df <- structure(list(ID = c("1", "2", "3", "4", "5"), country = c("32", 
"32", "68", "214", "214"), region = c("32001", "1001", "68001", 
"19017", "214017")), row.names = c(NA, -5L), class = "data.frame")

df
#>   ID country region
#> 1  1      32  32001
#> 2  2      32   1001
#> 3  3      68  68001
#> 4  4     214  19017
#> 5  5     214 214017

Created on 2023-03-18 with reprex v2.0.2

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading