New identifier column to dataframe based on whether string contains said identifier

I am an absolute novice to R. What I would like to achieve is to have an identifier added to each dataframe row based on whether a string value in the same row contains that identifier.

Assume dataframe:

df <- data.frame(Code = c("DE8230", "18FR16", "2UK34", "45BE87C", "1894DE56", "AB12FR", "ES12456"),
                 Type = c("A", "B", "C", "C", "E", "A", "C"),
                 Value = c(12, 14, 8, 20, 21, 16, 5))


      Code Type Value
1   DE8230    A    12
2   18FR16    B    14
3    2UK34    C     8
4  45BE87C    C    20
5 1894DE56    E    21
6   AB12FR    A    16
7  ES12456    C     5

I want to add a country column that lists the identifier (e.g. DE, FR, UK, BE, ES) found in the column ‘Code’.

What I tried:

library(dplyr)

identifiers <- c("DE", "FR", "UK") # identifiers of choice

df <- mutate(df, country = 0)

# overwrite country with each identifier found in Code, one identifier at a time
for (i in seq_along(identifiers)) {
  df <- mutate(df,
          country = ifelse(grepl(identifiers[i], Code), identifiers[i], country)
  )
}

Which yields (showing only the rows where an identifier matched; the unmatched rows keep country = 0):

      Code Type Value country
1   DE8230    A    12      DE
2   18FR16    B    14      FR
3    2UK34    C     8      UK
5 1894DE56    E    21      DE
6   AB12FR    A    16      FR

Although this works, I think there must be a much more elegant solution that omits the for loop and uses a single dplyr statement. However, I have not been able to figure it out.

N.b.: It is important that the identifiers are listed in a separate vector or list and are not hard-coded in the mutate statement. This is just a hypothetical example; the real datasets and the number of identifiers are much larger.

Solution:

We can use str_extract: paste the identifiers into a single pattern with | as the separator and extract the first matching substring from ‘Code’. Rows without a match get NA, which drop_na then removes.

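library(dplyr)
library(stringr)
library(tidyr)  # drop_na() comes from tidyr

df %>%
  # str_c() builds the pattern "DE|FR|UK"; str_extract() returns the first match or NA
  mutate(country = str_extract(Code, str_c(identifiers, collapse = "|"))) %>%
  # remove the rows in which no identifier was found
  drop_na(country)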

Output:

      Code Type Value country
1   DE8230    A    12      DE
2   18FR16    B    14      FR
3    2UK34    C     8      UK
4 1894DE56    E    21      DE
5   AB12FR    A    16      FR
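
If you would rather not depend on stringr and tidyr, the same alternation idea can be sketched in base R. This is only a minimal sketch, not the method used above: it swaps in regexpr and regmatches for str_extract, and the names pattern, m, matched and result are made up for illustration. It also assumes the identifiers contain no regex metacharacters.

```r
# Re-create the example data from the question
df <- data.frame(
  Code  = c("DE8230", "18FR16", "2UK34", "45BE87C", "1894DE56", "AB12FR", "ES12456"),
  Type  = c("A", "B", "C", "C", "E", "A", "C"),
  Value = c(12, 14, 8, 20, 21, 16, 5)
)
identifiers <- c("DE", "FR", "UK")

# Collapse the identifiers into one alternation pattern: "DE|FR|UK"
# (assumption: no identifier contains regex metacharacters such as "." or "+")
pattern <- paste(identifiers, collapse = "|")

# regexpr() returns the position of the first match per element, or -1 if none
m <- regexpr(pattern, df$Code)
matched <- as.vector(m) != -1

# Start with NA everywhere, then fill in only the rows that matched;
# regmatches() returns one substring per matching element, in order
df$country <- NA_character_
df$country[matched] <- regmatches(df$Code, m)

# Keep only the rows in which an identifier was found
result <- df[matched, ]
print(result)
```

As with str_extract, only the first match in each string is used, so a code containing two identifiers gets whichever one appears first in the string.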