New identifier column to dataframe based on whether string contains said identifier

December 15, 2022

I am an absolute novice to R. What I would like to achieve is to have an identifier added to each dataframe row based on whether a string value in the same row contains that identifier.

Assume dataframe:

df <- data.frame(Code = c("DE8230", "18FR16", "2UK34", "45BE87C", "1894DE56", "AB12FR", "ES12456"),
                 Type = c("A", "B", "C", "C", "E", "A", "C"),
                 Value = c(12, 14, 8, 20, 21, 16, 5))


      Code Type Value
1   DE8230    A    12
2   18FR16    B    14
3    2UK34    C     8
4  45BE87C    C    20
5 1894DE56    E    21
6   AB12FR    A    16
7  ES12456    C     5

I want to add a country column based on whether an identifier (e.g. DE, FR, UK, BE, ES) is present in the column ‘Code’, and then list that identifier as the country.

What I tried:

library(dplyr)

identifiers <- c("DE", "FR", "UK") # identifiers of choice

df <- mutate(df, country = NA_character_) # no match yet

for (i in seq_along(identifiers)) {
  df <- mutate(df,
               country = ifelse(grepl(identifiers[i], Code), identifiers[i], country))
}

Which yields (showing only the rows that matched an identifier):

      Code Type Value country
1   DE8230    A    12      DE
2   18FR16    B    14      FR
3    2UK34    C     8      UK
4 1894DE56    E    21      DE
5   AB12FR    A    16      FR

Although this works, I think there must be a much more elegant solution that omits the for loop and uses a single dplyr statement. However, I have not been able to figure it out.

N.b.: It is important that the identifiers are listed in a separate vector or list and are not part of the mutate statement. This is just a hypothetical example; the real datasets and the number of identifiers are much larger.

Solution:

We can use str_extract: paste the identifiers into a single regular expression with | as the separator and extract the matching substring from ‘Code’. Rows without a match get NA and are then dropped with drop_na.

library(dplyr)
library(stringr)
library(tidyr) # provides drop_na()

df %>%
  mutate(country = str_extract(Code, str_c(identifiers, collapse = "|"))) %>%
  drop_na(country)

Output:

      Code Type Value country
1   DE8230    A    12      DE
2   18FR16    B    14      FR
3    2UK34    C     8      UK
4 1894DE56    E    21      DE
5   AB12FR    A    16      FR
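
To see why this works, it may help to look at the intermediate pieces on their own. The sketch below reuses the identifiers and codes from the question; the variable names `codes`, `pattern` and `country` are only illustrative. It assumes the identifiers contain no regular-expression metacharacters, since they are used directly as a pattern.

```r
library(stringr)

identifiers <- c("DE", "FR", "UK")
codes <- c("DE8230", "18FR16", "2UK34", "45BE87C", "1894DE56", "AB12FR", "ES12456")

# Collapse the identifiers into one regular expression using alternation
pattern <- str_c(identifiers, collapse = "|")
pattern
# "DE|FR|UK"

# str_extract() returns the first match in each string, or NA if nothing matches
country <- str_extract(codes, pattern)
country
# "DE" "FR" "UK" NA "DE" "FR" NA
```

Because str_extract is vectorised over the codes, no loop is needed, and the identifiers stay in their own vector as requested. Note that only the first match per code is kept, so a code containing two identifiers would get whichever one appears first in the string.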