I want to extract the two-digit number that comes before the word "years" and the seven- or eight-digit number that comes after it. An example of the data is shown below.
data <- "mr john is 45 years old his number is 12345678, mr doe is 57 years 7654321"
data <- as.list(data)
I tried this approach and successfully extracted the two-digit numbers before the word "years":
stringr::str_extract_all(data,regex(".\\d{2}\\s(?:year)"))
I also tried this method to get the number after the word "years":
str_extract_all(data,regex(".\\d{2}\\s(?:year).\\d{7,8}"))
I managed to get the number that appears directly after the word "years":
" 57 year 7654321"
However, I could not extract the eight-digit number when other words or characters sit between it and the word "years".
How can I retrieve only the number after the word "years", skipping whatever words or characters come in between?
I really appreciate your help
>Solution :
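We can use str_replace_all to replace the run of non-digit characters immediately before and immediately after "years" with a single space, and then use str_extract_all to pull out the two-digit number, the word "years", and the seven- or eight-digit number as one string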
library(stringr)
str_extract_all(str_replace_all(data,
"(?<=years)\\D+|(\\D+)(?=years)", " "), "\\d{2}\\s+years\\s+\\d{7,8}")[[1]]
[1] "45 years 12345678" "57 years 7654321"
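To make the two steps easier to follow, here is the same idea split apart so the intermediate string is visible. This is only a sketch on the sample data from the question; the names cleaned and result are mine, and data is used as a plain character string rather than a list.

```r
library(stringr)

data <- "mr john is 45 years old his number is 12345678, mr doe is 57 years 7654321"

# Step 1: replace the non-digit run right before "years" and the
# non-digit run right after "years" with a single space.
cleaned <- str_replace_all(data, "(?<=years)\\D+|(\\D+)(?=years)", " ")
cleaned
# [1] "mr john is 45 years 12345678, mr doe is 57 years 7654321"

# Step 2: extract "<2 digits> years <7 or 8 digits>" from the cleaned string.
result <- str_extract_all(cleaned, "\\d{2}\\s+years\\s+\\d{7,8}")[[1]]
result
# [1] "45 years 12345678" "57 years 7654321"
```

Note that the text between the first phone number and the second age (", mr doe is ") is left alone, because it does not touch "years" on either side; the extraction pattern simply skips over it.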
Another option is to capture the digits and the "years" substring as groups with str_match_all, and then paste the captured groups together with str_c
library(purrr)
library(dplyr)
str_match_all(data, "(\\d{2})\\D+(years)\\D+(\\d{7,8})")[[1]][,-1] %>%
as.data.frame %>%
invoke(str_c, sep =" ", .)
[1] "45 years 12345678" "57 years 7654321"
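If you would rather keep the age and the number as separate values, the capture groups can be read straight from the match matrix and combined with base paste. This is a hedged variation rather than a replacement for the code above: the names m, ages, and numbers are hypothetical, and it assumes data is a plain character string. It also avoids purrr::invoke, which recent purrr versions mark as deprecated.

```r
library(stringr)

data <- "mr john is 45 years old his number is 12345678, mr doe is 57 years 7654321"

# str_match_all returns a list with one matrix per input string.
# Column 1 is the full match; columns 2 and 3 are the capture groups.
m <- str_match_all(data, "(\\d{2})\\D+years\\D+(\\d{7,8})")[[1]]

ages    <- m[, 2]  # two-digit number before "years"
numbers <- m[, 3]  # seven- or eight-digit number after "years"

# Rebuild the combined strings, or use ages and numbers on their own.
paste(ages, "years", numbers)
# [1] "45 years 12345678" "57 years 7654321"
```

Here \\D+ after "years" is what skips the words in between ("old his number is"), since it consumes any run of non-digits up to the next digit.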
data
data <- "mr john is 45 years old his number is 12345678, mr doe is 57 years 7654321"