I have a dataset that looks like this:
> dput(df)
structure(list(Person.ID = c(123L, 234L), Date = c("10/10/09",
"11/11/03"), Text = c("Here are some random words that I do not want. The person was allowed to cool to a core body temperature of 16.5 degrees centigrade. Here are some other random words I do not want.",
"Here are some random words that I do not want. A cooling mechanism was applied to cool the patient to a core body temperature of 19.1 degrees centigrade. Here are some other random words I do not want."
)), class = "data.frame", row.names = c(NA, -2L))
For each person, I would like to extract the sentence that mentions their body temperature (and get rid of the unwanted words). I would also like a separate column that ONLY mentions the temperature. The desired output should look like:
> dput(df2)
structure(list(Person.ID = c(123L, 234L), Date = c("10/10/09",
"11/11/03"), Text = c("The person was allowed to cool to a core body temperature of 16.5 degrees centigrade.",
"A cooling mechanism was applied to cool the patient to a core body temperature of 19.1 degrees centigrade. "
), Value = c(16.5, 19.1)), class = "data.frame", row.names = c(NA,
-2L))
Solution:
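Here is one option: split Text into one row per sentence, keep only the rows that mention "temperature", and extract the number from each remaining sentence.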
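library(stringr)
library(tidyr)   # separate_longer_delim() needs tidyr >= 1.3.0
library(dplyr)
df %>%
  # one row per sentence: split on whitespace that follows a period
  separate_longer_delim(Text, delim = regex("(?<=\\.)\\s+")) %>%
  # keep only the sentences that mention the temperature
  filter(str_detect(Text, "temperature")) %>%
  # first integer or decimal number in the sentence, e.g. 16, 16.5 or 16.55
  mutate(Value = as.numeric(str_extract(Text, "\\d+(?:\\.\\d+)?")))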
Output:
Person.ID Date
1 123 10/10/09
2 234 11/11/03
Text Value
1 The person was allowed to cool to a core body temperature of 16.5 degrees centigrade. 16.5
2 A cooling mechanism was applied to cool the patient to a core body temperature of 19.1 degrees centigrade. 19.1
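For comparison, the same three steps (split into sentences, keep the sentence that mentions the temperature, extract the number) can be sketched in base R with no packages. This is only a sketch under some assumptions that are not stated in the question: sentences end with a period followed by whitespace, only the first sentence containing "temperature" is wanted per person, and the temperature is the first number in that sentence. The helper name keep_temp_sentence is made up for illustration.

```r
# Sample data copied from the question
df <- data.frame(
  Person.ID = c(123L, 234L),
  Date = c("10/10/09", "11/11/03"),
  Text = c(
    "Here are some random words that I do not want. The person was allowed to cool to a core body temperature of 16.5 degrees centigrade. Here are some other random words I do not want.",
    "Here are some random words that I do not want. A cooling mechanism was applied to cool the patient to a core body temperature of 19.1 degrees centigrade. Here are some other random words I do not want."
  )
)

# Split each text into sentences: break on whitespace that follows a period.
# "16.5" is left intact because its period is not followed by whitespace.
sentences <- strsplit(df$Text, "(?<=\\.)\\s+", perl = TRUE)

# Hypothetical helper: first sentence mentioning "temperature", or NA if none
keep_temp_sentence <- function(s) {
  hit <- s[grepl("temperature", s, fixed = TRUE)]
  if (length(hit) > 0) hit[1] else NA_character_
}
df$Text <- vapply(sentences, keep_temp_sentence, character(1))

# Extract the first integer or decimal number from each kept sentence
m <- regexpr("\\d+(\\.\\d+)?", df$Text, perl = TRUE)
ok <- !is.na(m) & m > 0
df$Value <- NA_real_
df$Value[ok] <- as.numeric(regmatches(df$Text, m))

df
```

Unlike the row-splitting approach, this version always returns exactly one row per person, with NA in Text and Value when no sentence mentions the temperature. If a sentence could contain other numbers before the temperature, the pattern would need to be anchored, for example on the words "temperature of".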