I have a dataframe with speech data, like this:
df <- data.frame(
  id = 1:12,
  partcl = c("yeah yeah yeah absolutely", "well you know it 's", "oh well yeah that's right",
             "yeah I mean well oh", "well erm well Peter will be there", "well yeah well",
             "yes yes yes totally", "yeah yeah yeah yeah", "well well I did n't do it",
             "er well yeah that 's true", "oh hey where 's he gone?", "er"
  ))
and a vector of keywords called parts:
parts <- c("yeah", "oh", "no", "well", "mm", "yes", "so", "right", "er", "like")
What I need to do is keep only the rows that contain at least two distinct parts values. What I can do so far is keep the rows that contain at least two parts matches, regardless of whether they’re distinct or the same:
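library(dplyr)
library(stringr)  # str_count() comes from stringr, not dplyr

df %>%
  filter(
    # count whole-word keyword matches per row; repeats of the same keyword are counted too
    str_count(partcl, paste0("\\b(", paste0(parts, collapse = "|"), ")\\b")) > 1
  )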
  id                            partcl
1  1         yeah yeah yeah absolutely
2  3         oh well yeah that's right
3  4               yeah I mean well oh
4  5 well erm well Peter will be there
5  6                    well yeah well
6  7               yes yes yes totally
7  8               yeah yeah yeah yeah
8  9         well well I did n't do it
9 10         er well yeah that 's true
How can I require that the matched parts be distinct, so that the result is this:
  id                    partcl
1  3 oh well yeah that's right
2  4       yeah I mean well oh
3  6            well yeah well
4 10 er well yeah that 's true
Solution:
This may help: extract the keywords with str_extract_all, then use n_distinct to keep only the rows that contain more than one unique keyword.
library(dplyr)
library(stringr)
library(purrr)

df %>%
  filter(map_lgl(str_extract_all(partcl,
                                 paste0("\\b(", paste0(parts, collapse = "|"), ")\\b")),
                 ~ n_distinct(.x) > 1))
Output:
  id                    partcl
1  3 oh well yeah that's right
2  4       yeah I mean well oh
3  6            well yeah well
4 10 er well yeah that 's true
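
If you would rather avoid the extra packages, the same idea (extract every keyword match per row, then count the unique ones) can probably be written in base R as well. The following is only a sketch of that approach, not part of the answer above: the names pattern, hits, n_unique and result are made up for illustration, and it assumes whole-word matching with perl = TRUE behaves the same way as the stringr pattern on this data.

```r
# Same data and keyword vector as in the question
df <- data.frame(
  id = 1:12,
  partcl = c("yeah yeah yeah absolutely", "well you know it 's", "oh well yeah that's right",
             "yeah I mean well oh", "well erm well Peter will be there", "well yeah well",
             "yes yes yes totally", "yeah yeah yeah yeah", "well well I did n't do it",
             "er well yeah that 's true", "oh hey where 's he gone?", "er")
)
parts <- c("yeah", "oh", "no", "well", "mm", "yes", "so", "right", "er", "like")

# One alternation pattern that matches any keyword as a whole word
pattern <- paste0("\\b(", paste(parts, collapse = "|"), ")\\b")

# List with one character vector of matches per row (character(0) if none)
hits <- regmatches(df$partcl, gregexpr(pattern, df$partcl, perl = TRUE))

# Number of distinct keywords found in each row
n_unique <- lengths(lapply(hits, unique))

# Keep rows with at least two distinct keywords
result <- df[n_unique > 1, ]
print(result$id)
```

Unlike dplyr's filter, base subsetting keeps the original row names, so the printed data frame shows 3, 4, 6 and 10 in the left margin instead of 1 to 4.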