I have a reference table of acceptable column values, represented as good_df below. I want to use it to find out which values in the previously unseen test_df are not among the list of acceptable ones.
library(tidyverse)
good_df <- tribble(
~column, ~value,
"col1", "a",
"col1", "b",
"col1", "c",
"col1", "d",
"col1", "e",
"col2", "A",
"col2", "B",
"col2", "C"
)
set.seed(1)
test_df <- tibble(
col1 = sample(letters[1:3], 4, T),
col2 = c(sample(LETTERS[1:3], 3, T), "D"), # The D is invalid.
col3 = rnorm(4)
)
test_df
#> # A tibble: 4 Ă— 3
#> col1 col2 col3
#> <chr> <chr> <dbl>
#> 1 a A 0.330
#> 2 c C -0.820
#> 3 a C 0.487
#> 4 b D 0.738
Created on 2022-06-09 by the reprex package (v2.0.1)
My thought is to use pivot_longer to convert test_df to a format that matches good_df, and then use setdiff to see what remains after the valid rows are removed, which must be the invalid ones. Something along the lines of this below, producing expected_output.
long_test_df <- pivot_longer(test_df, "MAGIC HAPPENS HERE")
long_test_df %>%
select(column, value) %>%
setdiff(good_df)
expected_output <- tribble(
~column, ~value,
"col2", "D",
)
I can’t seem to get pivot_longer to work as I’m intending it. Maybe it’s not designed to? I also thought of using the validate package, but I don’t see anything in the documentation about explicitly listing allowed or forbidden values in creating the rules, unless I missed something.
Also of course, in practice I have many more acceptable column/value pairs and many more columns in my data sets to validate.
I am open to alternatives to my approach and to alternative forms of storing the set of allowed column values. How may I achieve this? (Bonus points for pretty pipes and fewer named variables!)
Edit – Solution:
A modified version of Jonathan’s answer gives the fully general solution. (Also, I still prefer setdiff to anti_join here, although that’s just a personal preference.)
test_df %>%
pivot_longer(cols = any_of(good_df$column), names_to = "column", values_to = "value") %>%
select(column, value) %>%
anti_join(good_df)
>Solution :
Here is the code for pivot_longer, the missing values can be obtained with anti_join:
library(tidyverse)
test_df %>%
pivot_longer(c(col1, col2), names_to = "column", values_to = "value") %>%
anti_join(good_df)