Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

In R, how can I validate whether values are valid according to an external table?

I have a reference table of acceptable column values, represented as good_df below. I want to use it to find out which values in the previously unseen test_df are not among the list of acceptable ones.

library(tidyverse)
good_df <- tribble(
  ~column, ~value,
  "col1", "a",
  "col1", "b",
  "col1", "c",
  "col1", "d",
  "col1", "e",
  "col2", "A",
  "col2", "B",
  "col2", "C"
)

set.seed(1)
test_df <- tibble(
  col1 = sample(letters[1:3], 4, T),
  col2 = c(sample(LETTERS[1:3], 3, T), "D"),  # The D is invalid.
  col3 = rnorm(4)
)
test_df
#> # A tibble: 4 Ă— 3
#>   col1  col2    col3
#>   <chr> <chr>  <dbl>
#> 1 a     A      0.330
#> 2 c     C     -0.820
#> 3 a     C      0.487
#> 4 b     D      0.738

Created on 2022-06-09 by the reprex package (v2.0.1)

My thought is to use pivot_longer to convert test_df to a format that matches good_df, and then use setdiff to see what remains after the valid rows are removed, which must be the invalid ones. Something along the lines of this below, producing expected_output.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

long_test_df <- pivot_longer(test_df, "MAGIC HAPPENS HERE") 

long_test_df %>% 
  select(column, value) %>% 
  setdiff(good_df)

expected_output <- tribble(
  ~column, ~value,
  "col2", "D",
)

I can’t seem to get pivot_longer to work as I’m intending it. Maybe it’s not designed to? I also thought of using the validate package, but I don’t see anything in the documentation about explicitly listing allowed or forbidden values in creating the rules, unless I missed something.

Also of course, in practice I have many more acceptable column/value pairs and many more columns in my data sets to validate.

I am open to alternatives to my approach and to alternative forms of storing the set of allowed column values. How may I achieve this? (Bonus points for pretty pipes and fewer named variables!)

Edit – Solution:

A modified version of Jonathan’s answer gives the fully general solution. (Also, I still prefer setdiff to anti_join here, although that’s just a personal preference.)

test_df %>% 
  pivot_longer(cols = any_of(good_df$column), names_to = "column", values_to = "value") %>% 
  select(column, value) %>% 
  anti_join(good_df)

>Solution :

Here is the code for pivot_longer, the missing values can be obtained with anti_join:

library(tidyverse)

test_df %>% 
  pivot_longer(c(col1, col2), names_to = "column", values_to = "value") %>% 
  anti_join(good_df)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading