Compare Similar Columns in Data Frames, Replace Differences with NA

May 28, 2022

I’m attempting to write a function that compares the factor columns with the same column names in two data frames. The function should identify values in columns c1 and c2 of data frame d2 that do not occur in columns c1 and c2 of data frame d1, respectively, and replace them with NA. It fails to return the correct result, which should be NA in d2 columns c1 and c2 for the z and zz values, respectively.

```
c1 <- c("A", "B", "C", "D", "E")
c2 <- c("AA", "BB", "CC", "DD", "EE")
d1 <- data.frame(c1, c2)

c1 <- c("z", "B", "C", "D", "E")
c2 <- c("AA", "zz", "CC", "DD", "EE")
d2 <- data.frame(c1, c2)

v       <- colnames(d1)
replace <- NA
x       <- d2[v]

repFact = function(x, d1, replace){
  x1 <- unique(d1[,v])
  y  <- x
  id <- which(!(y %in% x1))
  x[id, v] <- NA
  x
  return(x)
}
d2[v] <- lapply(d2[v], repFact, d1[v], replace)
```

I’m using this R code to prepare prediction data and am attempting to remove unseen factor levels in d2, replacing them with NA or a seen factor level so that the prediction function (caret) does not fail.

Any ideas are appreciated; however, I’d like to retain the use of the which and lapply functions if possible.

>Solution :

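We may use Map instead of lapply to replace the values in each ‘d2’ column based on the corresponding ‘d1’ column. lapply walks over the columns of ‘d2’ only, so every call receives the whole of d1[v] rather than the matching column; Map iterates over the columns of ‘d1’ and ‘d2’ in parallel. The repFact function is modified as well: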

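# x: reference column (from 'd1'); y: column to clean (from 'd2').
# Values of y that do not occur in x are replaced with replaceVal.
repFact <- function(x, y, replaceVal)
{
  replace(y, y %in% setdiff(y, x), replaceVal)
}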

Testing:

d2[v] <-  Map(repFact, d1[v], d2[v], MoreArgs = list(replaceVal = NA))
> d2
    c1   c2
1 <NA>   AA
2    B <NA>
3    C   CC
4    D   DD
5    E   EE
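
If the which and lapply from the question should be kept, one option is to let lapply iterate over the column names rather than over the columns of ‘d2’, so that each call can look up the matching ‘d1’ column itself. The following is only a sketch under that assumption; maskUnseen is a hypothetical helper name, not part of the original code, and the data is the sample from the question.

```r
# Sample data from the question.
d1 <- data.frame(c1 = c("A", "B", "C", "D", "E"),
                 c2 = c("AA", "BB", "CC", "DD", "EE"))
d2 <- data.frame(c1 = c("z", "B", "C", "D", "E"),
                 c2 = c("AA", "zz", "CC", "DD", "EE"))
v <- colnames(d1)

# Hypothetical helper: mask values of `new` that never occur in `seen`.
maskUnseen <- function(new, seen, replaceVal = NA) {
  id <- which(!(new %in% unique(seen)))  # positions of unseen values
  new[id] <- replaceVal
  new
}

# Iterate over column names so each d2 column is compared with the
# d1 column of the same name.
d2[v] <- lapply(v, function(nm) maskUnseen(d2[[nm]], d1[[nm]]))
d2
```

Because the anonymous function receives a column name, the same name indexes both data frames, which is what the original lapply call could not do.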

In addition, we can use the tidyverse: mutate across the columns specified in v for ‘d2’ and apply repFact with the corresponding ‘d1’ column as input (cur_column() returns the name of the current column).

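library(dplyr)
# repFact(x, y, ...) replaces values in y, so the 'd1' column is passed
# first (reference) and the 'd2' column (.x) second (column to clean).
d2 <- d2 %>%
   mutate(across(all_of(v),  ~ repFact(d1[[cur_column()]], .x, NA)))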
d2
    c1   c2
1 <NA>   AA
2    B <NA>
3    C   CC
4    D   DD
5    E   EE
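
One caveat for the factor case mentioned in the question: when the columns are factors, replacing a value with NA leaves the unseen level (for example z) in levels(). If the levels themselves need to match ‘d1’, a possible approach is to rebuild each ‘d2’ column on the levels of the matching ‘d1’ column. This is a sketch only; alignLevels is a hypothetical helper name, and stringsAsFactors = TRUE is assumed so that the sample data holds factors.

```r
# Factor versions of the question's data (stringsAsFactors = TRUE is an
# assumption; R >= 4.0 creates character columns by default).
d1 <- data.frame(c1 = c("A", "B", "C", "D", "E"),
                 c2 = c("AA", "BB", "CC", "DD", "EE"),
                 stringsAsFactors = TRUE)
d2 <- data.frame(c1 = c("z", "B", "C", "D", "E"),
                 c2 = c("AA", "zz", "CC", "DD", "EE"),
                 stringsAsFactors = TRUE)
v <- colnames(d1)

# Hypothetical helper: rebuild `new` on the levels of `seen`.
# Values outside those levels become NA and the unseen level is dropped.
alignLevels <- function(seen, new) {
  factor(as.character(new), levels = levels(seen))
}

d2[v] <- Map(alignLevels, d1[v], d2[v])
levels(d2$c1)
```

Whether the leftover level matters depends on the model being used, so this step may be unnecessary when only the values need masking.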