Compare Similar Columns in Data Frames, Replace Differences with NA

I’m attempting to write a function that compares the factor columns sharing the same column names in two data frames. It should identify values in columns c1 and c2 of data frame d2 that do not occur in the corresponding columns of data frame d1, and replace them with NA. The function fails to return the correct result, which should be NA in d2 for the z value in c1 and the zz value in c2.

    c1 <- c("A", "B", "C", "D", "E")
    c2 <- c("AA", "BB", "CC", "DD", "EE")
    d1 <- data.frame(c1, c2)

    c1 <- c("z", "B", "C", "D", "E")
    c2 <- c("AA", "zz", "CC", "DD", "EE")
    d2 <- data.frame(c1, c2)

    v       <- colnames(d1)
    replace <- NA
    x       <- d2[v]

    repFact = function(x, d1, replace){
        x1 <- unique(d1[,v])
        y  <- x
        id <- which(!(y %in% x1))
        x[id, v] <- NA
        x
        return(x)
    }
    d2[v] <- lapply(d2[v], repFact, d1[v], replace)

I’m using this R code to prepare prediction data and am trying to remove unseen factor levels in d2, replacing them with NA or a seen factor level, so that the prediction function (caret) does not fail.

Any ideas are appreciated; however, I’d like to keep using the which and lapply functions if possible.

Solution:

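We can use Map instead of lapply to replace values in each ‘d2’ column based on the corresponding ‘d1’ column. lapply iterates over the columns of ‘d2’ only and hands the whole of d1[v] to every call, whereas Map walks the two sets of columns in parallel. The repFact function is modified as well: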

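# x: reference column (from d1), y: column to clean (from d2)
repFact <- function(x, y, replaceVal)
{
  # setdiff(y, x) gives the values of y that never occur in x
  replace(y, y %in% setdiff(y, x), replaceVal)
}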

Testing:

d2[v] <-  Map(repFact, d1[v], d2[v], MoreArgs = list(replaceVal = NA))
> d2
    c1   c2
1 <NA>   AA
2    B <NA>
3    C   CC
4    D   DD
5    E   EE
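
Since the question asks to keep which and lapply, here is a minimal sketch of one way to do that: iterate over the column names rather than the columns, so that each call can look up the matching column in both data frames. The helper name repFactByName and its arguments ref and target are hypothetical, not part of the original code, and the data is rebuilt here so the block runs on its own.

```r
# Example data from the question (strings kept as character vectors)
d1 <- data.frame(c1 = c("A", "B", "C", "D", "E"),
                 c2 = c("AA", "BB", "CC", "DD", "EE"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(c1 = c("z", "B", "C", "D", "E"),
                 c2 = c("AA", "zz", "CC", "DD", "EE"),
                 stringsAsFactors = FALSE)
v <- colnames(d1)

# Hypothetical helper: takes a column NAME, so the same-named column
# can be pulled from both the reference and the target data frame
repFactByName <- function(nm, ref, target, replaceVal = NA) {
  x  <- target[[nm]]
  id <- which(!(x %in% unique(ref[[nm]])))  # positions of unseen values
  x[id] <- replaceVal
  x
}

# lapply over the names; the result is a list with one cleaned column per name
d2[v] <- lapply(v, repFactByName, ref = d1, target = d2)
```

Passing names instead of columns is what lets lapply pair the columns up; Map does the same pairing without the lookup.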

Alternatively, with dplyr from the tidyverse, we can mutate across the columns named in v for ‘d2’ and apply repFact to each, passing the corresponding ‘d1’ column as the reference (cur_column() returns the name of the column currently being processed). The ‘d1’ column goes first and the ‘d2’ column (.x) second, matching the argument order repFact expects:

library(dplyr)
d2 <- d2 %>%
   mutate(across(all_of(v), ~ repFact(d1[[cur_column()]], .x, NA)))
d2
    c1   c2
1 <NA>   AA
2    B <NA>
3    C   CC
4    D   DD
5    E   EE
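
Because the stated goal is prediction data with factor columns, another option is to rebuild each ‘d2’ factor on the levels of the matching ‘d1’ column. This is a sketch that assumes the columns really are factors (hence stringsAsFactors = TRUE below); alignLevels is a hypothetical helper name. Values outside the reference levels become NA, and the unseen levels are dropped from the factor as well.

```r
# Example data from the question, this time stored as factors
d1 <- data.frame(c1 = c("A", "B", "C", "D", "E"),
                 c2 = c("AA", "BB", "CC", "DD", "EE"),
                 stringsAsFactors = TRUE)
d2 <- data.frame(c1 = c("z", "B", "C", "D", "E"),
                 c2 = c("AA", "zz", "CC", "DD", "EE"),
                 stringsAsFactors = TRUE)
v <- colnames(d1)

# Hypothetical helper: re-create the target factor using only the
# reference levels; anything not among those levels turns into NA
alignLevels <- function(ref, target) {
  factor(as.character(target), levels = levels(ref))
}

# Map pairs each d1 column with the same-named d2 column
d2[v] <- Map(alignLevels, d1[v], d2[v])
```

After this, levels(d2$c1) matches levels(d1$c1), so a model fitted on ‘d1’ should no longer see unknown levels in ‘d2’; how the resulting NA values are handled depends on the prediction function.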