Home Merging a column existing in multiple datasets to the same column in target data

Questions

Merging a column existing in multiple datasets to the same column in target data

November 12, 2022

I have data as follows:

library(data.table)
all_questions <- fread("Variable_codes_2022 Variables_2022
              Cat1_1              This_question
              Cat1_2              Other_question
              Cat2_1              One_question
              Cat2_2              Another_question
              Cat3_1              Some_question
              Cat3_2              Extra_question
              Cat3_3              This_question
              Cat4_1              One_question
              Cat4_2              Wrong_question")

# I have other datasets that have the same questions with a key:

dat1 <- fread("Variable_codes Variables        Key
              Cat1              This_question   A1
              Cat1              Other_question  B3")

dat2 <- fread("Variable_codes Variables        Key
              Cat2              One_question    A7
              Cat2              Another_question C8")

I would like add the matching keys from dat1 and dat2 to all_questions, which I tried to do as follows:

all_questions <- merge(all_questions, dat1, by.x="Variables_2022", by.y="Variables", all.x=TRUE)

     Variables_2022 Variable_codes_2022 Variable_codes  Key
1: Another_question              Cat2_2           <NA> <NA>
2:   Extra_question              Cat3_2           <NA> <NA>
3:     One_question              Cat2_1           <NA> <NA>
4:     One_question              Cat4_1           <NA> <NA>
5:   Other_question              Cat1_2           Cat1   B3
6:    Some_question              Cat3_1           <NA> <NA>
7:    This_question              Cat1_1           Cat1   A1
8:    This_question              Cat3_3           Cat1   A1
9:   Wrong_question              Cat4_2           <NA> <NA>

The problem is that when I do it in this manner, when I try to merge the second data frame, the keys get spread over multiple columns with suffixes:

all_questions <- merge(all_questions, dat2, by.x="Variables_2022", by.y="Variables", all.x=TRUE)

     Variables_2022 Variable_codes_2022 Variable_codes.x Key.x Variable_codes.y Key.y
1: Another_question              Cat2_2             <NA>  <NA>             Cat2    C8
2:   Extra_question              Cat3_2             <NA>  <NA>             <NA>  <NA>
3:     One_question              Cat2_1             <NA>  <NA>             Cat2    A7
4:     One_question              Cat4_1             <NA>  <NA>             Cat2    A7
5:   Other_question              Cat1_2             Cat1    B3             <NA>  <NA>
6:    Some_question              Cat3_1             <NA>  <NA>             <NA>  <NA>
7:    This_question              Cat1_1             Cat1    A1             <NA>  <NA>
8:    This_question              Cat3_3             Cat1    A1             <NA>  <NA>
9:   Wrong_question              Cat4_2             <NA>  <NA>             <NA>  <NA>

How can I add the Keys of dat1 and dat2 to all_questions, in a way that this does not happen?

>Solution :

We may do a join

library(data.table)
all_questions[dat1, Key1 := i.Key, on = .(Variables_2022 = Variables)]
all_questions[dat2, Key2 := i.Key, on = .(Variables_2022 = Variables)]

-output

> all_questions
   Variable_codes_2022   Variables_2022   Key1   Key2
                <char>           <char> <char> <char>
1:              Cat1_1    This_question     A1   <NA>
2:              Cat1_2   Other_question     B3   <NA>
3:              Cat2_1     One_question   <NA>     A7
4:              Cat2_2 Another_question   <NA>     C8
5:              Cat3_1    Some_question   <NA>   <NA>
6:              Cat3_2   Extra_question   <NA>   <NA>
7:              Cat3_3    This_question     A1   <NA>
8:              Cat4_1     One_question   <NA>     A7
9:              Cat4_2   Wrong_question   <NA>   <NA>

If there are multiple datasets, do it in a loop and if it needs only a single ‘Key’ column

lst1 <- list(dat1, dat2)
all_questions[, Key := NA_character_]

for(i in seq_along(lst1)) all_questions[lst1[[i]], 
  Key := fcoalesce(i.Key, Key), on = .(Variables_2022 = Variables)]

-output

> all_questions
   Variable_codes_2022   Variables_2022    Key
                <char>           <char> <char>
1:              Cat1_1    This_question     A1
2:              Cat1_2   Other_question     B3
3:              Cat2_1     One_question     A7
4:              Cat2_2 Another_question     C8
5:              Cat3_1    Some_question   <NA>
6:              Cat3_2   Extra_question   <NA>
7:              Cat3_3    This_question     A1
8:              Cat4_1     One_question     A7
9:              Cat4_2   Wrong_question   <NA>