Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Several columns matching over 2 data.frames

I have two large, messy datasets. The structure of the variables for this issue roughly looks like this:

First, a dataset containing some IDs.

DF1 <- tibble(
  ID1 = c(1, 2, 3), 
  ID2 = c(4, 5, NA),
  ID3 = c(9, NA, NA)
)
    
> DF1
# A tibble: 3 × 3
    ID1   ID2   ID3
  <dbl> <dbl> <dbl>
1     1     4     9
2     2     5    NA
3     3    NA    NA

Secondly, a dataset containing IDs as well as titles connected to those IDs

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

DF2 <- tibble(
  ID1_2 = c(2, 4, 1), 
  ID2_2 = c(5, NA, NA), 
  ID3_2 = c(10, NA, NA), 
  Title = c("Journal X", "Journal Y", "Journal Z")
)

> DF2
# A tibble: 3 × 4
  ID1_2 ID2_2 ID3_2 Title    
  <dbl> <dbl> <dbl> <chr>    
1     2     5    10 Journal X
2     4    NA    NA Journal Y
3     1    NA    NA Journal Z

The IDs I’m working with might have several titles connected to them, and the titles might have several IDs connected as well.

What I am looking to do is to obtain a column containing titles (from Title in DF2) based upon whether any IDs in the columns ID1, ID2 or ID3 matches any of the IDs in ID1_2, ID2_2 or ID3_2. Thus, if any of the IDs in a row of DF1 matches any ID in DF2, the value of Match_titles should be the value of Title in DF2. However, there might me several matches, why this value should state each match, separated by a comma. To illustrate, the goal data should look like the following:

DF3 <- tibble(
  ID1 = c(1, 2, 3), 
  ID2 = c(4, 5, NA),
  ID3 = c(9, NA, NA), 
  Match_titles = c("Journal Z, Journal Y", 
                   "Journal X, Journal X" ,
                   NA)
)

> DF3
# A tibble: 3 × 4
    ID1   ID2   ID3 Match_titles        
  <dbl> <dbl> <dbl> <chr>               
1     1     4     9 Journal Z, Journal Y
2     2     5    NA Journal X, Journal X
3     3    NA    NA NA                  

I find it a bit challenging to explain, but I hope it is clear – any help is absolutely helpful 🙂

>Solution :

First let’s pivot DF2 longer:

df2_longer <- DF2 %>% pivot_longer(cols = c("ID1_2", "ID2_2", "ID3_2"))
# A tibble: 9 x 3
  Title     name  value
  <chr>     <chr> <dbl>
1 Journal X ID1_2     2
2 Journal X ID2_2     5
3 Journal X ID3_2    10
4 Journal Y ID1_2     4
5 Journal Y ID2_2    NA
6 Journal Y ID3_2    NA
7 Journal Z ID1_2     1
8 Journal Z ID2_2    NA
9 Journal Z ID3_2    NA

This then can be more easily done by:

res <- DF1 %>% group_by(ID1, ID2, ID3) %>%  
mutate(match_list = list(df2_longer$Title[df2_longer$value %in% c(ID1, ID2, ID3)[!is.na(c(ID1, ID2, ID3))]]) )

res$match_list

> res$match_list
[[1]]
[1] "Journal Y" "Journal Z"

[[2]]
[1] "Journal X" "Journal X"

[[3]]
character(0)

As you can see Match_list is a list of matching elements which we can now paste together.

res <- res %>% mutate(Match_titles = paste(unlist(match_list), collapse = ", "))
res

# A tibble: 3 x 5
# Groups:   ID1, ID2, ID3 [3]
    ID1   ID2   ID3 match_list Match_titles          
  <dbl> <dbl> <dbl> <list>     <chr>                 
1     1     4     9 <chr [2]>  "Journal Y, Journal Z"
2     2     5    NA <chr [2]>  "Journal X, Journal X"
3     3    NA    NA <chr [0]>  ""    
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading