Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Inner join data.frames using multiple arguments in R dplyr

I have two data frames that look like this

library(dplyr)

#> Attaching package: 'dplyr'


df1 <- tibble(chrom=c(1,1,1,2,2,2),
              start=c(100,200,300,100,200,300),
              end=c(150,250,350,120,220,320))

df2 <- tibble(chrom=c(1,1,1,2,2,2),
              start2=c(100,50,280,100,10,200),
              end2=c(125,100,320,115,15,350))

df1 
#> # A tibble: 6 × 3
#>   chrom start   end
#>   <dbl> <dbl> <dbl>
#> 1     1   100   150
#> 2     1   200   250
#> 3     1   300   350
#> 4     2   100   120
#> 5     2   200   220
#> 6     2   300   320
df2
#> # A tibble: 6 × 3
#>   chrom start2  end2
#>   <dbl>  <dbl> <dbl>
#> 1     1    100   125
#> 2     1     50   100
#> 3     1    280   320
#> 4     2    100   115
#> 5     2     10    15
#> 6     2    200   350

Created on 2023-01-09 with reprex v2.0.2

When I am trying to join based on the following conditions I am taking a huge error
any ideas why

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

inner_join(df2, df1, by = join_by(chrom, start< end2, end > start2))

The error

Error in `inner_join()`:
! Join columns in `x` must be present in the data.
✖ Problem with `start` and `end`.
Run `rlang::last_error()` to see where the error occurred.

This is the expected outcome

  chrom start end start2 end2
1:     1   100 150    100  125
2:     1    NA  NA     50  100
3:     1   300 350    280  320
4:     2   100 120    100  115
5:     2    NA  NA     10   15
6:     2   200 220    200  350
7:     2   300 320    200  350

>Solution :

You list df2 first in the inner_join, its variables need to be listed on the LHS of the comparisons.

You can either swap df1/df2 or swap the order of the comparison variables (effectively the same given an inner join):

inner_join(df2, df1, by = join_by(chrom, end2 > start, start2 < end))
# # A tibble: 5 × 5
#   chrom start2  end2 start   end
#   <dbl>  <dbl> <dbl> <dbl> <dbl>
# 1     1    100   125   100   150
# 2     1    280   320   300   350
# 3     2    100   115   100   120
# 4     2    200   350   200   220
# 5     2    200   350   300   320

inner_join(df1, df2, by = join_by(chrom, start < end2, end > start2))
# # A tibble: 5 × 5
#   chrom start   end start2  end2
#   <dbl> <dbl> <dbl>  <dbl> <dbl>
# 1     1   100   150    100   125
# 2     1   300   350    280   320
# 3     2   100   120    100   115
# 4     2   200   220    200   350
# 5     2   300   320    200   350

Your expected output suggests a sided-join, where these two are effectively equivalent:

left_join(df2, df1, by = join_by(chrom, end2 > start, start2 < end))
# # A tibble: 7 × 5
#   chrom start2  end2 start   end
#   <dbl>  <dbl> <dbl> <dbl> <dbl>
# 1     1    100   125   100   150
# 2     1     50   100    NA    NA
# 3     1    280   320   300   350
# 4     2    100   115   100   120
# 5     2     10    15    NA    NA
# 6     2    200   350   200   220
# 7     2    200   350   300   320
right_join(df1, df2, by = join_by(chrom, start < end2, end > start2))
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading