Home Match one dataframe based on a range in another dataframe in R tidyverse

Questions

Match one dataframe based on a range in another dataframe in R tidyverse

March 3, 2022

I have two large datasets that want to match with each other

library(tidyverse)

df1 <- tibble(position=c(10,11,200,250,300))
df1
#> # A tibble: 5 × 1
#>   position
#>      <dbl>
#> 1       10
#> 2       11
#> 3      200
#> 4      250
#> 5      300

df2 <- tibble(start=c(1,10,200,251),
              end=c(20,100,250,350),
              name=c("geneA","geneB","geneC","geneD"))
df2
#> # A tibble: 4 × 3
#>   start   end name 
#>   <dbl> <dbl> <chr>
#> 1     1    20 geneA
#> 2    10   100 geneB
#> 3   200   250 geneC
#> 4   251   350 geneD

^{Created on 2022-03-03 by the reprex package (v2.0.1)}

I have the position of the genes in the df1 and I want to find based on the range (start-end) from the df2 how many genes can be found in this position.

I want my data to look like this

  position start   end name 
     <dbl> <dbl> <dbl> <chr>
1       10     1    20 geneA
2       10    10   100 geneB
3       11     1    20 geneA
4       11    10   100 geneB
5      200   200   250 geneC
6      250   200   250 geneC
7      300   251   350 geneD

One way to solve this could be through crossing and filtering

df1 %>% 
  crossing(df2) %>% 
  filter(position >= start & position <= end)

However my dataset is way too large and can not afford crossing or expanding. Any other idea?

>Solution :

1) SQL engines can perform such operations without crossing. (It may be possible to speed it up even more if you add indexes.)

library(sqldf)

sqldf("select *
  from df1 a
  join df2 b on a.position between b.start and b.end")

2) data.table can also do some sql-like operations. (Be careful because the first variable in each comparison must be from the first data table and the second from the second. They can’t be reordered so, for example, the first comparison could not be written as position <= start even though it is mathematically the same.) Again, adding indexes may improve the speed.

library(data.table)

dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)
dt2[dt1, on = .(start <= position, end >= position)]

tidyr

byMR

Published March 03, 2022

Add a comment

My shiny app does not render R code in markdown

byMR

March 3, 2022

Questions

Seaborn with multiple Columns

byMR

March 3, 2022

Questions

Java generic method Cannot resolve method 'getX()' in 'T'

byMR

March 3, 2022

Questions

How to print a list containing custom objects from a class?

byMR

March 3, 2022

Questions

Fisher score error – Length of value does not match length of index?

byMR

March 3, 2022

Questions

Using JavaScript Promise All – is this syntax valid?

byMR

March 3, 2022

Match one dataframe based on a range in another dataframe in R tidyverse

MEDevel.com: Open-source for Healthcare and Education

>Solution :

Like this:

Leave a ReplyCancel reply

Read more

My shiny app does not render R code in markdown

Seaborn with multiple Columns

Java generic method Cannot resolve method 'getX()' in 'T'

How to print a list containing custom objects from a class?

Fisher score error – Length of value does not match length of index?

Using JavaScript Promise All – is this syntax valid?

Keep Up to Date with the Most Important News

Match one dataframe based on a range in another dataframe in R tidyverse

MEDevel.com: Open-source for Healthcare and Education

>Solution :

Share this:

Like this:

Leave a ReplyCancel reply

Keep Up to Date with the Most Important News

Read more

My shiny app does not render R code in markdown

Seaborn with multiple Columns

Java generic method Cannot resolve method 'getX()' in 'T'

How to print a list containing custom objects from a class?

Fisher score error – Length of value does not match length of index?

Using JavaScript Promise All – is this syntax valid?

Discover more from Dev solutions