Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Generating True/False column in one dataframe based on multiple conditions in another dataframe in R

I have two dataframes of different length. The first looks like this and is the dataframe I want to add the True/False column to:

chr_snp    loc_snp     ma_snp
1          184319928   T
1          276998062   A
1          278255864   G
2          243012470   G
2          123072103   T
3          526785124   A

The second data frame is the reference dataframe that is smaller:

chr_QTL    loc_QTL     ma_QTL
1          281788173   G   
1          203085725   C
2          241577141   C

For each row in dataframe 1 (df1), I want to first check if the value of df1$chr_snp matches a value in df2$chr_QTL. If this match is true, then I want to determine if the value in df1$loc_snp is within 10 million units (these are DNA base-pairs) above OR below any values based on the first condition in df2$loc_QTL. Now, what is tricky is that for the first three rows of df1, there are three possible row matches in df2 (rows 1 and 2) based on the first criteria alone. However, only two match based on the second criteria (10M base-pairs greater than OR less than value in df2$loc_QTL). Note: df1$ma_snp and df2$ma_QTL can be totally ignored. So, based on these criteria, df1 should now look like:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

chr_snp    loc_snp     ma_snp   Match
1          184319928   T        FALSE
1          276998062   A        TRUE
1          278255864   G        TRUE
2          243012470   G        TRUE
2          123072103   T        FALSE
3          526785124   A        FALSE

>Solution :

Here is one option. Join the data then look for any instance of < 10,000,000.

library(tidyverse)

left_join(df1 |>
            mutate(rw_id = row_number()), 
          df2, by = c("chr_snp" = "chr_QTL"))  |>
  mutate(less = abs(loc_snp -loc_QTL) < 10e6) |>
  group_by(rw_id)|>
  summarise(across(contains(colnames(df1)), ~.[[1]]),
            Match = any(less),
            Match = ifelse(is.na(Match), FALSE, Match))
#> # A tibble: 6 x 5
#>   rw_id chr_snp   loc_snp ma_snp Match
#>   <int>   <dbl>     <dbl> <chr>  <lgl>
#> 1     1       1 184319928 T      FALSE
#> 2     2       1 276998062 A      TRUE 
#> 3     3       1 278255864 G      TRUE 
#> 4     4       2 243012470 G      TRUE 
#> 5     5       2 123072103 T      FALSE
#> 6     6       3 526785124 A      FALSE

or another option:

library(tidyverse)

df1 |>
  mutate(Match = map2_lgl(chr_snp, loc_snp, 
                          \(x, y){
                            (x %in% df2$chr_QTL) && 
                              any(abs(df2[df2$chr_QTL == x, 'loc_QTL']-y) < 10e6)
                          }))
#> # A tibble: 6 x 4
#>   chr_snp   loc_snp ma_snp Match
#>     <dbl>     <dbl> <chr>  <lgl>
#> 1       1 184319928 T      FALSE
#> 2       1 276998062 A      TRUE 
#> 3       1 278255864 G      TRUE 
#> 4       2 243012470 G      TRUE 
#> 5       2 123072103 T      FALSE
#> 6       3 526785124 A      FALSE

just for fun. Here is a third option with base:

df1$Match <- apply(outer(df1$loc_snp, df2$loc_QTL, \(x,y) abs(x-y) < 10e6) &
 outer(df1$chr_snp, df2$chr_QTL, `==`), 1, any)
  
df1
#> # A tibble: 6 x 4
#>   chr_snp   loc_snp ma_snp Match
#>     <dbl>     <dbl> <chr>  <lgl>
#> 1       1 184319928 T      FALSE
#> 2       1 276998062 A      TRUE 
#> 3       1 278255864 G      TRUE 
#> 4       2 243012470 G      TRUE 
#> 5       2 123072103 T      FALSE
#> 6       3 526785124 A      FALSE
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading