Home Generating True/False column in one dataframe based on multiple conditions in another dataframe in R

Questions

Generating True/False column in one dataframe based on multiple conditions in another dataframe in R

November 5, 2022

I have two dataframes of different length. The first looks like this and is the dataframe I want to add the True/False column to:

chr_snp    loc_snp     ma_snp
1          184319928   T
1          276998062   A
1          278255864   G
2          243012470   G
2          123072103   T
3          526785124   A

The second data frame is the reference dataframe that is smaller:

chr_QTL    loc_QTL     ma_QTL
1          281788173   G   
1          203085725   C
2          241577141   C

For each row in dataframe 1 (df1), I want to first check if the value of df1$chr_snp matches a value in df2$chr_QTL. If this match is true, then I want to determine if the value in df1$loc_snp is within 10 million units (these are DNA base-pairs) above OR below any values based on the first condition in df2$loc_QTL. Now, what is tricky is that for the first three rows of df1, there are three possible row matches in df2 (rows 1 and 2) based on the first criteria alone. However, only two match based on the second criteria (10M base-pairs greater than OR less than value in df2$loc_QTL). Note: df1$ma_snp and df2$ma_QTL can be totally ignored. So, based on these criteria, df1 should now look like:

chr_snp    loc_snp     ma_snp   Match
1          184319928   T        FALSE
1          276998062   A        TRUE
1          278255864   G        TRUE
2          243012470   G        TRUE
2          123072103   T        FALSE
3          526785124   A        FALSE

>Solution :

Here is one option. Join the data then look for any instance of < 10,000,000.

library(tidyverse)

left_join(df1 |>
            mutate(rw_id = row_number()), 
          df2, by = c("chr_snp" = "chr_QTL"))  |>
  mutate(less = abs(loc_snp -loc_QTL) < 10e6) |>
  group_by(rw_id)|>
  summarise(across(contains(colnames(df1)), ~.[[1]]),
            Match = any(less),
            Match = ifelse(is.na(Match), FALSE, Match))
#> # A tibble: 6 x 5
#>   rw_id chr_snp   loc_snp ma_snp Match
#>   <int>   <dbl>     <dbl> <chr>  <lgl>
#> 1     1       1 184319928 T      FALSE
#> 2     2       1 276998062 A      TRUE 
#> 3     3       1 278255864 G      TRUE 
#> 4     4       2 243012470 G      TRUE 
#> 5     5       2 123072103 T      FALSE
#> 6     6       3 526785124 A      FALSE

or another option:

library(tidyverse)

df1 |>
  mutate(Match = map2_lgl(chr_snp, loc_snp, 
                          \(x, y){
                            (x %in% df2$chr_QTL) && 
                              any(abs(df2[df2$chr_QTL == x, 'loc_QTL']-y) < 10e6)
                          }))
#> # A tibble: 6 x 4
#>   chr_snp   loc_snp ma_snp Match
#>     <dbl>     <dbl> <chr>  <lgl>
#> 1       1 184319928 T      FALSE
#> 2       1 276998062 A      TRUE 
#> 3       1 278255864 G      TRUE 
#> 4       2 243012470 G      TRUE 
#> 5       2 123072103 T      FALSE
#> 6       3 526785124 A      FALSE

just for fun. Here is a third option with base:

df1$Match <- apply(outer(df1$loc_snp, df2$loc_QTL, \(x,y) abs(x-y) < 10e6) &
 outer(df1$chr_snp, df2$chr_QTL, `==`), 1, any)
  
df1
#> # A tibble: 6 x 4
#>   chr_snp   loc_snp ma_snp Match
#>     <dbl>     <dbl> <chr>  <lgl>
#> 1       1 184319928 T      FALSE
#> 2       1 276998062 A      TRUE 
#> 3       1 278255864 G      TRUE 
#> 4       2 243012470 G      TRUE 
#> 5       2 123072103 T      FALSE
#> 6       3 526785124 A      FALSE