I have a large data frame that looks like this. I want to find which genes overlap each other, based on their start and end positions.
library(tidyverse)
data <- data.frame(group = c(1, 1, 1, 2, 2, 2),
                   genes = c("A", "B", "C", "D", "E", "F"),
                   start = c(1000, 2000, 3000, 800, 400, 2000),
                   end   = c(1500, 2500, 3500, 1200, 500, 10000))
data
#> group genes start end
#> 1 1 A 1000 1500
#> 2 1 B 2000 2500
#> 3 1 C 3000 3500
#> 4 2 D 800 1200
#> 5 2 E 400 500
#> 6 2 F 2000 10000
Created on 2022-12-05 with reprex v2.0.2
I want something like this.
data
#> group genes start end match
#> 1 1 A 1000 1500 A-D
#> 2 1 B 2000 2500 B-F
#> 3 1 C 3000 3500 C-F
#> 4 2 D 800 1200 A-D
#> 5 2 E 400 500 NA
#> 6 2 F 2000 10000 F-C-B
I am a bit lost on how to start.
Any help is appreciated.
>Solution :
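With the devel version of dplyr (these features were released in dplyr 1.1.0), we can use a non-equi self-join with join_by() and overlaps():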
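library(dplyr)
library(stringr)
# Join condition: keep row pairs whose [start, end] ranges overlap
by <- join_by(overlaps(x$start, x$end, y$start, y$end))
full_join(data, data, by) %>%
  group_by(genes = genes.x) %>%
  # Every gene overlaps itself, so a single row means no other match
  summarise(match = if (n() == 1) NA_character_ else
    str_c(genes.y, collapse = '-')) %>%
  # Attach the match column back onto the original data
  left_join(data, ., by = "genes")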
-output
group genes start end match
1 1 A 1000 1500 A-D
2 1 B 2000 2500 B-F
3 1 C 3000 3500 C-F
4 2 D 800 1200 A-D
5 2 E 400 500 <NA>
6 2 F 2000 10000 B-C-F
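
For readers without a recent dplyr, the same overlap test can be sketched in base R. This is only an illustration of the condition the join relies on, assuming closed intervals: two ranges overlap when each one starts no later than the other ends. The names ov and hits are made up for this sketch, and building the full pairwise matrix with outer() needs n x n memory, so it may not suit a very large data frame.

```r
# Same example data as in the question
data <- data.frame(group = c(1, 1, 1, 2, 2, 2),
                   genes = c("A", "B", "C", "D", "E", "F"),
                   start = c(1000, 2000, 3000, 800, 400, 2000),
                   end   = c(1500, 2500, 3500, 1200, 500, 10000))

# le[i, j] is TRUE when start[i] <= end[j]
le <- outer(data$start, data$end, `<=`)

# ov[i, j] is TRUE when rows i and j overlap:
# start[i] <= end[j] AND start[j] <= end[i]
ov <- le & t(le)

# Collapse the overlapping gene names per row; a row that only
# overlaps itself gets NA
data$match <- vapply(seq_len(nrow(data)), function(i) {
  hits <- data$genes[ov[i, ]]
  if (length(hits) == 1) NA_character_ else paste(hits, collapse = "-")
}, character(1))

data
```

Like the join above, this compares every gene with every other gene regardless of the group column.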