Gather overlapping coordinates columns within same groups in R

March 29, 2022

I have a dataframe such as

    Seq Chrm  start  end  length  score
0     A   C1      1   50      49     12
1     B   C1      3   55      52     12
2     C   C1      6   60      54     12
3  Cbis   C1      6   60      54     11
4     D   C1     70  120      50     12
5     E   C1     78  111      33     12
6     F   C2    350  400      50     12
7     A   C2    349  400      51     12
8     B   C2    450  500      50     12

Within each Chrm, I would like to group the rows whose start to end ranges overlap and, for each group of overlapping rows, keep only the row with the greatest length value, using the highest score value to break ties.

For example in C1:

Seq    Chrm start end  length score
A      C1   1     50   49     12
B      C1   3     55   52     12
C      C1   6     60   54     12
Cbis   C1   6     60   54     11
D      C1   70    120  50     12
E      C1   78    111  33     12

The start to end coordinates of A, B, C and Cbis overlap one another, and those of D and E overlap each other.

In the A, B, C, Cbis group, the longest are C and Cbis with 54, so I keep the one with the highest score, which is **C** (12). In the **D, E** group, the longest is **D** with `50`.
So I keep only rows C and D here.

If I do the same for the other Chrm values, I should get the following output:

Seq Chrm start end  length score
C   C1   6     60   54     12
D   C1   70    120  50     12
A   C2   349   400  51     12
B   C2   450   500  50     12

Here is the dataframe in dput format, if that helps:

structure(list(Seq = c("A", "B", "C", "Cbis", "D", "E", "F", 
"A", "B"), Chrm = c("C1", "C1", "C1", "C1", "C1", "C1", "C2", 
"C2", "C2"), start = c(1L, 3L, 6L, 6L, 70L, 78L, 350L, 349L, 
450L), end = c(50L, 55L, 60L, 60L, 120L, 111L, 400L, 400L, 500L
), length = c(49L, 52L, 54L, 54L, 50L, 33L, 50L, 51L, 50L), score = c(12L, 
12L, 12L, 11L, 12L, 12L, 12L, 12L, 12L)), class = "data.frame", row.names = c(NA, 
-9L))

>Solution :

Using tidyverse functions:

library(tidyverse)

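# dat is the data frame built from the dput() output in the question
dat %>% 
  group_by(Chrm) %>% 
  arrange(start, end) %>% 
  # cum increments after each row whose end lies before the next row's start
  group_by(cum = head(c(0, cumsum((end < lead(start)) | (end > lead(start) & start > lead(end)))), -1)) %>%
  # longest first, ties broken by highest score; desc() takes one column at a time
  arrange(desc(length), desc(score)) %>% 
  slice_head(n = 1)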

  Seq   Chrm  start   end length score   cum
  <chr> <chr> <int> <int>  <int> <int> <dbl>
1 C     C1        6    60     54    12     0
2 D     C1       70   120     50    12     1
3 A     C2      349   400     51    12     2
4 B     C2      450   500     50    12     3
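
A possible caveat: in the output above, cum runs from 0 to 3 across both chromosomes, which suggests the overlap groups are built over the whole table rather than within each Chrm. The comparison also only looks at the directly following row. The variant below is a minimal sketch, not the original answer's method, that builds the groups per Chrm and compares each start with the furthest end seen so far. The column name `cluster` is a hypothetical helper, and the sketch assumes that start <= end in every row and that a range starting exactly at a previous end counts as overlapping.

```r
library(dplyr)

# Data from the question's dput() output
dat <- data.frame(
  Seq    = c("A", "B", "C", "Cbis", "D", "E", "F", "A", "B"),
  Chrm   = c("C1", "C1", "C1", "C1", "C1", "C1", "C2", "C2", "C2"),
  start  = c(1L, 3L, 6L, 6L, 70L, 78L, 350L, 349L, 450L),
  end    = c(50L, 55L, 60L, 60L, 120L, 111L, 400L, 400L, 500L),
  length = c(49L, 52L, 54L, 54L, 50L, 33L, 50L, 51L, 50L),
  score  = c(12L, 12L, 12L, 11L, 12L, 12L, 12L, 12L, 12L)
)

res <- dat %>%
  arrange(Chrm, start, end) %>%
  group_by(Chrm) %>%
  # A new cluster starts when a row begins after the furthest end seen so far
  # within the same Chrm (cluster is a hypothetical helper column)
  mutate(cluster = cumsum(start > lag(cummax(end), default = first(end)))) %>%
  group_by(Chrm, cluster) %>%
  # Within each cluster: longest first, ties broken by highest score
  arrange(desc(length), desc(score), .by_group = TRUE) %>%
  slice_head(n = 1) %>%
  ungroup()

print(res)
```

On the question's data this should again keep C and D for C1 and A and B for C2. Using cummax(end) rather than the previous row's end is meant to keep a long range and everything nested inside it in one cluster.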