Gather overlapping coordinates columns within same groups in R

I have a dataframe such as

    Seq Chrm  start  end  length  score
0     A   C1      1   50      49     12
1     B   C1      3   55      52     12
2     C   C1      6   60      54     12
3  Cbis   C1      6   60      54     11
4     D   C1     70  120      50     12
5     E   C1     78  111      33     12
6     F   C2    350  400      50     12
7     A   C2    349  400      51     12
8     B   C2    450  500      50     12

Within each Chrm, I would like to group the rows whose start-end coordinates overlap and, in each group, keep only the row with the greatest length, using the highest score to break ties.

For example in C1:

Seq    Chrm start end  length score
A      C1   1     50   49     12
B      C1   3     55   52     12
C      C1   6     60   54     12
Cbis   C1   6     60   54     11
D      C1   70    120  50     12
E      C1   78    111  33     12
 

The start-end coordinates of A, B, C and Cbis overlap one another, and those of D and E overlap each other.

In the A, B, C, Cbis group, the longest rows are C and Cbis with a length of 54, so I keep the one with the highest score, which is **C** (12). In the **D, E** group, the longest is **D** with a length of 50.
So here I keep only rows C and D.
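The grouping rule described above can be sketched in base R as follows. This is only an illustration on the C1 rows: the data frame name `c1`, the helper vector `prev_max_end` and the column name `cluster` are assumptions made for this example, and a row is taken to open a new group when its start lies beyond the furthest end seen so far.

```r
# C1 rows from the example, already sorted by start
c1 <- data.frame(
  Seq   = c("A", "B", "C", "Cbis", "D", "E"),
  start = c(1L, 3L, 6L, 6L, 70L, 78L),
  end   = c(50L, 55L, 60L, 60L, 120L, 111L),
  stringsAsFactors = FALSE
)

# Furthest end among all previous rows (NA for the first row)
prev_max_end <- c(NA, head(cummax(c1$end), -1))

# A row opens a new cluster when it starts after every earlier interval has ended
new_cluster <- is.na(prev_max_end) | c1$start > prev_max_end

# Running count of cluster openings gives a cluster id per row
c1$cluster <- cumsum(new_cluster)
print(c1)
```

With these rows, A, B, C and Cbis share one cluster id and D and E share another, which matches the two groups described above.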

Doing the same for the other Chrm values, I should get the following output:

Seq Chrm start end  length score
C   C1   6     60   54     12
D   C1   70    120  50     12
A   C2   349   400  51     12
B   C2   450   500  50     12

Here is the dataframe in dput format, in case it helps:

dat <- structure(list(Seq = c("A", "B", "C", "Cbis", "D", "E", "F", 
"A", "B"), Chrm = c("C1", "C1", "C1", "C1", "C1", "C1", "C2", 
"C2", "C2"), start = c(1L, 3L, 6L, 6L, 70L, 78L, 350L, 349L, 
450L), end = c(50L, 55L, 60L, 60L, 120L, 111L, 400L, 400L, 500L
), length = c(49L, 52L, 54L, 54L, 50L, 33L, 50L, 51L, 50L), score = c(12L, 
12L, 12L, 11L, 12L, 12L, 12L, 12L, 12L)), class = "data.frame", row.names = c(NA, 
-9L))

>Solution :

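Using tidyverse functions:

library(tidyverse)

dat %>% 
  group_by(Chrm) %>% 
  # sort by position so that overlapping rows end up next to each other
  arrange(start, end) %>% 
  # increment the group counter after each row that ends before the next row starts
  group_by(cum = head(c(0, cumsum((end < lead(start)) | (end > lead(start) & start > lead(end)))), -1)) %>%
  # longest first, ties broken by the highest score
  arrange(desc(length), desc(score)) %>% 
  # keep the top row of each overlap group
  slice_head(n = 1)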

  Seq   Chrm  start   end length score   cum
  <chr> <chr> <int> <int>  <int> <int> <dbl>
1 C     C1        6    60     54    12     0
2 D     C1       70   120     50    12     1
3 A     C2      349   400     51    12     2
4 B     C2      450   500     50    12     3
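The pipeline above compares each row only with the one that follows it, and it relies on the C2 coordinates sorting after the C1 ones. A possible variant, offered as a sketch rather than as the answer's own method, sorts and groups within each Chrm explicitly and tracks the furthest end seen so far with `cummax()`, so an interval nested inside an earlier, longer one stays in the same group. The column name `cluster` and the result name `res` are assumptions for this example. Intervals whose start equals an earlier end are treated as overlapping.

```r
library(dplyr)

dat <- data.frame(
  Seq    = c("A", "B", "C", "Cbis", "D", "E", "F", "A", "B"),
  Chrm   = c("C1", "C1", "C1", "C1", "C1", "C1", "C2", "C2", "C2"),
  start  = c(1L, 3L, 6L, 6L, 70L, 78L, 350L, 349L, 450L),
  end    = c(50L, 55L, 60L, 60L, 120L, 111L, 400L, 400L, 500L),
  length = c(49L, 52L, 54L, 54L, 50L, 33L, 50L, 51L, 50L),
  score  = c(12L, 12L, 12L, 11L, 12L, 12L, 12L, 12L, 12L),
  stringsAsFactors = FALSE
)

res <- dat %>%
  # sort by chromosome first, then by position
  arrange(Chrm, start, end) %>%
  group_by(Chrm) %>%
  # a new cluster starts when a row begins after the furthest end seen so far
  mutate(cluster = cumsum(start > lag(cummax(end), default = first(start)))) %>%
  group_by(Chrm, cluster) %>%
  # within each cluster: longest first, ties broken by the highest score
  arrange(desc(length), desc(score), .by_group = TRUE) %>%
  slice_head(n = 1) %>%
  ungroup()

print(res)
```

On the example data this keeps the same four rows as the output above (C and D for C1, A and B for C2). The `cluster` ids restart at 0 within each Chrm.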