How to remove rows of a data frame that include a sub-string that is already contained within another string?

So I have a data frame containing names of species and different DNA sequences for each species. Sometimes the sequences are different within the same species, other times they are similar, although with different sizes.

What I am trying to do is to remove the rows of the data frame that have sequences that are already contained in the sequences of other rows, provided they have a minimum size.

For example, in the example below, if a sequence has a size of at least 5 and is contained in another sequence, I want to delete that row and keep the row in the largest sequence:

 Species           |  sequence      | size
-----------------------------------------
Tilapia guineensis |   AAATGGA      | 7
Tilapia guineensis |   AAATGGAATA   |10
Tilapia guineensis |   AAATGGAATAGAT|13   
Tilapia guineensis |   TTATGGAGTAGA |12     
Sprattus sprattus  |   GTGCA        |5 
Sprattus sprattus  |   GTGCAATGC    |9
Sprattus sprattus  |   GTGCAATGCA   |10
Eutrigla gurnardus |   ACTGACTGATCG |12
Eutrigla gurnardus |   ACTGACT      |7  
Eutrigla gurnardus |   ACGAGTTTGCGAG|13

The output would be this data frame:

 Species           |  sequence      | size
--------------------------------------------
Tilapia guineensis |   AAATGGAATAGAT|13   
Tilapia guineensis |   TTATGGAGTAGA |12        
Sprattus sprattus  |   GTGCAATGCA   |10
Eutrigla gurnardus |   ACTGACTGATCG |12 
Eutrigla gurnardus |   ACGAGTTTGCGAG|13

I have tried using dplyr to group the rows by Species, then use grep to find and remove the sequences that are contained in sequences of other rows. Unfortunately, I am not being able to subset the sequences:

library(dplyr)
df2<-df%>% 
  group_by(Species) %>% 
    df[-(grep(sequence[1:5],sequence)),]

I am getting this error:

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'pattern' in selecting a method for function 'grep': object of type 'closure' is not subsettable

>Solution :

df %>%
  group_by(Species) %>%
  mutate(inother = mapply(function(ptn, rownum) any(grepl(ptn, sequence[-rownum])),
                          sequence, row_number())) %>%
  filter(size >= 5 & !inother) %>%
  ungroup() %>%
  select(-inother)
# # A tibble: 5 × 3
#   Species            sequence       size
#   <chr>              <chr>         <int>
# 1 Tilapia guineensis AAATGGAATAGAT    13
# 2 Tilapia guineensis TTATGGAGTAGA     12
# 3 Sprattus sprattus  GTGCAATGCA       10
# 4 Eutrigla gurnardus ACTGACTGATCG     12
# 5 Eutrigla gurnardus ACGAGTTTGCGAG    13

The mapply code is iterating over each sequence (as ptn), looking to see if it is matched in another string. Because it will always match itself, I exclude its own value from the RHS of the grp using sequence[-rownum], where rownum is provided by row_number() within each group. This might be expanded with more advanced patterns to make sure it doesn’t match equal (but different-row) sequence values, if needed, by updating pattern to be something like sprintf("(.%s|%s.)", ptn, ptn) to look for at least one leading or trailing extra character.

Leave a Reply