So I have a data frame containing names of species and different DNA sequences for each species. Sometimes the sequences are different within the same species, other times they are similar, although with different sizes.
What I am trying to do is to remove the rows of the data frame that have sequences that are already contained in the sequences of other rows, provided they have a minimum size.
For example, in the example below, if a sequence has a size of at least 5 and is contained in another sequence, I want to delete that row and keep the row in the largest sequence:
Species | sequence | size
-----------------------------------------
Tilapia guineensis | AAATGGA | 7
Tilapia guineensis | AAATGGAATA |10
Tilapia guineensis | AAATGGAATAGAT|13
Tilapia guineensis | TTATGGAGTAGA |12
Sprattus sprattus | GTGCA |5
Sprattus sprattus | GTGCAATGC |9
Sprattus sprattus | GTGCAATGCA |10
Eutrigla gurnardus | ACTGACTGATCG |12
Eutrigla gurnardus | ACTGACT |7
Eutrigla gurnardus | ACGAGTTTGCGAG|13
The output would be this data frame:
Species | sequence | size
--------------------------------------------
Tilapia guineensis | AAATGGAATAGAT|13
Tilapia guineensis | TTATGGAGTAGA |12
Sprattus sprattus | GTGCAATGCA |10
Eutrigla gurnardus | ACTGACTGATCG |12
Eutrigla gurnardus | ACGAGTTTGCGAG|13
I have tried using dplyr to group the rows by Species, then use grep to find and remove the sequences that are contained in sequences of other rows. Unfortunately, I am not being able to subset the sequences:
library(dplyr)
df2<-df%>%
group_by(Species) %>%
df[-(grep(sequence[1:5],sequence)),]
I am getting this error:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'pattern' in selecting a method for function 'grep': object of type 'closure' is not subsettable
>Solution :
df %>%
group_by(Species) %>%
mutate(inother = mapply(function(ptn, rownum) any(grepl(ptn, sequence[-rownum])),
sequence, row_number())) %>%
filter(size >= 5 & !inother) %>%
ungroup() %>%
select(-inother)
# # A tibble: 5 × 3
# Species sequence size
# <chr> <chr> <int>
# 1 Tilapia guineensis AAATGGAATAGAT 13
# 2 Tilapia guineensis TTATGGAGTAGA 12
# 3 Sprattus sprattus GTGCAATGCA 10
# 4 Eutrigla gurnardus ACTGACTGATCG 12
# 5 Eutrigla gurnardus ACGAGTTTGCGAG 13
The mapply
code is iterating over each sequence
(as ptn
), looking to see if it is matched in another string. Because it will always match itself, I exclude its own value from the RHS of the grp using sequence[-rownum]
, where rownum
is provided by row_number()
within each group. This might be expanded with more advanced patterns to make sure it doesn’t match equal (but different-row) sequence
values, if needed, by updating pattern to be something like sprintf("(.%s|%s.)", ptn, ptn)
to look for at least one leading or trailing extra character.