I have a data frame with IDs, species names and DNA sequences.
Some species in the df have repeated sequences, and for each species I want to keep exactly two copies of each duplicated sequence (so if Species X has 100 identical sequences, I want to keep just two of them). It doesn’t matter which IDs the two kept sequences come from; they can be random or simply the first instances found.
ID | species |sequence
---------------------------
001 |Species A|ATGTAGCTCAGC
002 |Species A|ATGTAGCTCAGC
003 |Species A|ATGTAGCTCAGC
004 |Species B|CGCGCGATATTA
005 |Species B|AAACGGCCAATC
006 |Species C|TGTCGGCTCGTC
007 |Species D|ATGTAGCTCAGC
008 |Species E|GCGCGGAGATTT
009 |Species E|GCGCGGAGATTT
010 |Species E|AACTCTATATAT
011 |Species F|ATCGTAGCCTTG
012 |Species F|GGGCGCGCGGCG
013 |Species F|ATCGTAGCCTTG
014 |Species F|ATCGTAGCCTTG
I have used this code to keep only one of the repeated sequences for each species and filter out all the others.
What is the best way to alter it so that it keeps two of the repeated sequences instead of just one?
library(dplyr)
df_filtered <- df %>%
  group_by(species, sequence) %>%
  slice(1) %>%
  ungroup()
My desired output would be this (although the repeated sequences that are kept could be different ones):
ID | species |sequence
---------------------------
001 |Species A|ATGTAGCTCAGC
003 |Species A|ATGTAGCTCAGC
004 |Species B|CGCGCGATATTA
005 |Species B|AAACGGCCAATC
006 |Species C|TGTCGGCTCGTC
007 |Species D|ATGTAGCTCAGC
008 |Species E|GCGCGGAGATTT
009 |Species E|GCGCGGAGATTT
010 |Species E|AACTCTATATAT
011 |Species F|ATCGTAGCCTTG
012 |Species F|GGGCGCGCGGCG
014 |Species F|ATCGTAGCCTTG
>Solution :
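Use slice_head(n = 2), which keeps the first two rows of each species/sequence group (groups with fewer than two rows are kept whole):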
library(dplyr)
df_filtered <- df %>%
  group_by(species, sequence) %>%
  slice_head(n = 2) %>%
  ungroup()
df_filtered
# A tibble: 12 × 3
      ID species   sequence
   <dbl> <chr>     <chr>
 1     1 Species A ATGTAGCTCAGC
 2     2 Species A ATGTAGCTCAGC
 3     5 Species B AAACGGCCAATC
 4     4 Species B CGCGCGATATTA
 5     6 Species C TGTCGGCTCGTC
 6     7 Species D ATGTAGCTCAGC
 7    10 Species E AACTCTATATAT
 8     8 Species E GCGCGGAGATTT
 9     9 Species E GCGCGGAGATTT
10    11 Species F ATCGTAGCCTTG
11    13 Species F ATCGTAGCCTTG
12    12 Species F GGGCGCGCGGCG
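If the two kept rows should be chosen at random rather than taken from the top of each group, slice_sample(n = 2) can be swapped in for slice_head(n = 2). The sketch below rebuilds the example data so that it runs on its own. It assumes dplyr 1.0.0 or later, and storing ID as character (to keep the leading zeros) is my assumption, not something stated in the question. Like slice_head(), slice_sample() returns the whole group when it has fewer than n rows.

```r
library(dplyr)

# Example data from the question.
# Assumption: ID is stored as character so the leading zeros survive.
df <- data.frame(
  ID = sprintf("%03d", 1:14),
  species = paste("Species",
                  c("A", "A", "A", "B", "B", "C", "D",
                    "E", "E", "E", "F", "F", "F", "F")),
  sequence = c("ATGTAGCTCAGC", "ATGTAGCTCAGC", "ATGTAGCTCAGC",
               "CGCGCGATATTA", "AAACGGCCAATC", "TGTCGGCTCGTC",
               "ATGTAGCTCAGC", "GCGCGGAGATTT", "GCGCGGAGATTT",
               "AACTCTATATAT", "ATCGTAGCCTTG", "GGGCGCGCGGCG",
               "ATCGTAGCCTTG", "ATCGTAGCCTTG")
)

# Fix the seed only so the random draw is repeatable; drop it for a fresh draw.
set.seed(42)

df_random <- df %>%
  group_by(species, sequence) %>%
  slice_sample(n = 2) %>%   # up to two random rows per species/sequence pair
  ungroup()

df_random
```

The result has the same 12 species/sequence rows as the slice_head() version; only the IDs picked for Species A and for the ATCGTAGCCTTG sequence of Species F can differ between runs, since those are the only groups with more than two rows.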