How to keep exactly two duplicated records in a data frame, after grouping the data according to one of the columns?

I have a data frame with IDs, species names and DNA sequences.
Some species in the data frame have repeated sequences, and for each species I want to keep exactly two of those duplicated sequences (so if Species X has 100 identical sequences, I want to keep just two of them). It doesn't matter which IDs the two kept sequences come from: they can be random or simply the first instances found.

ID  | species  |sequence
---------------------------
001 |Species A|ATGTAGCTCAGC
002 |Species A|ATGTAGCTCAGC
003 |Species A|ATGTAGCTCAGC
004 |Species B|CGCGCGATATTA
005 |Species B|AAACGGCCAATC
006 |Species C|TGTCGGCTCGTC
007 |Species D|ATGTAGCTCAGC
008 |Species E|GCGCGGAGATTT
009 |Species E|GCGCGGAGATTT
010 |Species E|AACTCTATATAT
011 |Species F|ATCGTAGCCTTG
012 |Species F|GGGCGCGCGGCG
013 |Species F|ATCGTAGCCTTG
014 |Species F|ATCGTAGCCTTG

I have used this code to keep only one of the repeated sequences for each species and filter out all the others.
What is the best way to alter it so that it keeps two random repeated sequences instead of just one?

library(dplyr)
df_filtered <- df %>%
  group_by(species, sequence) %>%
  slice(1) %>%
  ungroup()

My desired output would be this (although the repeated sequences that are kept could be different ones):

ID  | species  |sequence
---------------------------
001 |Species A|ATGTAGCTCAGC
003 |Species A|ATGTAGCTCAGC
004 |Species B|CGCGCGATATTA
005 |Species B|AAACGGCCAATC
006 |Species C|TGTCGGCTCGTC
007 |Species D|ATGTAGCTCAGC
008 |Species E|GCGCGGAGATTT
009 |Species E|GCGCGGAGATTT
010 |Species E|AACTCTATATAT
011 |Species F|ATCGTAGCCTTG
012 |Species F|GGGCGCGCGGCG
014 |Species F|ATCGTAGCCTTG

Solution:

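Use slice_head(n = 2) in place of slice(1). It keeps the first two rows of each species/sequence group, and groups with fewer than two rows are kept whole: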

library(dplyr)
df_filtered <- df %>%
  group_by(species, sequence) %>%
  slice_head(n=2) %>%
  ungroup()

df_filtered
# A tibble: 12 × 3
      ID species   sequence    
   <dbl> <chr>     <chr>       
 1     1 Species A ATGTAGCTCAGC
 2     2 Species A ATGTAGCTCAGC
 3     5 Species B AAACGGCCAATC
 4     4 Species B CGCGCGATATTA
 5     6 Species C TGTCGGCTCGTC
 6     7 Species D ATGTAGCTCAGC
 7    10 Species E AACTCTATATAT
 8     8 Species E GCGCGGAGATTT
 9     9 Species E GCGCGGAGATTT
10    11 Species F ATCGTAGCCTTG
11    13 Species F ATCGTAGCCTTG
12    12 Species F GGGCGCGCGGCG
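
The question also allows the two kept rows to be picked at random rather than taken from the top of each group. A minimal sketch of that variant, assuming dplyr 1.0.0 or later (where slice_sample() is available), is shown here. The data frame is rebuilt from the table in the question, with ID stored as character so the leading zeros survive. The seed value and the name df_random are illustrative choices, not part of the original answer.

```r
library(dplyr)

# Rebuild the example data from the question (ID kept as character).
df <- tibble(
  ID = sprintf("%03d", 1:14),
  species = paste("Species", c("A", "A", "A", "B", "B", "C", "D",
                               "E", "E", "E", "F", "F", "F", "F")),
  sequence = c("ATGTAGCTCAGC", "ATGTAGCTCAGC", "ATGTAGCTCAGC",
               "CGCGCGATATTA", "AAACGGCCAATC", "TGTCGGCTCGTC",
               "ATGTAGCTCAGC", "GCGCGGAGATTT", "GCGCGGAGATTT",
               "AACTCTATATAT", "ATCGTAGCCTTG", "GGGCGCGCGGCG",
               "ATCGTAGCCTTG", "ATCGTAGCCTTG")
)

set.seed(42)  # arbitrary seed, only so the random draw is repeatable

df_random <- df %>%
  group_by(species, sequence) %>%
  slice_sample(n = 2) %>%  # groups with fewer than 2 rows are kept whole
  ungroup() %>%
  arrange(ID)              # restore the original row order

# Sanity check: list any species/sequence pair that appears more than twice
# (this should return zero rows).
df_random %>%
  count(species, sequence) %>%
  filter(n > 2)
```

Which of the three Species A rows and which of the three identical Species F rows survive depends on the seed, but no species/sequence pair is kept more than twice.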