How to keep exactly two duplicated records in a data frame, after grouping the data according to one of the columns?

by MR

May 31, 2023

I have a data frame with IDs, species names and DNA sequences.
Some species in the data frame have repeated sequences, and for each species I want to keep exactly two copies of each duplicated sequence (so if Species X has 100 identical sequences, I want to keep just two of them). It doesn't matter which IDs the two kept copies come from: they can be chosen at random or be the first two found.

ID  |species  |sequence
---------------------------
001 |Species A|ATGTAGCTCAGC
002 |Species A|ATGTAGCTCAGC
003 |Species A|ATGTAGCTCAGC
004 |Species B|CGCGCGATATTA
005 |Species B|AAACGGCCAATC
006 |Species C|TGTCGGCTCGTC
007 |Species D|ATGTAGCTCAGC
008 |Species E|GCGCGGAGATTT
009 |Species E|GCGCGGAGATTT
010 |Species E|AACTCTATATAT
011 |Species F|ATCGTAGCCTTG
012 |Species F|GGGCGCGCGGCG
013 |Species F|ATCGTAGCCTTG
014 |Species F|ATCGTAGCCTTG
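
To make the example reproducible, the table above could be rebuilt roughly as follows. This is only a sketch: storing the IDs as character strings (to preserve the leading zeros) is an assumption, and the real data may well hold them as numbers.

```r
# Rebuild the example table as a data frame named df.
# Assumption: IDs are stored as zero-padded strings ("001", "002", ...).
df <- data.frame(
  ID = sprintf("%03d", 1:14),
  species = paste("Species",
                  c("A", "A", "A", "B", "B", "C", "D",
                    "E", "E", "E", "F", "F", "F", "F")),
  sequence = c("ATGTAGCTCAGC", "ATGTAGCTCAGC", "ATGTAGCTCAGC",
               "CGCGCGATATTA", "AAACGGCCAATC", "TGTCGGCTCGTC",
               "ATGTAGCTCAGC", "GCGCGGAGATTT", "GCGCGGAGATTT",
               "AACTCTATATAT", "ATCGTAGCCTTG", "GGGCGCGCGGCG",
               "ATCGTAGCCTTG", "ATCGTAGCCTTG"),
  stringsAsFactors = FALSE
)

# Flag rows that repeat an earlier species/sequence combination.
is_repeat <- duplicated(df[c("species", "sequence")])
```

Note that Species D shares its sequence with Species A but is not flagged, because the check is made on the species/sequence pair rather than on the sequence alone.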

I have used this code to keep only one copy of each repeated sequence per species and filter out all the other copies.
What is the best way to alter it so that it keeps two of the repeated sequences instead of just one?

library(dplyr)
df_filtered <- df %>%
  group_by(species, sequence) %>%
  slice(1) %>%
  ungroup()

My desired output would be this (although the repeated sequences that are kept could come from other IDs):

ID  |species  |sequence
---------------------------
001 |Species A|ATGTAGCTCAGC
003 |Species A|ATGTAGCTCAGC
004 |Species B|CGCGCGATATTA
005 |Species B|AAACGGCCAATC
006 |Species C|TGTCGGCTCGTC
007 |Species D|ATGTAGCTCAGC
008 |Species E|GCGCGGAGATTT
009 |Species E|GCGCGGAGATTT
010 |Species E|AACTCTATATAT
011 |Species F|ATCGTAGCCTTG
012 |Species F|GGGCGCGCGGCG
014 |Species F|ATCGTAGCCTTG

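Solution:

Use slice_head(n = 2) in place of slice(1). It keeps the first two rows of each species/sequence group, and groups with only one row are kept unchanged: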

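library(dplyr)
df_filtered <- df %>%
  # one group per distinct species/sequence combination
  group_by(species, sequence) %>%
  # keep at most the first two rows of each group
  slice_head(n = 2) %>%
  ungroup()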

df_filtered
# A tibble: 12 × 3
      ID species   sequence    
   <dbl> <chr>     <chr>       
 1     1 Species A ATGTAGCTCAGC
 2     2 Species A ATGTAGCTCAGC
 3     5 Species B AAACGGCCAATC
 4     4 Species B CGCGCGATATTA
 5     6 Species C TGTCGGCTCGTC
 6     7 Species D ATGTAGCTCAGC
 7    10 Species E AACTCTATATAT
 8     8 Species E GCGCGGAGATTT
 9     9 Species E GCGCGGAGATTT
10    11 Species F ATCGTAGCCTTG
11    13 Species F ATCGTAGCCTTG
12    12 Species F GGGCGCGCGGCG
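
Since the question also allows the two kept copies to be picked at random, a possible variant swaps slice_head() for dplyr's slice_sample(). The sketch below runs on a small stand-in data frame (a hypothetical subset of the question's data, with numeric IDs), and the set.seed() call is only there to make the random draw repeatable.

```r
library(dplyr)

# Hypothetical stand-in for the question's data frame.
df <- data.frame(
  ID = 1:6,
  species = c("Species A", "Species A", "Species A",
              "Species B", "Species B", "Species C"),
  sequence = c("ATGTAGCTCAGC", "ATGTAGCTCAGC", "ATGTAGCTCAGC",
               "CGCGCGATATTA", "AAACGGCCAATC", "TGTCGGCTCGTC")
)

set.seed(42)  # optional: makes the random selection reproducible

df_random <- df %>%
  group_by(species, sequence) %>%
  # draw up to two rows per group without replacement;
  # groups with fewer than two rows are returned whole
  slice_sample(n = 2) %>%
  ungroup()
```

With sampling without replacement (the default), groups smaller than n are returned in full, so unique sequences are still kept.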