Remove all but one duplicated row, when one column is different for all rows in R

December 7, 2022

I have a large data set with a few duplicate rows throughout. However, the duplicated rows are identical in every column but one, which makes it hard to use base R's duplicated() or unique() on the whole data frame. As the short example below shows, the rows are almost identical except for the first column, gene_ID, where the very end of the entry differs.

gene_ID	Gene_Identifier	Category	Length
Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113515.7	Wdfy1	Spliced	4551
Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113514.7	Wdfy1	Spliced	4551
Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113513.7	Wdfy1	Spliced	4551
Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113512.7	Wdfy1	Spliced	4551

I would like to remove all rows except for the top/first entry.

I have tried:

test <- aggregate(gene_ID ~ ., df, toString)

^^ This merged more rows than I was expecting (~4,000 vs. ~50), so I am not sure it is correct. I am currently going row by row to check whether it actually does what I want.

test2 <- df %>% 
  group_by_at(vars(-gene_ID)) %>%
  filter(n() > 1)

^^^ This doesn’t retain any of the duplicates; it removed them all.

test3 <- df %>% 
  group_by_at(vars(-gene_ID)) %>%
  duplicated(df)

^^^^ This fails with: "Error: argument ‘incomparables != FALSE’ is not used (yet)"

Solution:

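One option is to check for duplicates on every column except the first (gene_ID) and use the result to subset the original data frame: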

df[!duplicated(df[-1]), , drop = FALSE]

Output:

                                                          gene_ID Gene_Identifier Category Length
1 Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113515.7           Wdfy1  Spliced   4551
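
To see what the base R approach is doing, and to check how many rows it would drop before committing to it, it may help to look at the intermediate logical vector. The following is a minimal sketch that rebuilds the four example rows with paste0() (a shortcut for illustration only); the names ids, is_dup and deduped are hypothetical helpers, not part of the original answer.

```r
# Rebuild the example rows; only the end of the transcript ID differs
ids <- paste0("Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST0000011351",
              5:2, ".7")
df <- data.frame(gene_ID = ids,
                 Gene_Identifier = "Wdfy1",
                 Category = "Spliced",
                 Length = 4551L)

# Compare rows on every column except the first (gene_ID).
# TRUE marks a row that repeats an earlier row.
is_dup <- duplicated(df[-1])
print(is_dup)        # FALSE TRUE TRUE TRUE

# Number of rows that would be removed
print(sum(is_dup))   # 3

# Keep only the first occurrence of each combination
deduped <- df[!is_dup, , drop = FALSE]
```

On the real data, sum(is_dup) gives a quick sanity check against the expected number of duplicates. This also suggests why the third attempt in the question failed: the pipe already supplies the data as the first argument of duplicated(), so the extra df is matched to the second argument, incomparables, which the data frame method does not support.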

Or with dplyr:

library(dplyr)
df %>%
   filter(!duplicated(across(-gene_ID)))

Output:

                                                          gene_ID Gene_Identifier Category Length
1 Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113515.7           Wdfy1  Spliced   4551
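
Two other dplyr variants may be worth considering; this is a sketch, not a claim about which is best for the full data set. It names the non-ID columns explicitly, which avoids across() without a function (as far as I know, dplyr 1.1.0 and later prefer pick() for that use). The example rows are rebuilt with paste0() for brevity, and ids, dedup_distinct and dedup_grouped are hypothetical names.

```r
library(dplyr)

# Rebuild the example rows; only the end of the transcript ID differs
ids <- paste0("Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST0000011351",
              5:2, ".7")
df <- data.frame(gene_ID = ids,
                 Gene_Identifier = "Wdfy1",
                 Category = "Spliced",
                 Length = 4551L)

# distinct() on the non-ID columns; .keep_all = TRUE keeps the other
# columns (here gene_ID) from the first row of each combination
dedup_distinct <- df %>%
  distinct(Gene_Identifier, Category, Length, .keep_all = TRUE)

# Grouped variant: take the first row of each group
dedup_grouped <- df %>%
  group_by(Gene_Identifier, Category, Length) %>%
  slice(1) %>%
  ungroup()
```

The grouped variant is close to the second attempt in the question: filter(n() > 1) keeps every row of any group that has more than one member, whereas slice(1) keeps exactly one row per group. Note that the grouped result is ordered by the grouping columns, while distinct() keeps the original row order.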

Data:

df <- structure(list(gene_ID = c("Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113515.7", 
"Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113514.7", 
"Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113513.7", 
"Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113512.7"
), Gene_Identifier = c("Wdfy1", "Wdfy1", "Wdfy1", "Wdfy1"), Category = c("Spliced", 
"Spliced", "Spliced", "Spliced"), Length = c(4551L, 4551L, 4551L, 
4551L)), class = "data.frame", row.names = c(NA, -4L))