Remove all but one duplicated row, when one column is different for all rows in R

I have a large data set with a few duplicated rows scattered throughout. The duplicated rows are identical in every column but one, which makes it hard to use duplicated() or unique() directly. As the short example below shows, the rows are almost identical except for the first column, gene_ID, where the very end of the entry differs.

gene_ID Gene_Identifier Category Length
Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113515.7 Wdfy1 Spliced 4551
Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113514.7 Wdfy1 Spliced 4551
Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113513.7 Wdfy1 Spliced 4551
Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113512.7 Wdfy1 Spliced 4551

I would like to remove all rows except for the top/first entry.

I have tried:

test <- aggregate(gene_ID ~ ., df, toString)

^^ This merged more rows than I was expecting (~4,000 vs. ~50), so I am not sure it is correct. I am currently going through the result row by row to check whether it actually does what I want.

test2 <- df %>% 
  group_by_at(vars(-gene_ID)) %>%
  filter(n() > 1)

^^^ This doesn’t retain any of the duplicates; it removed them all.

test3 <- df %>% 
  group_by_at(vars(-gene_ID)) %>%
  duplicated(df)

^^^^ This throws an error: "Error: argument ‘incomparables != FALSE’ is not used (yet)"

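Solution:

We may need to check for duplicates on every column except the first (gene_ID) and keep only the rows that are not flagged: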

df[!duplicated(df[-1]), , drop = FALSE]

-output

                                                          gene_ID Gene_Identifier Category Length
1 Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113515.7           Wdfy1  Spliced   4551
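
To see why this works, it may help to look at the logical vector that duplicated() returns once the first column is dropped. The sketch below uses a small made-up data frame (toy, with hypothetical columns id, gene and len that are not part of the original data) rather than the real data set.

```r
# Toy data (hypothetical values): 'id' differs on every row,
# the remaining columns repeat for gene "A".
toy <- data.frame(
  id   = c("a1", "a2", "b1", "a3"),
  gene = c("A", "A", "B", "A"),
  len  = c(10L, 10L, 20L, 10L),
  stringsAsFactors = FALSE
)

# toy[-1] drops the first column, so rows are compared on gene and len only.
# duplicated() flags every repeat of a row seen earlier; the first stays FALSE.
dup <- duplicated(toy[-1])

# Negating the flags keeps the first row of each gene/len combination.
kept <- toy[!dup, , drop = FALSE]
print(dup)
print(kept)
```

Because duplicated() only flags later repeats, the first occurrence in the existing row order is the one that survives, which matches the "keep the top entry" requirement.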

Or with dplyr

library(dplyr)
df %>%
   filter(!duplicated(across(-gene_ID)))

-output

                                                          gene_ID Gene_Identifier Category Length
1 Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113515.7           Wdfy1  Spliced   4551
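
Two other dplyr spellings of the same idea are sketched here, assuming dplyr 1.0 or later. The grouped attempt in the question appears to fail because the pipe passes the grouped data as the first argument of duplicated(), so the extra df ends up in the incomparables argument. The toy data frame and its column names (id, gene, len) are made up for illustration.

```r
library(dplyr)

# Toy data (hypothetical values), same shape of problem as in the question.
toy <- data.frame(
  id   = c("a1", "a2", "b1", "a3"),
  gene = c("A", "A", "B", "A"),
  len  = c(10L, 10L, 20L, 10L),
  stringsAsFactors = FALSE
)

# Option 1: distinct() on the columns that define a duplicate;
# .keep_all = TRUE keeps the other columns from the first matching row.
by_distinct <- toy %>%
  distinct(gene, len, .keep_all = TRUE)

# Option 2: group on everything except 'id' and take the first row per group.
by_group <- toy %>%
  group_by(across(-id)) %>%
  slice(1) %>%
  ungroup()

print(by_distinct)
print(by_group)
```

One difference worth keeping in mind: distinct() preserves the original row order, while the grouped version returns rows ordered by the grouping columns.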

data

df <- structure(list(gene_ID = c("Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113515.7", 
"Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113514.7", 
"Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113513.7", 
"Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113512.7"
), Gene_Identifier = c("Wdfy1", "Wdfy1", "Wdfy1", "Wdfy1"), Category = c("Spliced", 
"Spliced", "Spliced", "Spliced"), Length = c(4551L, 4551L, 4551L, 
4551L)), class = "data.frame", row.names = c(NA, -4L))
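
Since the question mentions uncertainty about how many rows should disappear, a quick sanity check along these lines may be useful before committing to the result. It rebuilds the four example rows in a shorter form (same values as the structure() call above) and counts what would be dropped; the variable names is_dup, n_removed, dropped_ids and dedup are only illustrative.

```r
# Rebuild the four example rows from the post in compact form.
df <- data.frame(
  gene_ID = paste0(
    "Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST0000011351",
    5:2, ".7"
  ),
  Gene_Identifier = "Wdfy1",
  Category = "Spliced",
  Length = 4551L,
  stringsAsFactors = FALSE
)

# Flag rows that repeat an earlier row on every column except gene_ID.
is_dup <- duplicated(df[-1])

n_removed   <- sum(is_dup)          # how many rows would be dropped
dropped_ids <- df$gene_ID[is_dup]   # which gene_ID values would be dropped
dedup       <- df[!is_dup, , drop = FALSE]

print(n_removed)
print(dropped_ids)
print(nrow(dedup))
```

On the full data set, comparing n_removed with the expected number of duplicates gives a quicker check than inspecting the result row by row.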