I have a large data set with a few duplicate rows scattered throughout. However, the duplicated rows are identical in every column but one, which makes it hard to use duplicated() or unique() directly. As the short example below shows, the rows are almost identical except for the first column, gene_ID, where only the very end of the entry (the transcript ID) differs.
| gene_ID | Gene_Identifier | Category | Length |
|---|---|---|---|
| Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113515.7 | Wdfy1 | Spliced | 4551 |
| Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113514.7 | Wdfy1 | Spliced | 4551 |
| Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113513.7 | Wdfy1 | Spliced | 4551 |
| Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113512.7 | Wdfy1 | Spliced | 4551 |
I would like to remove all rows except for the top/first entry.
I have tried:
test <- aggregate(gene_ID ~ ., df, toString)
^^ this merged more rows than I was expecting (~4,000 vs. ~50), so I am not sure it is correct. I am currently going through the result row by row to check whether it actually does what I want.
test2 <- df %>%
  group_by_at(vars(-gene_ID)) %>%
  filter(n() > 1)
^^^ this doesn't retain any of the duplicates; it removed them all.
test3 <- df %>%
  group_by_at(vars(-gene_ID)) %>%
  duplicated(df)
^^^^ this errors: "Error: argument ‘incomparables != FALSE’ is not used (yet)"
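Solution:
We can apply duplicated() to every column except the first (gene_ID) and keep only the rows that are not flagged, which retains the first row of each duplicate group: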
df[!duplicated(df[-1]), , drop = FALSE]
Output:
gene_ID Gene_Identifier Category Length
1 Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113515.7 Wdfy1 Spliced 4551
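To see why this works, it can help to look at the logical vector that duplicated() returns. The following is a minimal sketch on a shortened, made-up data frame; the name `toy`, the abbreviated gene_ID values, and the `GeneB` row are hypothetical and only there to show a second, non-duplicated group.

```r
# Hypothetical toy data: three near-duplicate rows plus one unrelated row
toy <- data.frame(
  gene_ID = c("Wdfy1_t1", "Wdfy1_t2", "Wdfy1_t3", "GeneB_t1"),
  Gene_Identifier = c("Wdfy1", "Wdfy1", "Wdfy1", "GeneB"),
  Category = c("Spliced", "Spliced", "Spliced", "Unspliced"),
  Length = c(4551L, 4551L, 4551L, 1200L),
  stringsAsFactors = FALSE
)

# duplicated() flags each row that repeats an earlier row, judged only on
# the columns passed in (here: everything except the first column)
is_dup <- duplicated(toy[-1])
print(is_dup)  # FALSE TRUE TRUE FALSE

# Negating the flags keeps the first occurrence of each group
deduped <- toy[!is_dup, , drop = FALSE]
print(deduped)
```

Because the first row of a group is never flagged, the top entry is the one that survives, which matches the behaviour asked for.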
Or with dplyr, using across() to select every column except gene_ID:
library(dplyr)
df %>%
  filter(!duplicated(across(-gene_ID)))
Output:
gene_ID Gene_Identifier Category Length
1 Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113515.7 Wdfy1 Spliced 4551
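Since the aggregate() attempt merged far more rows than expected, it may be worth counting the duplicates before dropping anything. This is only a sketch in base R on made-up data; `toy`, `key_cols`, `n_dropped`, and `in_group` are hypothetical names, and the `GeneB` row is invented for illustration.

```r
# Hypothetical toy data: three near-duplicate rows plus one unrelated row
toy <- data.frame(
  gene_ID = c("Wdfy1_t1", "Wdfy1_t2", "Wdfy1_t3", "GeneB_t1"),
  Gene_Identifier = c("Wdfy1", "Wdfy1", "Wdfy1", "GeneB"),
  Category = c("Spliced", "Spliced", "Spliced", "Unspliced"),
  Length = c(4551L, 4551L, 4551L, 1200L),
  stringsAsFactors = FALSE
)

# Columns used to decide whether two rows are "the same"
key_cols <- setdiff(names(toy), "gene_ID")

# Number of rows that would be removed (repeats after the first occurrence)
n_dropped <- sum(duplicated(toy[key_cols]))
print(n_dropped)  # 2

# Every row that belongs to a duplicate group, first occurrences included,
# found by scanning from both the top and the bottom
in_group <- duplicated(toy[key_cols]) |
  duplicated(toy[key_cols], fromLast = TRUE)
print(toy[in_group, , drop = FALSE])
```

If the count on the real data is much larger than the roughly 50 rows expected, inspecting the rows selected by `in_group` should show which groups are being treated as duplicates.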
Data:
df <- structure(list(gene_ID = c("Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113515.7",
"Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113514.7",
"Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113513.7",
"Wdfy1_chr1_79702262_79776143(-)_transcript=ENSMUST00000113512.7"
), Gene_Identifier = c("Wdfy1", "Wdfy1", "Wdfy1", "Wdfy1"), Category = c("Spliced",
"Spliced", "Spliced", "Spliced"), Length = c(4551L, 4551L, 4551L,
4551L)), class = "data.frame", row.names = c(NA, -4L))