I want to identify duplicate rows in a data frame based on two types of conditions:
1: all(multiple columns): all the elements in the multiple columns should be the same.
2: any(multiple columns): if at least one of the elements in the multiple columns is the same, the rows are considered replicates.
getRepplicate <- function(df, allCol = "", anyCol = "") {
  # if both condition 1 and condition 2 are met, the rows are considered replicates
}
For example:
df <- data.frame(
a = c(1, 1, 2, 3, 4, 1, 1, 3),
b = c(1, 2, 2, 3, 4, 1, 1, 3),
d = c("x", "y", "z", "x", "x", "y", "x", "x"),
e = c("x", "y", "z", "x", "x", "x", "z", "x")
)
> df
  a b d e
1 1 1 x x
2 1 2 y y
3 2 2 z z
4 3 3 x x
5 4 4 x x
6 1 1 y x
7 1 1 x z
8 3 3 x x
If I apply this function as df2 <- getRepplicate(df, allCol = c("a", "b"), anyCol = c("d", "e")), my expected result is:
> df2
  a b d e isReplicate
1 1 1 x x        TRUE
2 1 2 y y       FALSE
3 2 2 z z       FALSE
4 3 3 x x        TRUE
5 4 4 x x       FALSE
6 1 1 y x        TRUE
7 1 1 x z        TRUE
8 3 3 x x        TRUE
Thanks for your help.
>Solution :
Is it something like this?
- In allCond we check whether the number of unique values is one, so all values are the same.
- In anyCond we check whether the number of unique values is less than the number of values in the row, so at least two values are the same.
- If both conditions are TRUE, then the row is flagged as replicated.
library(dplyr)
df <- data.frame(
a = c(1, 1, 2, 3, 4, 1, 1, 3,1),
b = c(1, 2, 2, 3, 4, 1, 1, 3,2),
d = c("x", "y", "z", "x", "x", "y", "x", "x","x"),
e = c("x", "y", "z", "x", "x", "x", "z", "x","y")
)
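getRepplicate <- function(df, allCol, anyCol) {
  df %>%
    rowwise() %>%  # evaluate the conditions one row at a time
    mutate(
      # TRUE when every allCol value in this row is identical
      allCond = n_distinct(c_across(all_of(allCol))) == 1,
      # TRUE when at least two anyCol values in this row coincide
      anyCond = n_distinct(c_across(all_of(anyCol))) < length(c_across(all_of(anyCol)))
    ) %>%
    ungroup() %>%
    mutate(isReplicated = allCond & anyCond)
}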
getRepplicate(df, allCol = c("a", "b"), anyCol = c("d", "e"))
#> # A tibble: 9 × 7
#>       a     b d     e     allCond anyCond isReplicated
#>   <dbl> <dbl> <chr> <chr> <lgl>   <lgl>   <lgl>
#> 1     1     1 x     x     TRUE    TRUE    TRUE
#> 2     1     2 y     y     FALSE   TRUE    FALSE
#> 3     2     2 z     z     TRUE    TRUE    TRUE
#> 4     3     3 x     x     TRUE    TRUE    TRUE
#> 5     4     4 x     x     TRUE    TRUE    TRUE
#> 6     1     1 y     x     TRUE    FALSE   FALSE
#> 7     1     1 x     z     TRUE    FALSE   FALSE
#> 8     3     3 x     x     TRUE    TRUE    TRUE
#> 9     1     2 x     y     FALSE   FALSE   FALSE
Created on 2023-02-10 with reprex v2.0.2
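Note that the function above compares values within each row, so rows 6 and 7 come out FALSE while the question expects TRUE. If the goal is instead to compare rows against each other, one possible base R sketch is shown here. It assumes a row counts as a replicate when some other row has identical values in every allCol column and the same value in at least one anyCol column. The name getReplicateAcrossRows is hypothetical and not part of the original answer.

```r
# Hypothetical helper: flags rows that share all allCol values and
# at least one anyCol value with some other row (assumed reading of the question).
getReplicateAcrossRows <- function(df, allCol, anyCol) {
  hits <- lapply(anyCol, function(col) {
    sub <- df[c(allCol, col)]
    # TRUE for every row whose (allCol, col) combination occurs more than once
    duplicated(sub) | duplicated(sub, fromLast = TRUE)
  })
  # A row is a replicate if it matched on at least one of the anyCol columns
  df$isReplicate <- Reduce(`|`, hits)
  df
}

# The 8-row data frame from the question
df <- data.frame(
  a = c(1, 1, 2, 3, 4, 1, 1, 3),
  b = c(1, 2, 2, 3, 4, 1, 1, 3),
  d = c("x", "y", "z", "x", "x", "y", "x", "x"),
  e = c("x", "y", "z", "x", "x", "x", "z", "x")
)

df2 <- getReplicateAcrossRows(df, allCol = c("a", "b"), anyCol = c("d", "e"))
df2$isReplicate
# TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE
```

On the question's data this reproduces the expected isReplicate column. The sketch does not handle an empty anyCol, and NA values are treated as equal to each other by duplicated(), which may or may not be what is wanted.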