R: how to identify duplicate rows using all(multiple columns) and any(multiple columns)?

I want to identify duplicate rows in a data frame based on two types of conditions:

1: all(multiple columns): the rows must have the same values in every one of the given columns.

2: any(multiple columns): the rows are considered replicates if they share the same value in at least one of the given columns.

getRepplicate <- function(df, allCol = "", anyCol = "") {
  # rows that meet both condition 1 and condition 2 are considered replicated rows
}

For example:

df <- data.frame(
  a = c(1, 1, 2, 3, 4, 1, 1, 3),
  b = c(1, 2, 2, 3, 4, 1, 1, 3), 
  d = c("x", "y", "z", "x", "x", "y", "x", "x"),
  e = c("x", "y", "z", "x", "x", "x", "z", "x")
)
> df
  a b d e
1 1 1 x x
2 1 2 y y
3 2 2 z z
4 3 3 x x
5 4 4 x x
6 1 1 y x
7 1 1 x z
8 3 3 x x

If I call df2 <- getRepplicate(df, allCol = c("a", "b"), anyCol = c("d", "e")), the expected result is:

> df2
  a b d e isReplicate
1 1 1 x x TRUE
2 1 2 y y FALSE
3 2 2 z z FALSE
4 3 3 x x TRUE
5 4 4 x x FALSE
6 1 1 y x TRUE
7 1 1 x z TRUE
8 3 3 x x TRUE

Thanks for your help.

Solution:

Is something like this what you are after?

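  • In allCond we check, within each row, whether the number of unique values across the allCol columns is one, so all of those values are the same
  • In anyCond we check, within each row, whether the number of unique values across the anyCol columns is smaller than the number of values, so at least two of those values are the same
  • If both conditions are TRUE, the row is flagged as replicated.
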
library(dplyr)

df <- data.frame(
  a = c(1, 1, 2, 3, 4, 1, 1, 3,1),
  b = c(1, 2, 2, 3, 4, 1, 1, 3,2), 
  d = c("x", "y", "z", "x", "x", "y", "x", "x","x"),
  e = c("x", "y", "z", "x", "x", "x", "z", "x","y")
)

getRepplicate <- function(df, allCol, anyCol) {
  df %>%
    rowwise() %>%
    mutate(
      # TRUE when every allCol value in this row is identical
      allCond = n_distinct(c_across(all_of(allCol))) == 1,
      # TRUE when at least two anyCol values in this row coincide
      anyCond = n_distinct(c_across(all_of(anyCol))) < length(c_across(all_of(anyCol)))
    ) %>%
    ungroup() %>%
    mutate(isReplicated = allCond & anyCond)
}
getRepplicate(df, allCol = c("a", "b"), anyCol = c("d", "e"))
#> # A tibble: 9 × 7
#>       a     b d     e     allCond anyCond isReplicated
#>   <dbl> <dbl> <chr> <chr> <lgl>   <lgl>   <lgl>       
#> 1     1     1 x     x     TRUE    TRUE    TRUE        
#> 2     1     2 y     y     FALSE   TRUE    FALSE
#> 3     2     2 z     z     TRUE    TRUE    TRUE        
#> 4     3     3 x     x     TRUE    TRUE    TRUE        
#> 5     4     4 x     x     TRUE    TRUE    TRUE        
#> 6     1     1 y     x     TRUE    FALSE   FALSE
#> 7     1     1 x     z     TRUE    FALSE   FALSE
#> 8     3     3 x     x     TRUE    TRUE    TRUE        
#> 9     1     2 x     y     FALSE   FALSE   FALSE

Created on 2023-02-10 with reprex v2.0.2
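
Note that the function above compares values within each row, so its output differs from the expected result in the question (rows 3 and 5 are flagged, rows 6 and 7 are not). If the comparison is meant to run across rows, a minimal base R sketch could look like the following. It assumes that the anyCol values are compared column by column (d with d, e with e) and that a row is a replicate when some other row agrees with it on every allCol column and on at least one anyCol column. The name flagReplicates is hypothetical and not part of the original post.

```r
# Example data from the question
df <- data.frame(
  a = c(1, 1, 2, 3, 4, 1, 1, 3),
  b = c(1, 2, 2, 3, 4, 1, 1, 3),
  d = c("x", "y", "z", "x", "x", "y", "x", "x"),
  e = c("x", "y", "z", "x", "x", "x", "z", "x")
)

# Hypothetical helper: flags rows that have a partner row matching on
# all of allCol and on at least one column of anyCol (assumed semantics).
flagReplicates <- function(df, allCol, anyCol) {
  isRep <- logical(nrow(df))
  for (col in anyCol) {
    # Rows sharing allCol plus this one anyCol column form a duplicate group
    key <- df[c(allCol, col)]
    isRep <- isRep | duplicated(key) | duplicated(key, fromLast = TRUE)
  }
  df$isReplicate <- isRep
  df
}

df2 <- flagReplicates(df, allCol = c("a", "b"), anyCol = c("d", "e"))
df2$isReplicate
#> [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
```

Calling duplicated() once forward and once with fromLast = TRUE marks every member of a duplicate group rather than only the later occurrences. On the question's data this reproduces the isReplicate column shown in the expected result.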
