Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Counting consecutive rowwise duplicates

I am currently working with data which has a structure like the following

ID Var1 Var2 Var3 Var4 Var5 Var6
1 1001 1001 1001 2002 2002 2002
2 2002 1001 1001 1001 NA 9999
3 1001 NA 1001 1001 9999 1234
4 3003 3003 3003 NA 2002 2002
5 NA NA NA 1001 1001 1001
6 3003 3003 3003 3003 3003 3003
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                 Var1 = c(1001, 2002, 1001, 3003,   NA, 3003),
                 Var2 = c(1001, 1001,   NA, 3003,   NA, 3003),
                 Var3 = c(1001, 1001, 1001, 3003,   NA, 3003),
                 Var4 = c(2002, 1001, 1001,   NA, 1001, 3003),
                 Var5 = c(2002,   NA, 9999, 2002, 1001, 3003),
                 Var6 = c(2002, 9999, 1234, 2002, 1001, 3003))

My goal is to count—for each row—how many times a unique value has been duplicated consecutively at least n times. Let’s say n = 3. So, if "1001" repeats rowwise in at least 3 adjacent cells (could be more), then that counts as one duplicate. If both "1001" and "2002" have been repeated in a row at least three times each consecutively, then that would be two duplicates, etc. The following would be the intended result:

ID Var1 Var2 Var3 Var4 Var5 Var6 Num_3ConsecutiveDuplications
1 1001 1001 1001 2002 2002 2002 2
2 2002 1001 1001 1001 NA 9999 1
3 1001 NA 1001 1001 9999 1234 0
4 3003 3003 3003 NA 2002 2002 1
5 NA NA NA 1001 1001 1001 1
6 3003 3003 3003 3003 3003 3003 1

Since in the actual data there are a significant range of possible values each variable can take on, and that there are over 40 variables, counting this by conditioning on each possible value and range of columns seems infeasible/inefficient. Any suggestions would be greatly appreciated.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

You can use rle():

library(dplyr)

df %>%
  rowwise() %>%
  mutate(x = sum(rle(across(-ID))$lengths >= 3))

# A tibble: 6 x 8
# Rowwise: 
     ID  Var1  Var2  Var3  Var4  Var5  Var6     x
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1     1  1001  1001  1001  2002  2002  2002     2
2     2  2002  1001  1001  1001    NA  9999     1
3     3  1001    NA  1001  1001  9999  1234     0
4     4  3003  3003  3003    NA  2002  2002     1
5     5    NA    NA    NA  1001  1001  1001     1
6     6  3003  3003  3003  3003  3003  3003     1
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading