Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Subset a dataset based on the appearance of values in a column in R

I have a dataset (ds) with two columns. There are either one or two entries with the same number in "match". "status" is a binary variable. There are pairs, for example, the value 12 in match appears twice, one for where status is 1 and 0 for the other. Yet, there are also observations in match who do not have a partner, in this dataset it would be 3,8,33,17 who have no partner.

match     status
12          1 
3           1
5           0
8           1
33          0
5           1
12          0
17          0

What I want to do is to create a new dataset that only contains observations of pairs (thus if a value appears twice). In my example, it would be

match     status
12          1
12          0
5           0
5           1

The status variable in the final dataset would be 50/50 because a value in match (for example 12) has an observation where status = 0 and one where status = 1.
The actual dataset I´m working with has over 50k observations so I cannot just search and filter by each number. What I tried is:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

numbers <- table(ds$match)
numbers <- as.data.frame(numbers)
numbers <- numbers[numbers$Freq == 2,]
numbers <- numbers$Var1
ds$keep <- ifelse(numbers %in% ds$match, 1, 0) 

Here I get the error "replacement has 23005 rows, data has 39021" If I could get around this error I think I could just run

ds <- filter(ds, ds$keep == 1)

to get the dataset that I want. This was my most promising approach. I tried a few other things but it always came done to the fact that the status variable wasn´t 50/50 so I couldn´t manage to exclude all observations without a pair. Does someone have an idea how I could fix my code or is there a solution that would be quicker/more smooth? Thanks for any help in advance!

>Solution :

library(dplyr)

ds %>% group_by(match) %>% filter(n()>1) %>% arrange(match,status)

  match status
  <dbl>  <dbl>
1     5      0
2     5      1
3    12      0
4    12      1

You can also do something like this:

ds <- ds[order(ds$match),]
id = rle(ds$match)
ds[ds$match %in% id$values[id$lengths>1],]

  match status
  <dbl>  <dbl>
1     5      0
2     5      1
3    12      1
4    12      0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading