Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Inner_join in R with duplicates?

I have two tibbles with a list of names and their scores on one of two tests. I want to merge these dataframes so each row consists of a name, and their scores on test one and test two.

df1<-tibble(person=c("Joe", "John", "Albert", "Joe", "John"), test_1=c(90, 91, 92,93, 94 ))
df2<-tibble(person=c("Joe", "John", "Albert", "Joe"), test_2=c(94, 70, 93,50))

df3<-inner_join(df1, df2, by="person")

I would like to get this:

df4<-tibble(person=c("Joe", "Albert", "Joe"), test_1=c(90, 92, 93), test_2=c(94, 93, 94)

However, when I run the first code block, I get four different rows for person "Joe" even though there are only two observations of "Joe" in each dataframe. I only want to "match" two persons with the same name (like "Joe") if their test scores are within five points of each other.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Is there a way to do this? I would appreciate any help!

Note: I accidentally rollbacked this post when trying to edit, which removed someone’s answer so I am reposting it. Very sorry to the person whose answer was removed!

>Solution :

Given your expected output then my original post would work. Just filter after doing a many-to-many join:

library(dplyr)

inner_join(df1, df2, by = "person", relationship = 'many-to-many') |>
  filter(abs(test_1 - test_2) <= 5 | n() == 1, .by = person)

the .by is doing the filter by person. So then n() represents the number of rows within each by group. This additional condition is ensuring that rows with a one-to-one match will be returned no matter how far apart test_1 and test_2 are. For example, if Albert had a test_1 score of 92 and a test_2 score of 0 that row would not be filtered out because there is only one Albert.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading