Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to outer merge 3 or more datasets based on an id and compare them using one column?

Let’s say we have three datasets with three different years:

ID Text Year
101 abc 1990
102 abd 1990
103 a 1990

And the second dataset that could (or not) contain the IDs from the first year:

ID Text Year
104 bc 1991
101 abc 1991
102 abe 1991

And the third dataset:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

ID Text Year
104 bc 1992
105 a 1992

I want somehow to merge these three dataframes + add a new column to see if the text between the common IDs (and uncommon IDs) from those consecutive years is the same or not + also update the year if we notice there’s similarity between text from Year1 and Year2.

Here is the result I want:

ID Text Year Similar
101 abc 1991 true
102 abe 1991 false
103 a 1990 false
104 bc 1992 true
105 a 1992 false

So I also want to include those new IDs corresponding to the new years, but also to keep the IDs of the previous year, but without having a match in terms of ID + that similar column.
The merge should not be inner (because we also want to integrate those IDs that are not present in the second/third dataframe) and the year should be updated if the Similar column is true (let’s say if text from ID 104, year 1991 is the same as the text from ID 104, year 1992).

Thanks

>Solution :

You can use a groupby.agg with lambda g: g.nunique()!=len(g) (if there at least one non duplicate) or lambda g: g.duplicated(keep=False).all() for the column Similar:

dfs = [df1, df2, df3]

out = (pd.concat(dfs)
         .groupby('ID', as_index=False)
         .agg(**{'Text': ('Text', 'last'),
                 'Year': ('Year', 'last'),
                 'Similar': ('Text', lambda g: g.duplicated(keep=False).all())
                })
      )

Output:

    ID Text  Year  Similar
0  101  abc  1991     True
1  102  abe  1991    False
2  103    a  1990    False
3  104   bc  1992     True
4  105    a  1992    False
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading