Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

dataframe concatenation retain first value for duplicates

I have 2 dataframes (say d1 and d2) both with columns c1 and c2. I’d like to concatenate them. However, for all values that occur in column c1 in both tables I’d like to retain only the row from d1.

Example:

df1 = pd.DataFrame({"Customer":["Alice", "Bob", "John"],
                    "Status":["closed","in-progress","closed"]})

df2 = pd.DataFrame({"Customer":["Alice", "Lara", "Santa"],
                    "Status":["in-progress","in-progress","closed"]})

desired_result = pd.DataFrame({"Customer":["Alice", "Bob", "John", "Lara", "Santa"],
                    "Status":["closed","in-progress","closed","in-progress","closed"]})

d1:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

  Customer       Status
0    Alice       closed
1      Bob  in-progress
2     John       closed

d2:

  Customer       Status
0    Alice  in-progress
1     Lara  in-progress
2    Santa       closed

desired_result:

  Customer       Status
0    Alice       closed
1      Bob  in-progress
2     John       closed
3     Lara  in-progress
4    Santa       closed

Notice Customer Alice. She occurs in both d1.Customer and d2.Customer, so only the corresponding row from d1 needs to be retained. All other customers in d1 and d2 are unique so their corresponding rows end up in the final table. How can I accomplish this?

>Solution :

What you’re asking to do is to concatenate the dataframes, and then drop the duplicates, keeping the first item for each duplicate. That is what the code below does:

pd.concat([df1, df2], axis=0, ignore_index=True).drop_duplicates(subset=['Customer'], keep='first')

A couple of notes:

  1. drop_duplicates by default keeps the first value, so keep='first' is unnecessary (just there to be illustrative)
  2. you could probably use _append, but that is for internal pandas use, so concat is better

You can read the documentation for these functions here:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

https://pandas.pydata.org/docs/reference/api/pandas.concat.html

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading