dataframe concatenation retain first value for duplicates

August 11, 2023

I have 2 dataframes (say d1 and d2) both with columns c1 and c2. I’d like to concatenate them. However, for all values that occur in column c1 in both tables I’d like to retain only the row from d1.

Example:

df1 = pd.DataFrame({"Customer":["Alice", "Bob", "John"],
                    "Status":["closed","in-progress","closed"]})

df2 = pd.DataFrame({"Customer":["Alice", "Lara", "Santa"],
                    "Status":["in-progress","in-progress","closed"]})

desired_result = pd.DataFrame({"Customer":["Alice", "Bob", "John", "Lara", "Santa"],
                    "Status":["closed","in-progress","closed","in-progress","closed"]})

d1:

  Customer       Status
0    Alice       closed
1      Bob  in-progress
2     John       closed

d2:

  Customer       Status
0    Alice  in-progress
1     Lara  in-progress
2    Santa       closed

desired_result:

  Customer       Status
0    Alice       closed
1      Bob  in-progress
2     John       closed
3     Lara  in-progress
4    Santa       closed

Notice Customer Alice. She occurs in both d1.Customer and d2.Customer, so only the corresponding row from d1 needs to be retained. All other customers in d1 and d2 are unique so their corresponding rows end up in the final table. How can I accomplish this?

>Solution :

What you’re asking to do is to concatenate the dataframes, and then drop the duplicates, keeping the first item for each duplicate. That is what the code below does:

pd.concat([df1, df2], axis=0, ignore_index=True).drop_duplicates(subset=['Customer'], keep='first')

A couple of notes:

drop_duplicates by default keeps the first value, so keep='first' is unnecessary (just there to be illustrative)
you could probably use _append, but that is for internal pandas use, so concat is better

You can read the documentation for these functions here:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

https://pandas.pydata.org/docs/reference/api/pandas.concat.html