Removing duplicates from each item when your pandas DataFrame has a column of lists

My DataFrame has a column of lists and looks like this:

     id  source
0    3   [nan,nan,nan]
1    5   [nan,foo,foo,nan,foo]
2    7   [ham,nan,ham,nan]
3    9   [foo,foo]

I need to remove duplicates from each list, so I am looking for something like this:

     id  source
0    3   [nan]
1    5   [nan,foo]
2    7   [ham,nan]
3    9   [foo]

I tried the following code, which didn't work. What do you recommend?


df['source'] = list(set(df['source']))

Solution:

You can .explode the source column, .drop_duplicates on the id/source pairs, then .groupby the id back into lists:

df = (
    df.explode("source")
    .drop_duplicates(["id", "source"])
    .groupby("id", as_index=False)
    .agg(list)
)
print(df)

Prints:

   id      source
0   3       [nan]
1   5  [nan, foo]
2   7  [ham, nan]
3   9       [foo]
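As an aside, the original attempt fails before it even gets to deduplicating: iterating df["source"] yields the list objects themselves, and set() must hash each element, but Python lists are unhashable. A minimal sketch reproducing the error (the sample data here is an assumption, using np.nan for the nan values shown in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [3, 5],
    "source": [[np.nan, np.nan], [np.nan, "foo", "foo"]],
})

try:
    # set() iterates the Series and tries to hash each element,
    # but each element is a list, which is unhashable
    df["source"] = list(set(df["source"]))
except TypeError as e:
    print(e)  # unhashable type: 'list'
```

Even applied row by row, set() would also discard the original ordering of the values, which the explode/drop_duplicates approach preserves.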

Or convert each list to a pd.Series, drop duplicates, and convert back to a list:

df["source"] = df["source"].apply(lambda x: [*pd.Series(x).drop_duplicates()])
print(df)
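A plain-Python variant (an addition beyond the original answer, under the assumption that the nan entries are the same np.nan object, so they hash consistently) is dict.fromkeys, which deduplicates while preserving first-seen order without building a Series per row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [3, 5, 7, 9],
    "source": [
        [np.nan, np.nan, np.nan],
        [np.nan, "foo", "foo", np.nan, "foo"],
        ["ham", np.nan, "ham", np.nan],
        ["foo", "foo"],
    ],
})

# dict keys are unique and keep insertion order, so this drops
# duplicates from each list while preserving the original ordering
df["source"] = df["source"].apply(lambda x: list(dict.fromkeys(x)))
print(df)
```

This avoids the per-row Series construction, which can matter on large frames, though for mixed unhashable elements the pd.Series approach above is the safer choice.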