Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Detecting duplicates in pandas when a column contains lists

Is there a reasonable way to detect duplicates in a Pandas dataframe when a column contains lists or numpy nd arrays, like the example below? I know I could convert the lists into strings, but the act of converting back and forth feels… wrong. Plus, lists seem more legible and convenient given ~how I got here (online code) and where I’m going after.

import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)

# Traditional find duplicates
# df[df.duplicated(keep=False)]

# Avoiding pandas duplicated function (question 70596016 solution)
i = [hash(tuple(i.values())) for i in df.to_dict(orient="records")]
j = [i.count(k) > 1 for k in i]
df[j]

Both methods (the latter from this alternative find duplicates answer) result in

TypeError: unhashable type: ‘list’.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

They would work, of course, if the dataframe looked like this:

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "recipe": [
            "recipeC",
            "recipeC",
            "recipeD",
            "recipeE",
            "recipeF",
        ],
    }
)

Which made me wonder if something like integer encoding might be reasonable? It’s not that different from converting to/from strings, but at least it’s legible. Alternatively, suggestions for converting to a single string of ingredients per row directly from the starting dataframe in the code link above would be appreciated (i.e., avoiding lists altogether).

>Solution :

With map tuple

out = df[df.assign(rating = df['rating'].map(tuple)).duplicated(keep=False)]
Out[295]: 
   author        date                       rating
0  Jefe98  1423112400  [ingredA, ingredB, ingredC]
1  Jefe98  1423112400  [ingredA, ingredB, ingredC]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading