Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas : remove duplicated rows using mode

Using a pandas.Dataframe, how should I remove duplicated (based on multiple columns) rows using the mode of another column ?

import pandas as pd

df = pd.DataFrame(
    data={
        "col_1": [0, 0, 0, 0, 1, 1, 1, 1],
        "col_2": [1, 1, 1, 1, 2, 2, 2, 2],
        "col_3": [5, 5, 0, 1, 8, 8, 0, 1],
        "another_column": [0, 0, 0, 0, 0, 0, 0, 0],
    }
)

# the following line shows the correct answer but doesn't return original dataframe
# with only the two unique rows
print(df.groupby(by=["col_1", "col_2"])["col_3"].agg(lambda x: x.mode()[0]))

>Solution :

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Use GroupBy.transform and compare original column col_3 in boolean indexing:

s = df.groupby(by=["col_1", "col_2"])["col_3"].transform(lambda x: x.mode()[0])
df1 = df[df['col_3'].eq(s)]
print (df1)
   col_1  col_2  col_3  another_column
0      0      1      5               0
1      0      1      5               0
4      1      2      8               0
5      1      2      8               0

If need first row per groups:

s = df.groupby(by=["col_1", "col_2"])["col_3"].transform(lambda x: x.mode()[0])
df1 = df[df['col_3'].eq(s)].drop_duplicates(["col_1", "col_2"])
print (df1)
   col_1  col_2  col_3  another_column
0      0      1      5               0
4      1      2      8               0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading