How to remove duplicated rows based on values in a column

Good morning guys. My problem is removing duplicate rows from a dataframe that are caused by different values in one of the columns.

The base dataframe looks like this below:

[screenshot: the base dataframe]


As you can see, the values in the Name and Id columns are duplicated depending on Category.
My goal is to remove those duplicates while keeping the category information.

I would like to end up with exactly this view:

[screenshot: the desired result]

I have tried the get_dummies method from the pandas library, but I ran into issues.

import pandas as pd

dummies = pd.get_dummies(df[['Category']], drop_first=True)
df = pd.concat([df.drop(['Category'], axis=1), dummies], axis=1)

Using the code above, I get this result:

[screenshot: the result after get_dummies]

The result is basically still the same as the base dataframe.

Do you guys have any idea how to deal with this?

> Solution:

It depends on what you need. If duplicate rows per Name and Id are possible, it is necessary to aggregate with max:

df = (pd.get_dummies(df, columns=['Category'])
        .groupby(['Name', 'Id'], as_index=False)
        .max())
print(df)
  Name  Id  Category_A  Category_B  Category_C
0  ABC   1           1           0           0
1  ABC   2           0           1           0
2  DEF   2           1           0           0
3  GHI   3           0           0           1
4  JKL   4           0           0           1
5  MNO   5           1           0           0
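For reference, the solution above can be checked on a small hypothetical frame. The values below are an assumption, reconstructed from the printed output, since the question's screenshots are unavailable; `dtype=int` is added so the dummy columns stay as 0/1 (newer pandas defaults to boolean dummies):

```python
import pandas as pd

# Hypothetical input (an assumption, reconstructed from the printed output)
df = pd.DataFrame({
    'Name': ['ABC', 'ABC', 'ABC', 'DEF', 'GHI', 'JKL', 'MNO'],
    'Id':   [1, 2, 2, 2, 3, 4, 5],
    'Category': ['A', 'B', 'B', 'A', 'C', 'C', 'A'],
})

# dtype=int keeps the dummy columns as 0/1 (newer pandas defaults to bool)
out = (pd.get_dummies(df, columns=['Category'], dtype=int)
         .groupby(['Name', 'Id'], as_index=False)
         .max())

# The two duplicated (ABC, 2) rows collapse into one; each Name/Id pair is unique
print(out)
```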

If you need to aggregate per Id, taking the last value for non-numeric columns, use:

import numpy as np

f = lambda x: x.max() if np.issubdtype(x.dtype, np.number) else x.iat[-1]
df = (pd.get_dummies(df, columns=['Category'])
        .groupby('Id', as_index=False)
        .agg(f))
print(df)
   Id Name  Category_A  Category_B  Category_C
0   1  ABC           1           0           0
1   2  DEF           1           1           0
2   3  GHI           0           0           1
3   4  JKL           0           0           1
4   5  MNO           1           0           0
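This variant can be sanity-checked on the same hypothetical data (again an assumption). Note that `dtype=int` matters here: `np.issubdtype(..., np.number)` is False for the boolean dummies that newer pandas produces by default, which would route the dummy columns to the `iat[-1]` branch instead of `max`:

```python
import numpy as np
import pandas as pd

# Hypothetical input (an assumption, reconstructed from the printed output)
df = pd.DataFrame({
    'Name': ['ABC', 'ABC', 'ABC', 'DEF', 'GHI', 'JKL', 'MNO'],
    'Id':   [1, 2, 2, 2, 3, 4, 5],
    'Category': ['A', 'B', 'B', 'A', 'C', 'C', 'A'],
})

# max for numeric columns, last value otherwise
f = lambda x: x.max() if np.issubdtype(x.dtype, np.number) else x.iat[-1]

# dtype=int keeps the dummies numeric, so the np.issubdtype branch applies
out = (pd.get_dummies(df, columns=['Category'], dtype=int)
         .groupby('Id', as_index=False)
         .agg(f))

# Id 2 keeps the last Name (DEF) but the union of its categories (A and B)
print(out)
```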

In the second solution it is possible to specify columns for the aggregations:

df = pd.get_dummies(df, columns=['Category'])

d = dict.fromkeys(df.columns, 'max')
d['Name'] = 'last'
print(d)
{'Name': 'last', 'Id': 'max', 'Category_A': 'max', 'Category_B': 'max', 'Category_C': 'max'}

df = df.groupby('Id', as_index=False).agg(d)
print(df)
  Name  Id  Category_A  Category_B  Category_C
0  ABC   1           1           0           0
1  DEF   2           1           1           0
2  GHI   3           0           0           1
3  JKL   4           0           0           1
4  MNO   5           1           0           0
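The mapping can also be built programmatically. A minimal sketch on the same hypothetical data (an assumption), which leaves the grouping key Id out of the mapping since aggregating the key column as well is unnecessary and can raise in some pandas versions:

```python
import pandas as pd

# Hypothetical input (an assumption, reconstructed from the printed outputs)
df = pd.DataFrame({
    'Name': ['ABC', 'ABC', 'ABC', 'DEF', 'GHI', 'JKL', 'MNO'],
    'Id':   [1, 2, 2, 2, 3, 4, 5],
    'Category': ['A', 'B', 'B', 'A', 'C', 'C', 'A'],
})

dummies = pd.get_dummies(df, columns=['Category'], dtype=int)

# 'max' for every column except the grouping key, 'last' for Name
agg_map = {c: 'max' for c in dummies.columns if c != 'Id'}
agg_map['Name'] = 'last'

out = dummies.groupby('Id', as_index=False).agg(agg_map)
print(out)
```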