How to filter rows based on groupby results of a column value

I have the DataFrame below:

import pandas as pd

data = [
    ['A', '2022-07-01', 3], ['A', '2022-07-01', 4], ['A', '2022-07-01', 5],
    ['A', '2022-07-02', 5], ['A', '2022-07-03', 6], ['A', '2022-07-03', 2],
    ['B', '2022-07-01', 3], ['B', '2022-07-01', 4], ['B', '2022-07-02', 5],
    ['B', '2022-07-02', 6], ['B', '2022-07-03', 2],
    ['C', '2022-07-01', 3], ['C', '2022-07-02', 4], ['C', '2022-07-02', 5],
    ['C', '2022-07-03', 6], ['C', '2022-07-04', 2],
]
df = pd.DataFrame(data, columns=['category', 'date', 'Value'])


I would like to get all the rows from each 'category' that has more than one duplicated date entry.

Category A has three entries for date 2022-07-01 and two entries for 2022-07-03, so its count of unique duplicated dates (2022-07-01, 2022-07-03) is two, which is more than 1.

B has two entries for date 2022-07-01 and two entries for 2022-07-02, so its count of unique duplicated dates (2022-07-01, 2022-07-02) is two, which is more than 1.

Whereas

C has only one duplicated date (2022-07-02), so all rows of category C should be removed from the filtered output; only the rows of A and B are required.

Please advise. I tried groupby with transform(len) > 1, but it doesn't give me what I need.

> Solution:

First run value_counts on category + date, then count the number of duplicated (category, date) pairs per category with groupby.sum. (transform(len) > 1 on its own only flags rows belonging to a duplicated (category, date) group; it doesn't count how many distinct duplicated dates each category has.)

s = (
    df[['category', 'date']].value_counts()  # count of each (category, date) pair
    .gt(1)                                   # True where a date repeats within a category
    .groupby(level='category').sum()         # number of distinct duplicated dates per category
)
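
For the sample data above, this intermediate Series holds one duplicated-date count per category:

category
A    2
B    2
C    1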

out = list(s[s.gt(1)].index)  # categories with more than one duplicated date

Output: ['A', 'B']
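
Since the goal is all rows of A and B rather than just the category labels, you can finish by filtering the original frame with isin:

rows = df[df['category'].isin(out)]

An equivalent alternative (a sketch using duplicated and nunique, not part of the answer above): mark the rows whose (category, date) pair appears more than once, count the distinct duplicated dates per category, and keep the categories where that count exceeds 1:

dup = df[df.duplicated(['category', 'date'], keep=False)]  # rows whose (category, date) repeats
keep = dup.groupby('category')['date'].nunique().gt(1)     # categories with >1 distinct duplicated date
rows = df[df['category'].isin(keep[keep].index)]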
