My sample df looks like this
id year success
1 2000 N
1 2001 N
1 2002 Y
1 2003 N
1 2004 N
2 2000 N
2 2001 N
2 2002 N
3 2000 N
3 2001 N
3 2002 Y
....
Here, we can see that the id==1 and id==3 has both success==Y and success==N but id==2 only has success==N
What I want to do?
- I want to only have
rowsin which if we find the firstsuccess==Y, we drop the remaining column for that group, egid==1
This is how the new df should look.
id year success
1 2000 N
1 2001 N
1 2002 Y
2 2000 N
2 2001 N
2 2002 N
3 2000 N
3 2001 N
3 2002 Y
....
Here, in the above df we removed the extra rows after we encountered success==Y. Since, id==2 does not have success==Y, we did not remove any rows and in id==2, the last row is the Y so no rows were removed.
What I did?
- I tried to group the
idbut then I want all the results even though I have duplicate ids. So this did not work.
Could someone please help me achieve this result?
>Solution :
First check success column against value Y, which gives you True value where success=Y; To mark all values after first Y as drop, we can further use cummax on this condition; Finally use the negated condition to filter the data frame:
df[~df.success.eq('Y').groupby(df.id).apply(lambda g: g.cummax().shift(fill_value=False))]
id year success
0 1 2000 N
1 1 2001 N
2 1 2002 Y
5 2 2000 N
6 2 2001 N
7 2 2002 N
8 3 2000 N
9 3 2001 N
10 3 2002 Y