Using Pandas to drop the extra rows after satisfying a condition

July 4, 2022

My sample df looks like this

id   year    success
1    2000      N
1    2001      N
1    2002      Y
1    2003      N
1    2004      N
2    2000      N
2    2001      N
2    2002      N
3    2000      N
3    2001      N
3    2002      Y
....

Here, we can see that the id==1 and id==3 has both success==Y and success==N but id==2 only has success==N

What I want to do?

I want to only have rows in which if we find the first success==Y, we drop the remaining column for that group, eg id==1

This is how the new df should look.

id   year    success
1    2000      N
1    2001      N
1    2002      Y
2    2000      N
2    2001      N
2    2002      N
3    2000      N
3    2001      N
3    2002      Y
....

Here, in the above df we removed the extra rows after we encountered success==Y. Since, id==2 does not have success==Y, we did not remove any rows and in id==2, the last row is the Y so no rows were removed.

What I did?

I tried to group the id but then I want all the results even though I have duplicate ids. So this did not work.

Could someone please help me achieve this result?

>Solution :

First check success column against value Y, which gives you True value where success=Y; To mark all values after first Y as drop, we can further use cummax on this condition; Finally use the negated condition to filter the data frame:

df[~df.success.eq('Y').groupby(df.id).apply(lambda g: g.cummax().shift(fill_value=False))]

    id  year success
0    1  2000       N
1    1  2001       N
2    1  2002       Y
5    2  2000       N
6    2  2001       N
7    2  2002       N
8    3  2000       N
9    3  2001       N
10   3  2002       Y