Create new variable based on change in another variable in Python

August 22, 2022

import pandas as pd

df = {'Date': ["2011-10-19", 
              "2013-01-14", 
              "2014-05-27",
              "2014-06-23",
              "2014-08-12",
              "2014-09-22",
              "2014-09-22",
               "2014-09-22"
             ], 'Status': ["Pending", 
                           "Pending", 
                           "Complete", 
                           "Pending",
                          "Complete",
                           "Pending", 
                           "Pending", 
                           "Pending"],
             'Group': ["a",
                       "a",
                       "a",
                       "a", 
                       "b",
                       "b",
                       "b",
                       "b"]}
df = pd.DataFrame(data=df)
df

I would like to create another variable based on the change in Status over time for each group such that they are considered a "completer" the next row after they have Status = "Complete"

For example I would like to create the "completer" column in the df2 table:

df2 = {'Date': ["2011-10-19", 
              "2013-01-14", 
              "2014-05-27",
              "2014-06-23",
              "2014-08-12",
              "2014-09-22",
              "2014-09-22",
               "2014-09-22"
             ], 'Status': ["Pending", 
                           "Pending", 
                           "Complete", 
                           "Pending",
                          "Complete",
                           "Pending", 
                           "Pending", 
                           "Pending"],
             'Group': ["a",
                       "a",
                       "a",
                       "a", 
                       "b",
                       "b",
                       "b",
                       "b"],
             'Completer': ["Non-Completer",
                          "Non-Completer",
                          "Non-Completer",
                          "Completer",
                          "Non-Completer",
                          "Completer",
                          "Completer",
                          "Completer"]}
df2 = pd.DataFrame(data=df2)
df2

Thanks!

>Solution :

I was able to solve it in two steps.

Firstly I created a column which marks the first date that each group has "Completed", and added 1 to it so that we update values starting from the next row.

Secondly, since True / False translates to 1 / 0 respectively I used a groupby on Group and cummax, which would update all following rows per group to True.

Then lastly I just used replace and dropped the helping column.

df['first_date_per_group'] = df.index.isin(df.loc[df['Status'].eq('Complete')]['Group'].index + 1)

df = df.assign(Completer=df.groupby('Group')['first_date_per_group'].cummax()).replace(
    {True:'Completer',False:'Non-Completer'}).drop('first_date_per_group',axis=1)

prints:

         Date    Status Group      Completer
0  2011-10-19   Pending     a  Non-Completer
1  2013-01-14   Pending     a  Non-Completer
2  2014-05-27  Complete     a  Non-Completer
3  2014-06-23   Pending     a      Completer
4  2014-08-12  Complete     b  Non-Completer
5  2014-09-22   Pending     b      Completer
6  2014-09-22   Pending     b      Completer
7  2014-09-22   Pending     b      Completer