Python – Generate a new dataframe with the first occurrence of a value, and make it faster

I have a pandas dataframe, shown below, with 5 million rows. Say it is about product manufacturing and records the stage of a product as the output for every batch and sub_batch.

 date   batch   sub_batch   output
Jan-22    A         A1        new   
Feb-22    A         A1        new   
Mar-22    A         A1        wip   
Apr-22    A         A1        wip   
May-22    A         A1        wip   
Jun-22    A         A1        done  
Jul-22    A         A1        done  
Aug-22    A         A1        done  
Jan-23    A         A2        new   
Feb-23    A         A2        new   
Mar-23    A         A2        wip   
Apr-23    A         A2        wip   
May-23    A         A2        wip   
Jun-23    A         A2        done  
Jul-23    A         A2        done  
Jan-23    B         B1        new   
Feb-23    B         B1        done  
Mar-23    B         B1        done  
Apr-23    B         B1        done  
May-23    B         B2        new   
Jun-23    B         B2        halt  
Jul-23    B         B2        halt  
Aug-23    B         B2        new   
Sep-23    B         B2        wip   
Oct-23    B         B2        wip   
Nov-23    B         B2        done  
Dec-23    B         B2        done  
  .       .         .          .
  .       .         .          .

What I am trying to achieve is: once a sub_batch hits "done", stop there and delete all subsequent "done" rows of that sub_batch.

So, in the new dataframe, take only the first occurrence of "done" for every sub_batch and keep all other output data as it is.


Also, it is important to maintain the same sort order. A "done" status only comes at the end. Some sub_batch groups are always new or wip and never reach "done"; those should stay as they are. The concern is only to take the first "done", then move on to finding the next one, and so on.

The result dataframe will look like this:

date    batch   sub_batch   output
Jan-22    A         A1        new   
Feb-22    A         A1        new   
Mar-22    A         A1        wip   
Apr-22    A         A1        wip   
May-22    A         A1        wip   
Jun-22    A         A1        done
Jan-23    A         A2        new   
Feb-23    A         A2        new   
Mar-23    A         A2        wip   
Apr-23    A         A2        wip   
May-23    A         A2        wip   
Jun-23    A         A2        done  
Jan-23    B         B1        new   
Feb-23    B         B1        done  
May-23    B         B2        new   
Jun-23    B         B2        halt  
Jul-23    B         B2        halt  
Aug-23    B         B2        new   
Sep-23    B         B2        wip   
Oct-23    B         B2        wip   
Nov-23    B         B2        done  

My approach was: I created 2 columns with an initial value of 0 and assigned them the value 1 by iterating with some logic I came up with.

df["col1"] = 0
df["col2"] = 0

# make a list of all outputs excluding "done"

ls = list(set(df["output"]))
ls.remove("done")


# The code below assigns 1 to all outputs except "done"; "done" rows stay 0

df["col1"] = [1 if item in ls else 0 for item in df["output"]]


# Now, for col2:
# The code below assigns 1 only to the first occurrence of "done" per sub_batch; all others stay 0

unique_sub_batch = set(df["sub_batch"])

for sub in unique_sub_batch:
    for index, item in enumerate(df["output"]):
        if df["sub_batch"][index] == sub and item == "done":
            # .loc avoids chained assignment, which may fail silently
            df.loc[index, "col2"] = 1
            break


# Then, I sum col1 and col2

df["sum"] = df["col1"] + df["col2"]


# Then, I select rows where sum == 1, meaning I only get the first occurrence of "done"
# (the loop breaks) and then continue finding the other values

df = df[df["sum"] == 1]
df.drop(columns=["col1", "col2", "sum"], inplace=True)

My code works and gives the result after 2 days 😂. Could you please help make it faster by suggesting an alternate approach?

Thank you so much for the help, and sorry for the pain of going through such a long question.

>Solution :

If the "done" rows only appear at the end of each batch and sub_batch group, the solution can be simplified with DataFrame.duplicated: invert the mask with ~ and chain it with a comparison that keeps the non-"done" rows:

out = df[df['output'].ne('done') | ~df.duplicated(['batch','sub_batch','output'])]

print (out)
      date batch sub_batch output
0   Jan-22     A        A1    new
1   Feb-22     A        A1    new
2   Mar-22     A        A1    wip
3   Apr-22     A        A1    wip
4   May-22     A        A1    wip
5   Jun-22     A        A1   done
8   Jan-23     A        A2    new
9   Feb-23     A        A2    new
10  Mar-23     A        A2    wip
11  Apr-23     A        A2    wip
12  May-23     A        A2    wip
13  Jun-23     A        A2   done
15  Jan-23     B        B1    new
16  Feb-23     B        B1   done
19  May-23     B        B2    new
20  Jun-23     B        B2   halt
21  Jul-23     B        B2   halt
22  Aug-23     B        B2    new
23  Sep-23     B        B2    wip
24  Oct-23     B        B2    wip
25  Nov-23     B        B2   done
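This one-liner can be verified end to end on a small sample (a minimal sketch; the sample frame below is illustrative, not the asker's full data):

```python
import pandas as pd

# Small sample reproducing the question's structure: one sub_batch
# ending in a run of "done" rows.
df = pd.DataFrame({
    'date': ['Jan-23', 'Feb-23', 'Mar-23', 'Apr-23'],
    'batch': ['B', 'B', 'B', 'B'],
    'sub_batch': ['B1', 'B1', 'B1', 'B1'],
    'output': ['new', 'done', 'done', 'done'],
})

# Keep a row if it is not "done", or if it is the first row with its
# (batch, sub_batch, output) combination — duplicated() marks repeats.
out = df[df['output'].ne('done') | ~df.duplicated(['batch', 'sub_batch', 'output'])]
print(out)
```

Because the mask is fully vectorized, it scales to millions of rows without any Python-level loop, which is where the 2-day runtime came from.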

If you need to remove all rows after the first "done" (not just later "done" rows), use GroupBy.cummax, then add back the row with the first "done" per (batch, sub_batch) group:

m = df['output'].eq('done')
out = df[~m.groupby([df['batch'], df['sub_batch']]).cummax() |
         (m & ~df.duplicated(['batch','sub_batch','output']))]
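The difference between the two variants shows up when a sub_batch resumes after a "done". Here is a sketch on hypothetical data containing a "wip" row after the first "done" (the sample values are assumptions, not from the question):

```python
import pandas as pd

# A sub_batch that resumes after its first "done" — the cummax variant
# drops the trailing rows, which the duplicated()-only variant would keep.
df = pd.DataFrame({
    'date': ['Jan-23', 'Feb-23', 'Mar-23', 'Apr-23', 'May-23'],
    'batch': ['C'] * 5,
    'sub_batch': ['C1'] * 5,
    'output': ['new', 'done', 'wip', 'done', 'done'],
})

m = df['output'].eq('done')
# cummax turns the mask True from the first "done" onward within each group;
# its inverse keeps only rows before that point, and the second term adds
# back the single first "done" row per group.
out = df[~m.groupby([df['batch'], df['sub_batch']]).cummax() |
         (m & ~df.duplicated(['batch', 'sub_batch', 'output']))]
print(out)
```

Only the "new" row and the first "done" survive; the "wip" row after the first "done" is discarded along with the later "done" rows.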