i have a dataframe which contains a client code, the number of the contract and the products from the contract.
Something like this :
| client_code | contract_number | product |
|---|---|---|
| AAAA | 1000 | Water |
| AAAA | 1000 | Soda |
| AAAA | 1000 | Food |
| BACD | 1001 | Water |
| BACD | 1001 | Soda |
| DAMR | 1002 | Food |
And I want to add a column which contains a count to see how many products are on a contract from 1 to n . Something like this:
| client_code | contract_number | product | count |
|---|---|---|---|
| AAAA | 1000 | Water | 1 |
| AAAA | 1000 | Soda | 2 |
| AAAA | 1000 | Food | 3 |
| BACD | 1001 | Water | 1 |
| BACD | 1001 | Soda | 2 |
| DAMR | 1002 | Food | 1 |
I’ve tried with a for loop but it’s too slow ( like an hour ).
df['count']=0
k=1
for i in range(0,len(df)-1):
if(df.at[i,'contract_number']==df.at[i+1,'contract_number']):
df.at[i,'count']=k
k=k+1
else :
df.at[i,'count']=k
k=1
df.at[i+1,'count']=k
Any ideea how to speed up my code ?
PS : My data frame contains 500.000 lines .
Thank you !
>Solution :
IIUC, you want a cumulative count by each client_code (or probably contract_number) – you can do that with the cumcount function:
df.loc[:, 'count'] = df.groupby('client_code').cumcount() + 1
Output
client_code contract_number product count
0 AAAA 1000 Water 1
1 AAAA 1000 Soda 2
2 AAAA 1000 Food 3
3 BACD 1001 Water 1
4 BACD 1001 Soda 2
5 DAMR 1002 Food 1