I have a dataframe looking like this:
| Timestamp | description |
|---|---|
| 0 | Parser starts |
| 12 | parsing |
| 24 | parsing |
| 26 | Parsing finished |
| 28 | Parser starts |
| 45 | Parsing finished |
I want to calculate how long each parse took, i.e. the difference between the timestamp of each row where df['description'] == 'Parsing finished' and the timestamp of the preceding row where df['description'] == 'Parser starts'. I know I can use Series.diff(), but I can only find how to use it with a fixed number of periods. I want to set the period based on the description value.
Expected output:
| Timestamp | description | difference |
|---|---|---|
| 0 | Parser starts | NaN |
| 12 | parsing | NaN |
| 24 | parsing | NaN |
| 26 | Parsing finished | 26 |
| 28 | Parser starts | NaN |
| 45 | Parsing finished | 17 |
I thought of looping over each row but this seems counterintuitive when using Pandas.
EDIT: fixed a wrong value thanks to a comment from @mozway. To make the requirement clearer, here is an extended table:
| Timestamp | description |
|---|---|
| 0 | Parser starts |
| 12 | parsing |
| 24 | parsing |
| 26 | Parsing finished |
| 27 | Uploading results |
| 28 | Parser starts |
| 45 | Parsing finished |
I do not want the timestamp of "Uploading results" (or of any other row that falls outside a "Parser starts" / "Parsing finished" pair) to be part of the diff. Therefore, grouping on "Parser starts" alone does not give the result I'm looking for.
>Solution :
You can use a groupby:
import numpy as np
# make groups starting with "Parser starts"
group = df['description'].eq('Parser starts').cumsum()
# set up the grouper
g = df.groupby(group)
# write ptp (= max - min of Timestamp) on the last row of each group
df.loc[g.cumcount(ascending=False).eq(0),
       'difference'] = g['Timestamp'].transform(np.ptp)
Output:
Timestamp description difference
0 0 Parser starts NaN
1 12 parsing NaN
2 24 parsing NaN
3 26 Parsing finished 26.0
4 28 Parser starts NaN
5 45 Parsing finished 17.0
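
For the edited table, the last row of the first group is "Uploading results", so the groupby approach above would, as far as I can tell, write 27 on that row instead of 26 on "Parsing finished". One possible alternative is to carry the most recent "Parser starts" timestamp forward and keep the subtraction only on "Parsing finished" rows. This is a minimal sketch, not part of the original answer: the names `is_start`, `is_end`, and `last_start` are my own, and it assumes every "Parsing finished" row is preceded by a matching "Parser starts" row.

```python
import pandas as pd

# Edited example table from the question, including the "Uploading results" row
df = pd.DataFrame({
    "Timestamp": [0, 12, 24, 26, 27, 28, 45],
    "description": [
        "Parser starts", "parsing", "parsing", "Parsing finished",
        "Uploading results", "Parser starts", "Parsing finished",
    ],
})

is_start = df["description"].eq("Parser starts")
is_end = df["description"].eq("Parsing finished")

# Timestamp of the most recent "Parser starts" row, carried forward to later rows
last_start = df["Timestamp"].where(is_start).ffill()

# Keep the elapsed time only on "Parsing finished" rows; all other rows stay NaN
df["difference"] = (df["Timestamp"] - last_start).where(is_end)

print(df)
```

Because the mask `is_end` decides where a value is written, rows such as "Uploading results" never receive a difference, regardless of where they fall between two parses.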