I have a pandas dataframe that looks like this:
For each user, there can be more than one start event, but only one end. Imagine that they sometimes need to start a book over again, but only finish it once.
What I want is to calculate, for each user, the time difference between the first start and the end. In other words, keep the first occurrence of "start" and of "end" in each user's group and subtract one from the other.
>>> (df.groupby(["user", "action"], sort=False)["timestamp"]
...    .first()
...    .droplevel("action")
...    .diff().iloc[1::2])
user
James       29 days
Jim        311 days
Linette     -9 days
Rachel    -331 days
Name: timestamp, dtype: timedelta64[ns]
- for the "timestamp" of each "user" & "action" pair, get the first occurrence
- this will effectively take the first start, and the (only) end
- then drop the carried-over "action" level from the index
- take the difference between consecutive rows, i.e. between each user's start and end
- take every other value to get the per-user difference
(sort=False makes groupby keep the groups in their order of first appearance instead of sorting them by key, so the starts and ends don't get mixed up.)
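To try the steps out end to end, here is a minimal sketch on made-up data. The frame below is hypothetical (the users "Ann" and "Bob" and their dates are not from the question), and it assumes each user's rows are contiguous with the start before the end. The `unstack` variant at the end is a different technique from the chain above, shown only as a possible alternative that does not depend on row order.

```python
import pandas as pd

# Hypothetical sample data, NOT the frame from the question:
# Ann restarts once (two "start" rows), Bob starts only once.
df = pd.DataFrame({
    "user": ["Ann", "Ann", "Ann", "Bob", "Bob"],
    "action": ["start", "start", "end", "start", "end"],
    "timestamp": pd.to_datetime([
        "2021-01-01", "2021-01-05", "2021-01-11",
        "2021-02-01", "2021-02-04",
    ]),
})

# First timestamp per (user, action) pair, in order of first appearance.
first = df.groupby(["user", "action"], sort=False)["timestamp"].first()

# Same chain as above: drop "action", diff consecutive rows,
# keep every other value (positions 1, 3, ...).
per_user = first.droplevel("action").diff().iloc[1::2]
print(per_user)  # Ann: 10 days, Bob: 3 days

# Alternative (not the method above): pivot the actions into columns
# and subtract by label, so the row order no longer matters.
wide = first.unstack("action")
per_user_alt = wide["end"] - wide["start"]
print(per_user_alt)
```

The positional `diff().iloc[1::2]` approach is compact but relies on every user contributing exactly two rows in a consistent order; the label-based subtraction trades a little brevity for not needing that assumption.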