Pseudonymize data in a pandas DataFrame

January 25, 2023

I have a pandas dataframe that looks like this:

id	user	action	timestamp
1	Jim	start	12/10/2022
2	Jim	start	12/10/2022
3	Jim	end	2/2/2022
4	Linette	start	8/18/2022
5	Linette	start	3/24/2022
6	Linette	end	8/27/2022
7	Rachel	start	2/7/2022
8	Rachel	end	1/4/2023
9	James	start	6/12/2022
10	James	end	5/14/2022
11	James	start	11/28/2022
12	James	start	8/9/2022
13	James	end	2/15/2022

For each user, there can be more than one start event, but only one end. Imagine that they sometimes need to start a book over again, but only finish it once.

What I want is to calculate the time difference between the first start and the end, so keep, for each user, the first occurrence of "start" and "end" in each group.

Any hint?

>Solution :

>>> (df.groupby(["user", "action"], sort=False)["timestamp"]
       .first()
       .droplevel("action")
       .diff().iloc[1::2])

user
James       29 days
Jim        311 days
Linette     -9 days
Rachel    -331 days
Name: timestamp, dtype: timedelta64[ns]

for "timestamp" of each "user" & "action" pair, get the first occurences
- this will effectively take the first start, and the (only) end
then drop the carried over "action" level of groupers
take the difference from ends and starts
take every other value to get per-user difference

(sort=False ensures during groupby that start’s don’t get mixed up.)