spread then subtract consecutive rows in pandas data frame

April 1, 2022

I have a data frame where I need to pair consecutive events within the same day and subtract the later from the former. Each has a timestamp and a date.

         time                     date   event   score
0 2022-03-07 06:45:00+00:00  2022-03-07  light   80.066667
1 2022-03-07 18:12:00+00:00  2022-03-07  dark   79.857667
2 2022-03-30 06:25:00+00:00  2022-03-30  light  107.060833
3 2022-03-30 13:38:00+00:00  2022-03-30  dark  105.324000
4 2022-03-30 13:40:00+00:00  2022-03-30  dark  105.239750
5 2022-03-30 15:47:00+00:00  2022-03-30  light  106.863143
6 2022-04-01 06:25:00+00:00  2022-04-01  light  101.271867

I have tried spreading the data frame using

df = df.pivot(index='time', columns='event', values='score')

event                             light       dark
time                                             
2022-03-07 06:45:00+00:00   80.066667         NaN
2022-03-07 18:12:00+00:00         NaN   79.857667
2022-03-30 06:25:00+00:00  107.060833         NaN
2022-03-30 13:38:00+00:00         NaN  105.324000
2022-03-30 13:40:00+00:00         NaN  105.239750
2022-03-30 15:47:00+00:00  106.863143         NaN
2022-04-01 06:25:00+00:00  101.271867         NaN

however because the events happen at different times the spread data frame has NaNs. I would ideally end up with this, where I keep the time of the first occurrence in the pair (light or dark), align the events (note: corresponding dark to match light as not yet occurred for 2022-04-01) and when light comes first I subtract dark from light and when dark comes first I subtract dark from light.

event                             light  dark          diff
time                                             
2022-03-07 06:45:00+00:00   80.066667    79.857667    -0.208999
2022-03-30 06:25:00+00:00  107.060833    105.324000   -1.7368
2022-03-30 13:40:00+00:00  106.863143    105.239750   -1.6233
2022-04-01 06:25:00+00:00  101.271867    NaN          NaN

>Solution :

Here’s one way using groupby + cumcount to create groups, then use that groups in another groupby + first to get the first time each event happens on a day. Then pivot:

out = (df.assign(time=df.groupby(df.groupby('event').cumcount())['time'].transform('first'))
       .pivot('time', 'event', 'score').reset_index().rename_axis([None], axis=1))

Output:

                        time        dark       light
0  2022-03-07 06:45:00+00:00   79.857667   80.066667
1  2022-03-30 06:25:00+00:00  105.324000  107.060833
2  2022-03-30 13:40:00+00:00  105.239750  106.863143
3  2022-04-01 06:25:00+00:00         NaN  101.271867