User trip anaysis from pandas dataframe

June 13, 2022

I have a dataframe containing users’ trips and mode of travel used to complish the trips. In the original trips file, some trips are actually a single trip but split into 2 trips due to user changing travel mode or a stop for some minutes. I want to make analysis number of trips covered for certain period, but clearly wouldn’t like counting a single trip twice, as such I needed a way to identify such trips and only count them once.

For example, consider this example dataframe (extract from original file):

df = pd.DataFrame({
    'user': [62,62,62,62,62,62,62,62,62],
    'start': ['2008-06-20 04:21:40','2008-06-20 05:40:31','2008-06-21 04:23:39',
              '2008-06-21 04:47:53','2008-07-13 05:45:27','2008-07-13 06:47:57',
              '2008-07-14 09:08:06','2008-07-14 13:12:46','2008-07-14 13:24:23'],
    'end': ['2008-06-20 05:33:46','2008-06-20 05:53:11','2008-06-21 04:35:15',
            '2008-06-21 05:43:20','2008-07-13 06:02:54','2008-07-13 07:20:19',
            '2008-07-14 09:17:15','2008-07-14 13:20:10','2008-07-14 13:30:43'],
    'mode': ['bus','walk','bus','bus','bus','bus','taxi','bus','bus']
})

So that:

df
    user        start                   end            mode
0   62  2008-06-20 04:21:40     2008-06-20 05:33:46     bus
1   62  2008-06-20 05:40:31     2008-06-20 05:53:11     walk
2   62  2008-06-21 04:23:39     2008-06-21 04:35:15     bus
3   62  2008-06-21 04:47:53     2008-06-21 05:43:20     bus
4   62  2008-07-13 05:45:27     2008-07-13 06:02:54     bus
5   62  2008-07-13 06:47:57     2008-07-13 07:20:19     bus
6   62  2008-07-14 09:08:06     2008-07-14 09:17:15     taxi
7   62  2008-07-14 13:12:46     2008-07-14 13:20:10     bus
8   62  2008-07-14 13:24:23     2008-07-14 13:30:43     bus

We can see from this particular user’s trips that:

the first trip ended at 05.33 and the second began at 05.40. This is a single trip only that the user stayed for about 7-minutes and then changed travel mode bus -> walk.
the third trip ended at 04.35 and the fouth began at 04.47. We consider this a single trip since the time in-between is less than 20-minutes. Perhaps the user experienced data logging issued or stayed for about 12-minutes then continued.
the fifth trip ended at 06:02 and the sixth began at 06:47. We consider these as 2-seperate trips since the time interval is beyond 20-minutes.
the 8th trip ended at 13.20 and the 9th began at 13.24. This is a single trip as well.

In the given example, this user covers 6-distinct trips, which could otherwise be wrongly counted as 9-trips. I have over 50 distinct users covering several trips, so this could lead to serious error in the trip analysis.

We consider 2 consecutive user’s trips as single trip, if the time interval is less than 20 minutes.

I am not sure how to do such analyses of distinct trips using pandas.

>Solution :

You can shift the end per group and calculate the difference to the next start. Then use it to compute groups:

# ensure datetime
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime)

# sort by user/start
df2 = df.sort_values(by=['user', 'start', 'end'])

# if end is within 20 min of next start, then keep in same group
group = df2['start'].sub(df2.groupby('user')['end'].shift()).gt('20 min').cumsum()
df['group'] = group

output:

   user               start                 end  mode  group
0    62 2008-06-20 04:21:40 2008-06-20 05:33:46   bus      0
1    62 2008-06-20 05:40:31 2008-06-20 05:53:11  walk      0
2    62 2008-06-21 04:23:39 2008-06-21 04:35:15   bus      1
3    62 2008-06-21 04:47:53 2008-06-21 05:43:20   bus      1
4    62 2008-07-13 05:45:27 2008-07-13 06:02:54   bus      2
5    62 2008-07-13 06:47:57 2008-07-13 07:20:19   bus      3
6    62 2008-07-14 09:08:06 2008-07-14 09:17:15  taxi      4
7    62 2008-07-14 13:12:46 2008-07-14 13:20:10   bus      5
8    62 2008-07-14 13:24:23 2008-07-14 13:30:43   bus      5

Aggregated data:

(df
 .groupby(group)
 .agg({'user': 'first', 'start': 'first', 'end': 'max',
       'mode': lambda x: '+'.join(set(x))})
)

output:

   user               start                 end      mode
0    62 2008-06-20 04:21:40 2008-06-20 05:53:11  bus+walk
1    62 2008-06-21 04:23:39 2008-06-21 05:43:20       bus
2    62 2008-07-13 05:45:27 2008-07-13 06:02:54       bus
3    62 2008-07-13 06:47:57 2008-07-13 07:20:19       bus
4    62 2008-07-14 09:08:06 2008-07-14 09:17:15      taxi
5    62 2008-07-14 13:12:46 2008-07-14 13:30:43       bus