Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

User trip anaysis from pandas dataframe

I have a dataframe containing users’ trips and mode of travel used to complish the trips. In the original trips file, some trips are actually a single trip but split into 2 trips due to user changing travel mode or a stop for some minutes. I want to make analysis number of trips covered for certain period, but clearly wouldn’t like counting a single trip twice, as such I needed a way to identify such trips and only count them once.

For example, consider this example dataframe (extract from original file):

df = pd.DataFrame({
    'user': [62,62,62,62,62,62,62,62,62],
    'start': ['2008-06-20 04:21:40','2008-06-20 05:40:31','2008-06-21 04:23:39',
              '2008-06-21 04:47:53','2008-07-13 05:45:27','2008-07-13 06:47:57',
              '2008-07-14 09:08:06','2008-07-14 13:12:46','2008-07-14 13:24:23'],
    'end': ['2008-06-20 05:33:46','2008-06-20 05:53:11','2008-06-21 04:35:15',
            '2008-06-21 05:43:20','2008-07-13 06:02:54','2008-07-13 07:20:19',
            '2008-07-14 09:17:15','2008-07-14 13:20:10','2008-07-14 13:30:43'],
    'mode': ['bus','walk','bus','bus','bus','bus','taxi','bus','bus']
})

So that:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df
    user        start                   end            mode
0   62  2008-06-20 04:21:40     2008-06-20 05:33:46     bus
1   62  2008-06-20 05:40:31     2008-06-20 05:53:11     walk
2   62  2008-06-21 04:23:39     2008-06-21 04:35:15     bus
3   62  2008-06-21 04:47:53     2008-06-21 05:43:20     bus
4   62  2008-07-13 05:45:27     2008-07-13 06:02:54     bus
5   62  2008-07-13 06:47:57     2008-07-13 07:20:19     bus
6   62  2008-07-14 09:08:06     2008-07-14 09:17:15     taxi
7   62  2008-07-14 13:12:46     2008-07-14 13:20:10     bus
8   62  2008-07-14 13:24:23     2008-07-14 13:30:43     bus

We can see from this particular user’s trips that:

  • the first trip ended at 05.33 and the second began at 05.40. This is a single trip only that the user stayed for about 7-minutes and then changed travel mode bus -> walk.
  • the third trip ended at 04.35 and the fouth began at 04.47. We consider this a single trip since the time in-between is less than 20-minutes. Perhaps the user experienced data logging issued or stayed for about 12-minutes then continued.
  • the fifth trip ended at 06:02 and the sixth began at 06:47. We consider these as 2-seperate trips since the time interval is beyond 20-minutes.
  • the 8th trip ended at 13.20 and the 9th began at 13.24. This is a single trip as well.

In the given example, this user covers 6-distinct trips, which could otherwise be wrongly counted as 9-trips. I have over 50 distinct users covering several trips, so this could lead to serious error in the trip analysis.

We consider 2 consecutive user’s trips as single trip, if the time interval is less than 20 minutes.

I am not sure how to do such analyses of distinct trips using pandas.

>Solution :

You can shift the end per group and calculate the difference to the next start. Then use it to compute groups:

# ensure datetime
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime)

# sort by user/start
df2 = df.sort_values(by=['user', 'start', 'end'])

# if end is within 20 min of next start, then keep in same group
group = df2['start'].sub(df2.groupby('user')['end'].shift()).gt('20 min').cumsum()
df['group'] = group

output:

   user               start                 end  mode  group
0    62 2008-06-20 04:21:40 2008-06-20 05:33:46   bus      0
1    62 2008-06-20 05:40:31 2008-06-20 05:53:11  walk      0
2    62 2008-06-21 04:23:39 2008-06-21 04:35:15   bus      1
3    62 2008-06-21 04:47:53 2008-06-21 05:43:20   bus      1
4    62 2008-07-13 05:45:27 2008-07-13 06:02:54   bus      2
5    62 2008-07-13 06:47:57 2008-07-13 07:20:19   bus      3
6    62 2008-07-14 09:08:06 2008-07-14 09:17:15  taxi      4
7    62 2008-07-14 13:12:46 2008-07-14 13:20:10   bus      5
8    62 2008-07-14 13:24:23 2008-07-14 13:30:43   bus      5

Aggregated data:

(df
 .groupby(group)
 .agg({'user': 'first', 'start': 'first', 'end': 'max',
       'mode': lambda x: '+'.join(set(x))})
)

output:

   user               start                 end      mode
0    62 2008-06-20 04:21:40 2008-06-20 05:53:11  bus+walk
1    62 2008-06-21 04:23:39 2008-06-21 05:43:20       bus
2    62 2008-07-13 05:45:27 2008-07-13 06:02:54       bus
3    62 2008-07-13 06:47:57 2008-07-13 07:20:19       bus
4    62 2008-07-14 09:08:06 2008-07-14 09:17:15      taxi
5    62 2008-07-14 13:12:46 2008-07-14 13:30:43       bus
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading