Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Determine in how many categories users are logged per time window unit

I have a log of users and in which category it is logged. Users can be logged in multiple categories. I would like to determine which users are logged in multiple categories. The log is kinda long so preferably it would be sorted on users on top that have been logged in most categories.

user category
1 A
1 B
2 A
3 A
3 B

Currently using group_by I’m only able to show the counts, but not the names of the categories. Most existing questions are usually asking it the other way around (e.g. how many users are there in a category). I would like to do something like this:

user categories count
1 A, B 2
2 A 1
3 A, B 2

To make it more complicated, I would like to determine the same but then I would also like to know if users are logged in multiple categories within specific time windows (e.g. 5 minutes):

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

timestamp user category
2021-12-12 13:00:00 1 A
2021-12-12 13:06:00 1 B
2021-12-12 13:08:00 2 A
2021-12-12 13:09:00 1 B
2021-12-12 13:14:00 3 A
2021-12-12 13:15:00 3 B
2021-12-12 13:15:00 3 A
2021-12-12 13:15:00 1 B

I would like something like below (but open to different display methods) so the categories per user are kinda binned per time-window (5 min in this example):

timestamp user categories count
2021-12-12 13:00:00 1 A 1
2021-12-12 13:05:00 1 B 1
2021-12-12 13:05:00 2 A 1
2021-12-12 13:05:00 1 B 1
2021-12-12 13:10:00 3 A 1
2021-12-12 13:15:00 3 B, A 2
2021-12-12 13:15:00 1 B 1

I tried all kinds of things with combinations .resample, groub_by and .cumsum without any luck. I hope the examples make sense.

>Solution :

Use GroupBy.agg by column user:

df1 = (df.groupby('user', as_index=False)
         .agg(categories=('category', ','.join), counts=('category', 'size')))
print (df1)
   user categories  counts
0     1        A,B       2
1     2          A       1
2     3        A,B       2

For second my output is different – is used Grouper with freq='5min':

df2 = (df.groupby(['user', pd.Grouper(freq='5min', key='timestamp')])
         .agg(categories=('category', ','.join), counts=('category', 'size'))
         .reset_index()
         .sort_values('timestamp', ignore_index=True))
print (df2)
   user           timestamp categories  counts
0     1 2021-12-12 13:00:00          A       1
1     1 2021-12-12 13:05:00        B,B       2
2     2 2021-12-12 13:05:00          A       1
3     3 2021-12-12 13:10:00          A       1
4     1 2021-12-12 13:15:00          B       1
5     3 2021-12-12 13:15:00        B,A       2
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading