Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas groupby user and count number of events between 2 timestamps

I have a DF1 where each row represents an "event". Each event has the columns "user", and "time":

DF1:

"user","time"
user1,2022-11-14 00:00:04 
user2,2022-11-16 21:34:45
user1,2022-11-14 00:15:22
user3,2022-11-17 15:32:25
...

The "time" value is any timestamp in one week: from 2022-11-14 and 2022-11-20. There are 10k different users, and 27M events.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

I have to divide the week in 8h time-slots (so 21 slots in total), and for each user, I need to look if that I can see any event of that user in each slot.

Then, I should create a DF2 (in which each row is a user) with 21 columns (one for each slot), with numbers 0 or 1: 0 if I have not seen the user in that slot, and 1 if I have seen the user in that slot.

DF2:

"user","slot1","slot2","slot3",...,"slot21"
user1,1,0,0,0,0,0,...,0
user2,0,0,1,1,1,0,...,0
user3,1,1,1,0,0,1,...,1
...

(After that, I will need to order DF2 and plot it as an spare matrix, but that is another story…)

I have managed to fill 1 single row of DF2, but it lasts 30 seconds for 1 user, in this way:

slots = pd.date_range('2022-11-14', periods=22, freq='8h')
row=[]
for i in np.arange(0,slots.value_counts().sum()-1):
    if DF1[(DF1.user=="user1")&(DF1.time.between(slots[i],slots[i+1]))].shape[0] >=1:
        row.append(1)
    else:
        row.append(0)
print(row) 

So making this process for the 10k users would last almost 4 days…

Anyone have an idea on how can I achieve to create DF2 in a quicker way??
Maybe something like DF1.groupby(‘user’).time and then what else?
I can be done in pandas or with any other way, or even different languages, if I get the spare matrix in DF2!

Any help would be much appreciated!

>Solution :

Use crosstab with cut for count values, if need 0,1 ouput compare for not equal 0 and cast to integers:

df = (pd.crosstab(df['user'], 
                 pd.cut(df['time'], bins=slots, labels=False))
        .ne(0)
        .astype(int)
        .add_prefix('slot'))
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading