count if event occurs per city

I am trying to count how often an event occurs before another event occurs per city. To be more specific I am trying to count how often event number 21 occurrs before 2, and the other way around.

The dataset looks like this

import pandas as pd

data = { 'city': ['Amsterdam', 'Vienna', 'Paris', 'Paris',  'Istanbul', 'Istanbul','Delhi', 'London', 'London', 'Barcelona', 'Barcelona'], 
        'date': [ '2022-09-01T11:34:53', '2022-09-01T13:37:37', '2022-09-01 10:44:22.000', '2022-09-01T10:39:33', '2022-09-01 16:18:24.000', '2022-09-01T16:15:14', '2022-09-01T13:28:33', '2022-09-01 15:50:54.000', '2022-09-01T15:51:07', '2022-09-01 12:24:26.000','2022-09-01T12:24:07' ], 
        'year': [ '2022', '2022', '2022', '2022', '2022', '2022', '2022', '2022', '2022', '2022','2022' ],
        'month': [9,9,9,9,9,9,9,9,9,9,9 ],
        'hour': [ 11,13,11,10,17,16,13,16,16,13,12 ],
        'eventcode': [ 'J', 'J', 'J', 'J', 'J', 'J', 'J', 'J', 'J', 'J','J' ],
        'eventnumber': [ '21', '21', '2', '21', '2', '21', '21', '2', '21', '2','21' ]}

df =  pd.DataFrame(data, columns= ['city', 'date', 'year',  'month', 'hour', 'eventcode','eventnumber' ])
print (df) 

In this dataset, when looking at the city and date, event 2 occurs 1 time before event 21, and event 21 occurs 3 times before event 2.

The code below works for counting the number of occurences of event 21 happening before 2. But when I switch the 2 and 21 in the code, it results in an infinite loop…

bc=np.array(df['city'])
un_bc,bc_index,bc_count=np.unique(bc,return_counts=True,return_index=True)
new_df=pd.DataFrame()
count=0

for i,j in zip(bc_index,bc_count):
    j=j+i-1
    while i+1 <= j:
        if df.iat[i,7]==21 and df.iat[i+1,7]==2:
            count +=1
            new_df=new_df.append(df[i:i+2])
        i +=1
        
print(count)

Code above works, code below does not.

for i,j in zip(bc_index,bc_count):
    j=j+i-1
    while i+1 <= j:
        if df.iat[i,7]==2 and df.iat[i+1,7]==21:
            count +=1
            new_df=new_df.append(df[i:i+2])
        i +=1
        
print(count)

>Solution :

One solution could be as follows:

df['date'] = pd.to_datetime(df['date'])

res = df.sort_values('date').astype({'eventnumber':int})\
    .groupby('city')['eventnumber'].diff().value_counts()
res.index = res.index.map({-19:'21_before_2', 19: '2_before_21'})

print(res)

21_before_2    3
2_before_21    1
Name: eventnumber, dtype: int64

Or, if you don’t want to turn df['date'] into datetime, use .sort_values(['year', 'month', 'hour']).

Explanation

  • First, use pd.to_datetime to turn date strings into datetime.
  • Next, 1) sort the df on date (using df.sort_values), 2) turn eventnumber strings into ints (using df.astype), 3) apply df.groupby on column city, 4) retrieve diff, and 5) retrieve Series.value_counts.
  • The result will be a count for -19 (21 before 2) and 19 (2 before 21), with all cities with just one value being dropped. These values will end up as the index, so we can use map to assign the appropriate labels.

Leave a Reply