I am trying to count how often an event occurs before another event occurs per city. To be more specific I am trying to count how often event number 21 occurrs before 2, and the other way around.
The dataset looks like this
import pandas as pd
data = { 'city': ['Amsterdam', 'Vienna', 'Paris', 'Paris', 'Istanbul', 'Istanbul','Delhi', 'London', 'London', 'Barcelona', 'Barcelona'],
'date': [ '2022-09-01T11:34:53', '2022-09-01T13:37:37', '2022-09-01 10:44:22.000', '2022-09-01T10:39:33', '2022-09-01 16:18:24.000', '2022-09-01T16:15:14', '2022-09-01T13:28:33', '2022-09-01 15:50:54.000', '2022-09-01T15:51:07', '2022-09-01 12:24:26.000','2022-09-01T12:24:07' ],
'year': [ '2022', '2022', '2022', '2022', '2022', '2022', '2022', '2022', '2022', '2022','2022' ],
'month': [9,9,9,9,9,9,9,9,9,9,9 ],
'hour': [ 11,13,11,10,17,16,13,16,16,13,12 ],
'eventcode': [ 'J', 'J', 'J', 'J', 'J', 'J', 'J', 'J', 'J', 'J','J' ],
'eventnumber': [ '21', '21', '2', '21', '2', '21', '21', '2', '21', '2','21' ]}
df = pd.DataFrame(data, columns= ['city', 'date', 'year', 'month', 'hour', 'eventcode','eventnumber' ])
print (df)
In this dataset, when looking at the city and date, event 2 occurs 1 time before event 21, and event 21 occurs 3 times before event 2.
The code below works for counting the number of occurences of event 21 happening before 2. But when I switch the 2 and 21 in the code, it results in an infinite loop…
bc=np.array(df['city'])
un_bc,bc_index,bc_count=np.unique(bc,return_counts=True,return_index=True)
new_df=pd.DataFrame()
count=0
for i,j in zip(bc_index,bc_count):
j=j+i-1
while i+1 <= j:
if df.iat[i,7]==21 and df.iat[i+1,7]==2:
count +=1
new_df=new_df.append(df[i:i+2])
i +=1
print(count)
Code above works, code below does not.
for i,j in zip(bc_index,bc_count):
j=j+i-1
while i+1 <= j:
if df.iat[i,7]==2 and df.iat[i+1,7]==21:
count +=1
new_df=new_df.append(df[i:i+2])
i +=1
print(count)
>Solution :
One solution could be as follows:
df['date'] = pd.to_datetime(df['date'])
res = df.sort_values('date').astype({'eventnumber':int})\
.groupby('city')['eventnumber'].diff().value_counts()
res.index = res.index.map({-19:'21_before_2', 19: '2_before_21'})
print(res)
21_before_2 3
2_before_21 1
Name: eventnumber, dtype: int64
Or, if you don’t want to turn df['date'] into datetime, use .sort_values(['year', 'month', 'hour']).
Explanation
- First, use
pd.to_datetimeto turn date strings into datetime. - Next, 1) sort the
dfondate(usingdf.sort_values), 2) turneventnumberstrings intoints(usingdf.astype), 3) applydf.groupbyon columncity, 4) retrievediff, and 5) retrieveSeries.value_counts. - The result will be a count for
-19(21 before 2) and19(2 before 21), with all cities with just one value being dropped. These values will end up as the index, so we can usemapto assign the appropriate labels.