I am trying to count how often an event occurs before another event occurs per city. To be more specific I am trying to count how often event number 21 occurrs before 2, and the other way around.
The dataset looks like this
import pandas as pd
data = { 'city': ['Amsterdam', 'Vienna', 'Paris', 'Paris', 'Istanbul', 'Istanbul','Delhi', 'London', 'London', 'Barcelona', 'Barcelona'],
'date': [ '2022-09-01T11:34:53', '2022-09-01T13:37:37', '2022-09-01 10:44:22.000', '2022-09-01T10:39:33', '2022-09-01 16:18:24.000', '2022-09-01T16:15:14', '2022-09-01T13:28:33', '2022-09-01 15:50:54.000', '2022-09-01T15:51:07', '2022-09-01 12:24:26.000','2022-09-01T12:24:07' ],
'year': [ '2022', '2022', '2022', '2022', '2022', '2022', '2022', '2022', '2022', '2022','2022' ],
'month': [9,9,9,9,9,9,9,9,9,9,9 ],
'hour': [ 11,13,11,10,17,16,13,16,16,13,12 ],
'eventcode': [ 'J', 'J', 'J', 'J', 'J', 'J', 'J', 'J', 'J', 'J','J' ],
'eventnumber': [ '21', '21', '2', '21', '2', '21', '21', '2', '21', '2','21' ]}
df = pd.DataFrame(data, columns= ['city', 'date', 'year', 'month', 'hour', 'eventcode','eventnumber' ])
print (df)
In this dataset, when looking at the city and date, event 2 occurs 1 time before event 21, and event 21 occurs 3 times before event 2.
The code below works for counting the number of occurences of event 21 happening before 2. But when I switch the 2 and 21 in the code, it results in an infinite loop…
bc=np.array(df['city'])
un_bc,bc_index,bc_count=np.unique(bc,return_counts=True,return_index=True)
new_df=pd.DataFrame()
count=0
for i,j in zip(bc_index,bc_count):
j=j+i-1
while i+1 <= j:
if df.iat[i,7]==21 and df.iat[i+1,7]==2:
count +=1
new_df=new_df.append(df[i:i+2])
i +=1
print(count)
Code above works, code below does not.
for i,j in zip(bc_index,bc_count):
j=j+i-1
while i+1 <= j:
if df.iat[i,7]==2 and df.iat[i+1,7]==21:
count +=1
new_df=new_df.append(df[i:i+2])
i +=1
print(count)
>Solution :
One solution could be as follows:
df['date'] = pd.to_datetime(df['date'])
res = df.sort_values('date').astype({'eventnumber':int})\
.groupby('city')['eventnumber'].diff().value_counts()
res.index = res.index.map({-19:'21_before_2', 19: '2_before_21'})
print(res)
21_before_2 3
2_before_21 1
Name: eventnumber, dtype: int64
Or, if you don’t want to turn df['date']
into datetime
, use .sort_values(['year', 'month', 'hour'])
.
Explanation
- First, use
pd.to_datetime
to turn date strings into datetime. - Next, 1) sort the
df
ondate
(usingdf.sort_values
), 2) turneventnumber
strings intoints
(usingdf.astype
), 3) applydf.groupby
on columncity
, 4) retrievediff
, and 5) retrieveSeries.value_counts
. - The result will be a count for
-19
(21 before 2) and19
(2 before 21), with all cities with just one value being dropped. These values will end up as the index, so we can usemap
to assign the appropriate labels.