I would like to count the number of events that occured up until a date that is specified in the row.
I want to know how many events took place per ID up until the specified date.
Original dataframe:
Date | ID | Event |
---|---|---|
23.01.23 | 1 | AA |
19.01.23 | 1 | AB |
23.12.22 | 1 | AA |
23.01.23 | 2 | AA |
19.01.23 | 2 | AA |
23.12.22 | 2 | AB |
Expected result:
Date | ID | Event | Count of AA | Count of AB |
---|---|---|---|---|
23.01.23 | 1 | AA | 2 | 1 |
19.01.23 | 1 | AB | 1 | 1 |
23.12.22 | 1 | AA | 1 | 0 |
23.01.23 | 2 | AA | 2 | 1 |
19.01.23 | 2 | AA | 1 | 1 |
23.12.22 | 2 | AB | 0 | 1 |
I was thinking to do a groupby, then pivot the table but the results were incorrect. Then I tried to do it SQL style and I almost have it but it’s not yet what I need.
d = {'Date': ["23.01.23", "19.01.23", "23.12.22", "23.01.23", "19.01.23", "23.12.22"],'ID': [1,1,1,2,2,2], "Event": ["AA","AB","AA","AA","AA","AB"]}
test_df = pd.DataFrame(data = d)
import duckdb
duckdb.query("SELECT Date, Id, Event, COUNT() OVER(PARTITION BY ID, event ) as 'count' FROM test_df").df()
outputs:
The SQL query is missing a distinct within the partition by but I didn’t manage to make it work.
Any type of solution (pandas, python, SQL) is greatly appreciated.
>Solution :
You can use crosstab
+groupby.cumsum
and merge
:
# ensure datetime
test_df['Date'] = pd.to_datetime(test_df['Date'])
# compute crosstab, cumsum per group and merge
test_df.merge((pd.crosstab([test_df['Date'], test_df['ID']], test_df['Event'])
.groupby(level='ID').cumsum()
), left_on=['Date', 'ID'], right_index=True
)
Output:
Date ID Event AA AB
0 2023-01-23 1 AA 2 1
1 2023-01-19 1 AB 1 1
2 2022-12-23 1 AA 1 0
3 2023-01-23 2 AA 2 1
4 2023-01-19 2 AA 1 1
5 2022-12-23 2 AB 0 1