Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to produce monthly count when given a date range in pandas?

I have a dataframe that records users, a label, and the start and end date of them being labelled as such
e.g.

user label start_date end_date
1 x 2018-01-01 2018-10-01
2 x 2019-05-10 2020-01-01
3 y 2019-04-01 2022-04-20
1 b 2018-10-01 2020-05-08

etc

where each row is for a given user and a label; a user appears multiple times for different labels

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

I want to get a count of users for every month for each label, such as this:

date count_label_x count_label_y count_label_b count_label_
2018-01 10 0 20 5
2018-02 2 5 15 3
2018-03 20 6 8 3

etc

where for instance for the first entry of the previous table, that user should be counted once for every month between his start and end date. The problem boils down to this and since I only have a few labels I can filter labels one by one and produce one output for each label. But how do I check and count users given an interval?

Thanks

>Solution :

You can use date_range combined with to_period to generate the active months, then pivot_table with aggfunc='nunique' to aggregate the unique user (if you want to count the duplicated users use aggfunc='count'):

out = (df
   .assign(period=[pd.date_range(a, b, freq='M').to_period('M')
                   for a,b in zip(df['start_date'], df['end_date'])])
   .explode('period')
   .pivot_table(index='period', columns='label', values='user',
                aggfunc='nunique', fill_value=0)
)

output:

label    b  x  y
period          
2018-01  0  1  0
2018-02  0  1  0
2018-03  0  1  0
2018-04  0  1  0
2018-05  0  1  0
...
2021-12  0  0  1
2022-01  0  0  1
2022-02  0  0  1
2022-03  0  0  1

handling NaT

if you have the same start/end and want to count the value:

out = (df
   .assign(period=[pd.date_range(a, b, freq='M').to_period('M')
                   for a,b in zip(df['start_date'], df['end_date'])])
   .explode('period')
   .assign(period=lambda d: d['period'].fillna(d['start_date'].dt.to_period('M')))
   .pivot_table(index='period', columns='label', values='user',
                aggfunc='nunique', fill_value=0)
)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading