Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

how to get a unique week number for start and end dates in multi years – Pandas

I have a dataframe where two of the columns represent the start and end date of the data record. There are multiple years. My goal is to assign a new column that represents the time step of the data record in each row. Since I have a location columns as well, some of these weeks will be repeating.

import pandas as pd


dates = pd.date_range(start='2021-11-11', periods=20, freq='W')


df = pd.DataFrame({
    'start_date': np.repeat(dates, 5),
    'end_date': np.repeat(dates + pd.DateOffset(days=6), 5),
    'country': ['USA', 'Canada', 'UK', 'Australia', 'Russia'] * 20
})

df = df.sort_values("start_date")


    start_date  end_date    country 
0   2021-11-14  2021-11-20  USA 
1   2021-11-14  2021-11-20  Canada
2   2021-11-14  2021-11-20  UK
3   2021-11-14  2021-11-20  Australia   
4   2021-11-14  2021-11-20  Russia  

I can get the week number using isocalendar().week, but it is giving the week number of the corresponding year. For instance, if 2021-11-14 and 2021-11-20 is the first week in the data frame, it should get 1. It may skip the next week, and have another record starting from 2021-11-27. Such time step should be the second week for me in the data frame.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

IIUC, you can use groupby_ngroup:

df['week'] = df.groupby(df['start_date']).ngroup().add(1)
print(df)

# Output
   start_date   end_date    country  week
0  2021-11-14 2021-11-20        USA     1
1  2021-11-14 2021-11-20     Canada     1
2  2021-11-14 2021-11-20         UK     1
3  2021-11-14 2021-11-20  Australia     1
4  2021-11-14 2021-11-20     Russia     1
..        ...        ...        ...   ...
98 2022-03-27 2022-04-02  Australia    20
95 2022-03-27 2022-04-02        USA    20
96 2022-03-27 2022-04-02     Canada    20
97 2022-03-27 2022-04-02         UK    20
99 2022-03-27 2022-04-02     Russia    20

[100 rows x 4 columns]

Alternative with pd.factorize (IF the dataframe is already sorted by start_date value:

df['week'] = pd.factorize(df['start_date'])[0] + 1
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading