I have a dataframe where two of the columns represent the start and end date of the data record. There are multiple years. My goal is to assign a new column that represents the time step of the data record in each row. Since I have a location columns as well, some of these weeks will be repeating.
import pandas as pd
dates = pd.date_range(start='2021-11-11', periods=20, freq='W')
df = pd.DataFrame({
'start_date': np.repeat(dates, 5),
'end_date': np.repeat(dates + pd.DateOffset(days=6), 5),
'country': ['USA', 'Canada', 'UK', 'Australia', 'Russia'] * 20
})
df = df.sort_values("start_date")
start_date end_date country
0 2021-11-14 2021-11-20 USA
1 2021-11-14 2021-11-20 Canada
2 2021-11-14 2021-11-20 UK
3 2021-11-14 2021-11-20 Australia
4 2021-11-14 2021-11-20 Russia
I can get the week number using isocalendar().week, but it is giving the week number of the corresponding year. For instance, if 2021-11-14 and 2021-11-20 is the first week in the data frame, it should get 1. It may skip the next week, and have another record starting from 2021-11-27. Such time step should be the second week for me in the data frame.
>Solution :
IIUC, you can use groupby_ngroup:
df['week'] = df.groupby(df['start_date']).ngroup().add(1)
print(df)
# Output
start_date end_date country week
0 2021-11-14 2021-11-20 USA 1
1 2021-11-14 2021-11-20 Canada 1
2 2021-11-14 2021-11-20 UK 1
3 2021-11-14 2021-11-20 Australia 1
4 2021-11-14 2021-11-20 Russia 1
.. ... ... ... ...
98 2022-03-27 2022-04-02 Australia 20
95 2022-03-27 2022-04-02 USA 20
96 2022-03-27 2022-04-02 Canada 20
97 2022-03-27 2022-04-02 UK 20
99 2022-03-27 2022-04-02 Russia 20
[100 rows x 4 columns]
Alternative with pd.factorize (IF the dataframe is already sorted by start_date value:
df['week'] = pd.factorize(df['start_date'])[0] + 1