I have a dataframe that contains multiple time series, like so:
| Date | Item | Category |
|---|---|---|
| 2021-01-01 | gouda | cheese |
| 2021-01-02 | gouda | cheese |
| 2021-01-04 | gouda | cheese |
| 2021-01-05 | gouda | cheese |
| 2021-02-01 | lettuce | produce |
| 2021-02-02 | lettuce | produce |
| 2021-02-03 | lettuce | produce |
| 2021-02-05 | lettuce | produce |
I’d like to add rows for the missing dates (ex. 2021-01-03 for gouda, 2021-02-04 for lettuce). Note that these series do not necessarily start and end on the same date.
What is the best way to do this in pandas? I’d also like fill the new rows with the values in the "item" and "category" column for that series.
>Solution :
Group by Item and Category, then generate a time series from the min to the max date:
result = (
df.groupby(["Item", "Category"])["Date"]
.apply(lambda s: pd.date_range(s.min(), s.max()))
.explode()
.reset_index()
)