I’m trying to create rows for the missing dates so that my df contains every date in 2023.
This is what I tried, but it raises an error:
ValueError: cannot reindex on an axis with duplicate labels
import pandas as pd
data = {"eventDate": ["2023-01-01", "2023-01-01", "2023-12-10"],
        "col1": [1, 2, 3], "col2": ["a", "b", "c"]}
df = pd.DataFrame(data)
#What I have:
# eventDate col1 col2
# 2023-01-01 1 a
# 2023-01-01 2 b
# 2023-12-10 3 c
#What I want:
# eventDate col1 col2
# 2023-01-01 1 a
# 2023-01-01 2 b
# 2023-01-02 NaN NaN
# 2023-01-03 NaN NaN
# ...
# 2023-12-10 3 c
# ...
# 2023-12-31 NaN NaN
df["eventDate"] = pd.to_datetime(df["eventDate"])
df = df.set_index("eventDate")
print(df.index)
#DatetimeIndex(['2023-01-01', '2023-01-01', '2023-12-10'], dtype='datetime64[ns]', name='eventDate', freq=None)
idx = pd.date_range('2023-01-01', '2023-12-31')
print(idx.duplicated().any())
#False
df = df.reindex(idx)
#ValueError: cannot reindex on an axis with duplicate labels
>Solution :
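reindex fails because df.index itself contains duplicate labels (2023-01-01 appears twice), not because of idx. You could instead take the dates in idx that are missing from df.index, then concat them as empty rows: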
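df = pd.DataFrame(data)
df["eventDate"] = pd.to_datetime(df["eventDate"])
df = df.set_index("eventDate")
idx = pd.date_range("2023-01-01", "2023-12-31")
# Add an empty row for each date in idx that df lacks, then restore date order
out = pd.concat([df, pd.DataFrame(index=idx.difference(df.index))]).sort_index()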
Output:
col1 col2
2023-01-01 1.0 a
2023-01-01 2.0 b
2023-01-02 NaN NaN
2023-01-03 NaN NaN
2023-01-04 NaN NaN
... ... ...
2023-12-10 3.0 c
... ... ...
2023-12-27 NaN NaN
2023-12-28 NaN NaN
2023-12-29 NaN NaN
2023-12-30 NaN NaN
2023-12-31 NaN NaN
[366 rows x 2 columns]
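As an alternative to the concat approach, a left merge against a full calendar should give the same rows while keeping eventDate as a regular column. This is only a sketch: the names calendar and out_merge are made up for illustration, and it assumes eventDate has already been converted to datetime so that both merge keys share a dtype.

```python
import pandas as pd

data = {"eventDate": ["2023-01-01", "2023-01-01", "2023-12-10"],
        "col1": [1, 2, 3], "col2": ["a", "b", "c"]}
df = pd.DataFrame(data)
df["eventDate"] = pd.to_datetime(df["eventDate"])

# One row per calendar day of 2023 (hypothetical helper frame)
calendar = pd.DataFrame({"eventDate": pd.date_range("2023-01-01", "2023-12-31")})

# A left merge keeps every calendar day; a date that occurs twice in df
# matches twice, so the duplicate rows survive instead of raising an error
out_merge = calendar.merge(df, on="eventDate", how="left")

print(out_merge.shape)
```

A merge tolerates repeated keys on the right-hand side, which is why it avoids the duplicate-label error that reindex raises. If a date index is preferred, out_merge.set_index("eventDate") can be applied afterwards.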