I have a DataFrame which looks like this:
item_id facility_id actual_quantity min_quantity timestamp
0 11 1185 5 3 2022-01-11 00:00:00
1 12 1185 7 4 2022-01-11 00:00:00
2 23 1185 5 5 2022-01-11 00:00:00
3 34 1185 4 6 2022-01-11 00:00:00
4 56 1185 3 7 2022-01-11 00:00:00
5 67 1185 5 8 2022-01-11 00:00:00
6 12 1185 3 3 2022-01-11 01:00:00
7 23 1185 5 4 2022-01-11 01:00:00
8 34 1185 13 5 2022-01-11 01:00:00
9 45 1185 12 6 2022-01-11 01:00:00
df = pd.DataFrame({'item_id': [11, 12, 23, 34, 56, 67, 12, 23, 34, 45],
'facility_id': [1185]*10,
'actual_quantity': [5, 7, 5, 4, 3, 5, 3, 5, 13, 12],
'min_quantity': [3, 4, 5, 6, 7, 8, 3, 4, 5, 6],
'timestamp': [pd.Timestamp('2022-01-11 00:00:00'),
pd.Timestamp('2022-01-11 00:00:00'),
pd.Timestamp('2022-01-11 00:00:00'),
pd.Timestamp('2022-01-11 00:00:00'),
pd.Timestamp('2022-01-11 00:00:00'),
pd.Timestamp('2022-01-11 00:00:00'),
pd.Timestamp('2022-01-11 01:00:00'),
pd.Timestamp('2022-01-11 01:00:00'),
pd.Timestamp('2022-01-11 01:00:00'),
pd.Timestamp('2022-01-11 01:00:00')]})
Here, in timestamp column, we have two unique timestamps. For timestamp1 = Timestamp('2022-01-11 00:00:00'), we have 6 unique item_ids and other columns related to those item_ids, whereas for timestamp2 = Timestamp('2022-01-11 01:00:00'), we have only 4 unique item_ids.
I want to achieve following:
Add rows in timestamp2 with missing item_ids. Keep facility_ids for added rows same, i.e. 1185, actual_quantity should be 0, min_quantity should be same as there in timestamp1 for that item_id.
Do similar thing for item_ids in timestamp2 but not in timestamp1.
The output_df should look like this:
item_id facility_id actual_quantity min_quantity timestamp
0 11 1185 5 3 2022-01-11 00:00:00
1 12 1185 7 4 2022-01-11 00:00:00
2 23 1185 5 5 2022-01-11 00:00:00
3 34 1185 4 6 2022-01-11 00:00:00
4 56 1185 3 7 2022-01-11 00:00:00
5 67 1185 5 8 2022-01-11 00:00:00
6 45 1185 0 6 2022-01-11 00:00:00
7 12 1185 3 3 2022-01-11 01:00:00
8 23 1185 5 4 2022-01-11 01:00:00
9 34 1185 13 5 2022-01-11 01:00:00
10 45 1185 12 6 2022-01-11 01:00:00
11 11 1185 0 6 2022-01-11 01:00:00
12 56 1185 0 7 2022-01-11 01:00:00
13 67 1185 0 8 2022-01-11 01:00:00
What’s the most pythonic way to achieve this?
>Solution :
You could use pivot + stack + fillna + reset_index:
tmp = df.pivot(['item_id', 'facility_id'], ['timestamp'], ['min_quantity', 'actual_quantity'])
tmp['actual_quantity'] = tmp['actual_quantity'].fillna(0)
tmp['min_quantity'] = tmp['min_quantity'].ffill(axis=1).bfill(axis=1)
out = tmp.stack(level=1).reset_index()[['item_id', 'facility_id', 'actual_quantity', 'min_quantity','timestamp']]
Output:
item_id facility_id actual_quantity min_quantity timestamp
0 11 1185 5.0 3.0 2022-01-11 00:00:00
1 11 1185 0.0 3.0 2022-01-11 01:00:00
2 12 1185 7.0 4.0 2022-01-11 00:00:00
3 12 1185 3.0 3.0 2022-01-11 01:00:00
4 23 1185 5.0 5.0 2022-01-11 00:00:00
5 23 1185 5.0 4.0 2022-01-11 01:00:00
6 34 1185 4.0 6.0 2022-01-11 00:00:00
7 34 1185 13.0 5.0 2022-01-11 01:00:00
8 45 1185 0.0 6.0 2022-01-11 00:00:00
9 45 1185 12.0 6.0 2022-01-11 01:00:00
10 56 1185 3.0 7.0 2022-01-11 00:00:00
11 56 1185 0.0 7.0 2022-01-11 01:00:00
12 67 1185 5.0 8.0 2022-01-11 00:00:00
13 67 1185 0.0 8.0 2022-01-11 01:00:00