Home duplicate rows with different rules for different columns based on availability in a column

Questions

duplicate rows with different rules for different columns based on availability in a column

March 3, 2022

I have a DataFrame which looks like this:

   item_id  facility_id  actual_quantity  min_quantity           timestamp
0       11         1185                5             3 2022-01-11 00:00:00
1       12         1185                7             4 2022-01-11 00:00:00
2       23         1185                5             5 2022-01-11 00:00:00
3       34         1185                4             6 2022-01-11 00:00:00
4       56         1185                3             7 2022-01-11 00:00:00
5       67         1185                5             8 2022-01-11 00:00:00
6       12         1185                3             3 2022-01-11 01:00:00
7       23         1185                5             4 2022-01-11 01:00:00
8       34         1185               13             5 2022-01-11 01:00:00
9       45         1185               12             6 2022-01-11 01:00:00

df = pd.DataFrame({'item_id': [11, 12, 23, 34, 56, 67, 12, 23, 34, 45],
                  'facility_id': [1185]*10,
                  'actual_quantity': [5, 7, 5, 4, 3, 5, 3, 5, 13, 12],
                  'min_quantity': [3, 4, 5, 6, 7, 8, 3, 4, 5, 6],
                  'timestamp': [pd.Timestamp('2022-01-11 00:00:00'),
                                pd.Timestamp('2022-01-11 00:00:00'),
                                pd.Timestamp('2022-01-11 00:00:00'),
                                pd.Timestamp('2022-01-11 00:00:00'),
                                pd.Timestamp('2022-01-11 00:00:00'),
                                pd.Timestamp('2022-01-11 00:00:00'),
                                pd.Timestamp('2022-01-11 01:00:00'),
                                pd.Timestamp('2022-01-11 01:00:00'),
                                pd.Timestamp('2022-01-11 01:00:00'),
                                pd.Timestamp('2022-01-11 01:00:00')]})

Here, in timestamp column, we have two unique timestamps. For timestamp1 = Timestamp('2022-01-11 00:00:00'), we have 6 unique item_ids and other columns related to those item_ids, whereas for timestamp2 = Timestamp('2022-01-11 01:00:00'), we have only 4 unique item_ids.

I want to achieve following:

Add rows in timestamp2 with missing item_ids. Keep facility_ids for added rows same, i.e. 1185, actual_quantity should be 0, min_quantity should be same as there in timestamp1 for that item_id.

Do similar thing for item_ids in timestamp2 but not in timestamp1.

The output_df should look like this:

    item_id  facility_id  actual_quantity  min_quantity           timestamp
0        11         1185                5             3 2022-01-11 00:00:00
1        12         1185                7             4 2022-01-11 00:00:00
2        23         1185                5             5 2022-01-11 00:00:00
3        34         1185                4             6 2022-01-11 00:00:00
4        56         1185                3             7 2022-01-11 00:00:00
5        67         1185                5             8 2022-01-11 00:00:00
6        45         1185                0             6 2022-01-11 00:00:00
7        12         1185                3             3 2022-01-11 01:00:00
8        23         1185                5             4 2022-01-11 01:00:00
9        34         1185               13             5 2022-01-11 01:00:00
10       45         1185               12             6 2022-01-11 01:00:00
11       11         1185                0             6 2022-01-11 01:00:00
12       56         1185                0             7 2022-01-11 01:00:00
13       67         1185                0             8 2022-01-11 01:00:00

What’s the most pythonic way to achieve this?

>Solution :

You could use pivot + stack + fillna + reset_index:

tmp = df.pivot(['item_id', 'facility_id'], ['timestamp'], ['min_quantity', 'actual_quantity'])
tmp['actual_quantity'] = tmp['actual_quantity'].fillna(0)
tmp['min_quantity'] = tmp['min_quantity'].ffill(axis=1).bfill(axis=1)
out = tmp.stack(level=1).reset_index()[['item_id', 'facility_id', 'actual_quantity', 'min_quantity','timestamp']]

Output:

    item_id  facility_id  actual_quantity  min_quantity           timestamp
0        11         1185              5.0           3.0 2022-01-11 00:00:00
1        11         1185              0.0           3.0 2022-01-11 01:00:00
2        12         1185              7.0           4.0 2022-01-11 00:00:00
3        12         1185              3.0           3.0 2022-01-11 01:00:00
4        23         1185              5.0           5.0 2022-01-11 00:00:00
5        23         1185              5.0           4.0 2022-01-11 01:00:00
6        34         1185              4.0           6.0 2022-01-11 00:00:00
7        34         1185             13.0           5.0 2022-01-11 01:00:00
8        45         1185              0.0           6.0 2022-01-11 00:00:00
9        45         1185             12.0           6.0 2022-01-11 01:00:00
10       56         1185              3.0           7.0 2022-01-11 00:00:00
11       56         1185              0.0           7.0 2022-01-11 01:00:00
12       67         1185              5.0           8.0 2022-01-11 00:00:00
13       67         1185              0.0           8.0 2022-01-11 01:00:00