Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

duplicate rows with different rules for different columns based on availability in a column

I have a DataFrame which looks like this:

   item_id  facility_id  actual_quantity  min_quantity           timestamp
0       11         1185                5             3 2022-01-11 00:00:00
1       12         1185                7             4 2022-01-11 00:00:00
2       23         1185                5             5 2022-01-11 00:00:00
3       34         1185                4             6 2022-01-11 00:00:00
4       56         1185                3             7 2022-01-11 00:00:00
5       67         1185                5             8 2022-01-11 00:00:00
6       12         1185                3             3 2022-01-11 01:00:00
7       23         1185                5             4 2022-01-11 01:00:00
8       34         1185               13             5 2022-01-11 01:00:00
9       45         1185               12             6 2022-01-11 01:00:00
df = pd.DataFrame({'item_id': [11, 12, 23, 34, 56, 67, 12, 23, 34, 45],
                  'facility_id': [1185]*10,
                  'actual_quantity': [5, 7, 5, 4, 3, 5, 3, 5, 13, 12],
                  'min_quantity': [3, 4, 5, 6, 7, 8, 3, 4, 5, 6],
                  'timestamp': [pd.Timestamp('2022-01-11 00:00:00'),
                                pd.Timestamp('2022-01-11 00:00:00'),
                                pd.Timestamp('2022-01-11 00:00:00'),
                                pd.Timestamp('2022-01-11 00:00:00'),
                                pd.Timestamp('2022-01-11 00:00:00'),
                                pd.Timestamp('2022-01-11 00:00:00'),
                                pd.Timestamp('2022-01-11 01:00:00'),
                                pd.Timestamp('2022-01-11 01:00:00'),
                                pd.Timestamp('2022-01-11 01:00:00'),
                                pd.Timestamp('2022-01-11 01:00:00')]})

Here, in timestamp column, we have two unique timestamps. For timestamp1 = Timestamp('2022-01-11 00:00:00'), we have 6 unique item_ids and other columns related to those item_ids, whereas for timestamp2 = Timestamp('2022-01-11 01:00:00'), we have only 4 unique item_ids.

I want to achieve following:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Add rows in timestamp2 with missing item_ids. Keep facility_ids for added rows same, i.e. 1185, actual_quantity should be 0, min_quantity should be same as there in timestamp1 for that item_id.

Do similar thing for item_ids in timestamp2 but not in timestamp1.

The output_df should look like this:

    item_id  facility_id  actual_quantity  min_quantity           timestamp
0        11         1185                5             3 2022-01-11 00:00:00
1        12         1185                7             4 2022-01-11 00:00:00
2        23         1185                5             5 2022-01-11 00:00:00
3        34         1185                4             6 2022-01-11 00:00:00
4        56         1185                3             7 2022-01-11 00:00:00
5        67         1185                5             8 2022-01-11 00:00:00
6        45         1185                0             6 2022-01-11 00:00:00
7        12         1185                3             3 2022-01-11 01:00:00
8        23         1185                5             4 2022-01-11 01:00:00
9        34         1185               13             5 2022-01-11 01:00:00
10       45         1185               12             6 2022-01-11 01:00:00
11       11         1185                0             6 2022-01-11 01:00:00
12       56         1185                0             7 2022-01-11 01:00:00
13       67         1185                0             8 2022-01-11 01:00:00

What’s the most pythonic way to achieve this?

>Solution :

You could use pivot + stack + fillna + reset_index:

tmp = df.pivot(['item_id', 'facility_id'], ['timestamp'], ['min_quantity', 'actual_quantity'])
tmp['actual_quantity'] = tmp['actual_quantity'].fillna(0)
tmp['min_quantity'] = tmp['min_quantity'].ffill(axis=1).bfill(axis=1)
out = tmp.stack(level=1).reset_index()[['item_id', 'facility_id', 'actual_quantity', 'min_quantity','timestamp']]

Output:

    item_id  facility_id  actual_quantity  min_quantity           timestamp
0        11         1185              5.0           3.0 2022-01-11 00:00:00
1        11         1185              0.0           3.0 2022-01-11 01:00:00
2        12         1185              7.0           4.0 2022-01-11 00:00:00
3        12         1185              3.0           3.0 2022-01-11 01:00:00
4        23         1185              5.0           5.0 2022-01-11 00:00:00
5        23         1185              5.0           4.0 2022-01-11 01:00:00
6        34         1185              4.0           6.0 2022-01-11 00:00:00
7        34         1185             13.0           5.0 2022-01-11 01:00:00
8        45         1185              0.0           6.0 2022-01-11 00:00:00
9        45         1185             12.0           6.0 2022-01-11 01:00:00
10       56         1185              3.0           7.0 2022-01-11 00:00:00
11       56         1185              0.0           7.0 2022-01-11 01:00:00
12       67         1185              5.0           8.0 2022-01-11 00:00:00
13       67         1185              0.0           8.0 2022-01-11 01:00:00
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading