Home Insert a new row in a Pandas dataframe missing values found in other rows and columns

Questions

Insert a new row in a Pandas dataframe missing values found in other rows and columns

June 26, 2023

I have a very large but somewhat incomplete Pandas dataframe in Python where I need to insert rows that are missing based on values found in other rows and columns. An example is something like this.

import pandas as pd
df_test = pd.DataFrame(data=None, columns=['file', 'quarter', 'status'])
df_test.file = ['file_1', 'file_1', 'file_2', 'file_2', 'file_3']
df_test.quarter = ['2022q4', '2023q2', '2022q3', '2022q4', '2023q1']
df_test.status = ['in', 'in', 'in', 'in', 'in']

What I have are different files that were used during different quarters, and if they were used, the ‘status’ for that file and quarter is set to ‘in’. What I want to do in this dataframe is insert rows for when the file was not used and set the status to ‘out’ for the correct quarter. If the file was not used for two quarters or more in a row, only a single new entry that reads ‘out’ is needed for the first quarter when it was not used.

In this example, it means that for ‘file_1’, a new row should be added for quarter = ‘2023q1’ with status = ‘out’. For ‘file_2’ a new row for quarter ‘2023q1’ with status ‘out’ should be added, but nothing new for ‘2023q2’ is needed. For ‘file_3’, just ‘2023q2’ and ‘out’ is needed.

I suppose that I should use a list like the one below and check if a unique file name has all entries in the list or not and from there create new rows, but I’m not sure how to do it. Any help at all is really appreciated, thanks!

quarters = ['2022q3', '2022q4', '2023q1', '2023q2']

Expected output:

     file quarter status
0  file_1  2022q4     in
1  file_1  2023q1    out
2  file_1  2023q2     in
3  file_2  2022q3     in
4  file_2  2022q4     in
5  file_2  2023q1    out
6  file_3  2023q1     in
7  file_3  2023q2    out

>Solution :

One classical approach to fill missing values is to stack/unstack (or melt/pivot).

Here you can do that and use ffill to add missing values that follow an "in":

tmp = df_test.set_index(['file', 'quarter']).unstack()
out = tmp.fillna(tmp.replace('in', 'out').ffill(axis=1, limit=1)).stack().reset_index()

Output:

     file quarter status
0  file_1  2022q4     in
1  file_1  2023q1    out
2  file_1  2023q2     in
3  file_2  2022q3     in
4  file_2  2022q4     in
5  file_2  2023q1    out
6  file_3  2023q1     in
7  file_3  2023q2    out

Intermediate:

tmp.fillna(tmp.replace('in', 'out').ffill(axis=1, limit=1))

        status                     
quarter 2022q3 2022q4 2023q1 2023q2
file                               
file_1     NaN     in    out     in
file_2      in     in    out    NaN
file_3     NaN    NaN     in    out