Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Calculate list of missing columns id's

I would want to calculate a list of missing columns id’s.

Current_dataframe:

data = {'DIVISION':  ['DENVER', 'JEWEL', 'JEWEL','DENVER', np.nan],
        'STORE_ID': [np.nan,np.nan,'0001','0001',np.nan],
        'FORECAST_DATE':['2021-08-15','2021-08-15','2021-08-15','2021-08-15','2021-08-15'],
        'START_TIME':[np.nan,np.nan,9.0,np.nan,np.nan],
        'FULFILLMENT_TYPE':['DUG','DELIVERY','DUG','DUG','DUG'],
        'MULTIPLIER':[1.04,1.10,1.04,1.10,0.90],
        'OVERWRITE':[np.nan,0,np.nan,1,2]
        }
df = pd.DataFrame(data)

Expected output:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

data = {'DIVISION':  ['DENVER', 'JEWEL', 'JEWEL','DENVER', np.nan],
        'STORE_ID': [np.nan,np.nan,'0001','0001',np.nan],
        'FORECAST_DATE':['2021-08-15','2021-08-15','2021-08-15','2021-08-15','2021-08-15'],
        'START_TIME':[np.nan,np.nan,9.0,np.nan,np.nan],
        'FULFILLMENT_TYPE':['DUG','DELIVERY','DUG','DUG','DUG'],
        'MULTIPLIER':[1.04,1.10,1.04,1.10,0.90],
        'OVERWRITE':[np.nan,0,np.nan,1,2],
        'MISSING_ID':[[1,3],[1,3],np.nan,[3],[0,1,3]]
        }
expected = pd.DataFrame(data)

Where ‘MISSING_ID’ is based on missing values of columns DIVISION, STORE_ID, FORECAST_DATE, START_TIME and FULFILMENT_TYPE

I can achieve this by writing multiple np.where conditions and then using a list:

multiplier_df['DIV_MISSING'] = np.where(multiplier_df.DIVISION.isna(),1,0)
multiplier_df['STORE_MISSING'] = np.where(multiplier_df.STORE_ID.isna(),2,0)
multiplier_df['DATE_MISSING'] = np.where(multiplier_df.FORECAST_DATE.isna(),3,0)
multiplier_df['HOUR_MISSING'] = np.where(multiplier_df.START_TIME.isna(),4,0)
multiplier_df['FULFILLMENT_MISSING'] = np.where(multiplier_df.FULFILLMENT_TYPE.isna(),5,0)
multiplier_df['MISSING_ID'] = multiplier_df[['DIV_MISSING', 'STORE_MISSING','DATE_MISSING','HOUR_MISSING','FULFILLMENT_MISSING']].apply(list, axis=1)

However this results in 0’s which are unwanted and thus looking for a simpler way to do it. I see a similar solution in R here : " Return list of column names with missing (NA) data for each row of a data frame in R "
But need to do in Pandas

>Solution :

# 1) Take part of dataframe, only those columns in which we search for nans
df_part = df[['DIVISION', 'STORE_ID', 'FORECAST_DATE', 'START_TIME', 'FULFILLMENT_TYPE']]

# use pd.isna(df).values to get np.array of True/False indicating where nans are
# then use list comprehension and np.where to determine for each row which columns have nans
df['MISSING_ID'] = [np.where(row)[0] for row in pd.isna(df_part).values]
print(df)
  DIVISION STORE_ID FORECAST_DATE  ...  MULTIPLIER OVERWRITE  MISSING_ID
0   DENVER      NaN    2021-08-15  ...        1.04       NaN      [1, 3]
1    JEWEL      NaN    2021-08-15  ...        1.10       0.0      [1, 3]
2    JEWEL     0001    2021-08-15  ...        1.04       NaN          []
3   DENVER     0001    2021-08-15  ...        1.10       1.0         [3]
4      NaN      NaN    2021-08-15  ...        0.90       2.0   [0, 1, 3]

For now it gives [] instead of np.nan, but you can fix this easily.

df['MISSING_ID'] = df['MISSING_ID'].apply(lambda x: x if len(x) else np.nan)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading