Dictionary creation from dataframe, replace values, remove nulls. Pandas

Consider the following dataframe, in which one column (df.VALUE) contains only strings and the other columns contain a mix of nulls and floats.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'VALUE': ['A', 'B', 'C', 'D', 'E'],
    'CAT1': [np.nan, 1, 1, 1, 1],
    'CAT2': [1, 1, np.nan, np.nan, 1]})

I’d like to create a dictionary in which the non-null values in each column are replaced by the corresponding value from the VALUE column and listed under the associated column name as key:

{'CAT1': ['B', 'C', 'D', 'E'], 'CAT2': ['A', 'B', 'E']}

So far, I achieve this using the DataFrame.where and to_dict methods to create the dictionary. The resulting lists still include the null values, so a for loop with a list comprehension is used to retain only the str values:

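# Keep NaN where the cell is null; otherwise substitute the row's VALUE.
d1 = df.iloc[:, 1:].where(df.isna(), df['VALUE'], axis=0).to_dict(orient='list')

# Drop the remaining NaN entries, keeping only the strings.
for key in d1:
    d1[key] = [x for x in d1[key] if isinstance(x, str)]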

A list comprehension inside a for loop doesn’t seem to be the most efficient solution here. Could someone suggest a more elegant approach, please?

Solution:

Using a simple dictionary comprehension and boolean indexing:

cols = df.columns.difference(['VALUE'])
out = {c: df.loc[df[c].eq(1), 'VALUE'].tolist() for c in cols}

Output:

{'CAT1': ['B', 'C', 'D', 'E'], 'CAT2': ['A', 'B', 'E']}
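
Note that eq(1) matches only cells equal to 1, which fits the sample data. The question describes the category columns more generally as nulls and floats, so a notna() test may be closer to the intent. A minimal sketch, assuming a hypothetical value of 2.5 in CAT1 (not in the original data) to show the difference:

```python
import numpy as np
import pandas as pd

# Same layout as the question, with one non-1 float (2.5) added for illustration.
df = pd.DataFrame({
    'VALUE': ['A', 'B', 'C', 'D', 'E'],
    'CAT1': [np.nan, 2.5, 1, 1, 1],
    'CAT2': [1, 1, np.nan, np.nan, 1]})

cols = df.columns.difference(['VALUE'])

# notna() selects every non-null cell, whatever its value.
out = {c: df.loc[df[c].notna(), 'VALUE'].tolist() for c in cols}
print(out)
# {'CAT1': ['B', 'C', 'D', 'E'], 'CAT2': ['A', 'B', 'E']}

# eq(1) skips row 'B' in CAT1 because 2.5 != 1.
out_eq = {c: df.loc[df[c].eq(1), 'VALUE'].tolist() for c in cols}
print(out_eq)
# {'CAT1': ['C', 'D', 'E'], 'CAT2': ['A', 'B', 'E']}
```

If the columns only ever hold 1 or NaN, the two tests select the same rows.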

For completeness, a pure pandas approach could be:

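# dropna() is a no-op where stack() already drops NaN rows and keeps the
# result correct on pandas versions where stack() retains them.
out = (df.set_index('VALUE').stack().dropna().reset_index()
         .groupby('level_1', sort=False)['VALUE'].agg(list).to_dict()
      )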

But it’s very likely less efficient.
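
The efficiency claim can be checked on your own data with timeit. This is a rough sketch, assuming the sample frame from the question; the function names dict_comprehension and stack_groupby are made up for illustration, and no timings are quoted because they depend on data size and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'VALUE': ['A', 'B', 'C', 'D', 'E'],
    'CAT1': [np.nan, 1, 1, 1, 1],
    'CAT2': [1, 1, np.nan, np.nan, 1]})


def dict_comprehension(frame):
    # Boolean indexing, one pass per category column.
    cols = frame.columns.difference(['VALUE'])
    return {c: frame.loc[frame[c].eq(1), 'VALUE'].tolist() for c in cols}


def stack_groupby(frame):
    # Reshape to long form, then group by the original column name.
    # dropna() guards against pandas versions where stack() keeps NaN rows.
    return (frame.set_index('VALUE').stack().dropna().reset_index()
                 .groupby('level_1', sort=False)['VALUE'].agg(list).to_dict())


# Both should build the same mapping (dict equality ignores key order).
assert dict_comprehension(df) == stack_groupby(df)

runs = 200
for fn in (dict_comprehension, stack_groupby):
    seconds = timeit.timeit(lambda: fn(df), number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e3:.3f} ms per call")
```

On a five-row frame the numbers mostly reflect fixed overhead, so it is worth repeating the measurement on a frame of realistic size before drawing conclusions.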
