Suppose the following dataframe, in which one column (df.VALUE) contains only string values and several other columns contain a combination of nulls and floats.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'VALUE': ['A', 'B', 'C', 'D', 'E'],
    'CAT1': [np.nan, 1, 1, 1, 1],
    'CAT2': [1, 1, np.nan, np.nan, 1]})
I’d like to create a dictionary in which the non-null values for each column are replaced by the corresponding value from the VALUE column and listed with the associated column name as key:
{'CAT1': ['B', 'C', 'D', 'E'], 'CAT2': ['A', 'B', 'E']}
So far, I achieve this using the DataFrame.where and to_dict methods to create the dictionary. This includes null values in the lists, so a for loop with a list comprehension is used to retain only the str values:
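# Keep NaN where the cell is null, otherwise substitute the row's VALUE
d1 = df.iloc[:, 1:].where(df.isna(), df['VALUE'], axis=0).to_dict(orient='list')
# Drop the remaining NaN entries, keeping only the strings
for key in d1.keys():
    d1[key] = [x for x in d1[key] if isinstance(x, str)]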
A list comprehension within a for loop doesn’t seem to be the most efficient solution here, and I hoped someone could suggest a more elegant approach.
>Solution :
Using a simple dictionary comprehension and boolean indexing (this relies on every non-null value being 1, as in the example):
cols = df.columns.difference(['VALUE'])
out = {c: df.loc[df[c].eq(1), 'VALUE'].tolist() for c in cols}
Output:
{'CAT1': ['B', 'C', 'D', 'E'], 'CAT2': ['A', 'B', 'E']}
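The question describes the other columns as holding nulls and floats in general, so a variant that tests for non-null rather than for equality with 1 may be safer. The following is a minimal sketch under that assumption. The float values 2.5, 0.0 and 7 and the name out_notna are made up for illustration and are not from the original post.

```python
import numpy as np
import pandas as pd

# Same shape as the example, but with arbitrary floats (hypothetical values)
df = pd.DataFrame({
    'VALUE': ['A', 'B', 'C', 'D', 'E'],
    'CAT1': [np.nan, 2.5, 1, 0.0, 1],
    'CAT2': [1, 1, np.nan, np.nan, 7],
})

cols = df.columns.difference(['VALUE'])

# notna() selects every non-null cell, whatever its value
out_notna = {c: df.loc[df[c].notna(), 'VALUE'].tolist() for c in cols}
print(out_notna)
# {'CAT1': ['B', 'C', 'D', 'E'], 'CAT2': ['A', 'B', 'E']}
```

With eq(1), the rows holding 2.5, 0.0 and 7 would be skipped; notna() keeps them.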
For completeness, a pure pandas approach could be:
out = (df.set_index('VALUE').stack().reset_index()
         .groupby('level_1', sort=False)['VALUE'].agg(list).to_dict()
      )
But it’s very likely less efficient.
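The stacking approach depends on stack() discarding the NaN entries. A possible alternative that drops them explicitly is sketched here with melt and dropna. It is a different technique from the one in the answer, and the name out_melt is only illustrative. The column names 'variable' and 'value' are the defaults that melt produces.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'VALUE': ['A', 'B', 'C', 'D', 'E'],
    'CAT1': [np.nan, 1, 1, 1, 1],
    'CAT2': [1, 1, np.nan, np.nan, 1],
})

out_melt = (
    df.melt(id_vars='VALUE')           # long format: VALUE, variable, value
      .dropna(subset=['value'])        # remove the null cells explicitly
      .groupby('variable', sort=False)['VALUE']
      .agg(list)                       # collect the VALUE strings per column
      .to_dict()
)
print(out_melt)
# {'CAT1': ['B', 'C', 'D', 'E'], 'CAT2': ['A', 'B', 'E']}
```

Like the stack version, this reshapes the whole frame before grouping, so it is unlikely to beat the dictionary comprehension on speed.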