I have a dataframe as follows:
df =
A col_1 col_45 col_9 col_10
1.0 4.0 45.0 NaN 34.9 NaN
2.0 4.0 NaN NaN 23.4 45.6
3.0 49.2 10.8 NaN NaN 23.8
For every row in the dataframe, I want to check if there is any value present for col_1, col_45, col_9 and col_10. If there is a value present, I want to get the number for the name of the column and put in a list/array. For example.
In the first row,there are values present for col_1 and col_9. So I want to get the numbers 1 and 9 from the column names and put in a list/array. This should be done for every row.
Is there an easy way to do this?
>Solution :
For column names, you could use a stack (to get rid of all NaN automatically), then a GroupBy.agg to form the lists:
out = (df.set_index('A', append=True)
.stack().reset_index(-1)
.groupby(level=0)['level_2'].agg(list)
)
output:
1.0 [col_1, col_9]
2.0 [col_9, col_10]
3.0 [col_1, col_10]
Name: level_2, dtype: object
For digits:
out = (df.set_index('A', append=True)
.stack().reset_index(-1)
['level_2'].str.extract('_(\d+)', expand=False).astype(int)
.groupby(level=0).agg(list)
)
output:
1.0 [1, 9]
2.0 [9, 10]
3.0 [1, 10]
Name: level_2, dtype: object
Alternative
Other approach, you can use a mask and a matrix multiplication:
cols = df.filter(like='col').columns
# Index(['col_1', 'col_45', 'col_9', 'col_10'], dtype='object')
int_cols = cols.str.extract('_(\d+)$', expand=False).astype(int)
# Int64Index([1, 45, 9, 10], dtype='int64')
m = df[cols].notna()
# col_1 col_45 col_9 col_10
# 1.0 True False True False
# 2.0 False False True True
# 3.0 True False False True
out = (m.astype(int).mul(int_cols).where(m).convert_dtypes()
.stack().groupby(level=0).agg(list)
)
# 1.0 [1, 9]
# 2.0 [9, 10]
# 3.0 [1, 10]
# dtype: object