Let’s consider this dataframe:
import numpy as np
import pandas as pd

temp = pd.DataFrame({'x': [['ab', 'bc'], ['hg'], np.nan]})
temp
          x
0  [ab, bc]
1      [hg]
2       NaN
I’d like to create a new column called dummy that takes the value 1 if any element of the row’s list contains the letter ‘a’, 0 if none does, and NaN if the row itself is NaN.
Expected outcome:
          x  dummy
0  [ab, bc]      1
1      [hg]      0
2       NaN    NaN
Sounds simple but I’m stuck. What I’ve tried:
1)
temp['dummy'] = np.where(temp.x.str.contains('a', case = False, na = False), 1, 0)
assigns 0 everywhere: str.contains returns NaN for non-string (list) elements rather than searching inside the list, and na = False turns those NaNs into False
2)
temp['dummy'] = np.where(temp.x.astype(str).str.contains('a', case = False, na = False), 1, 0)
astype(str) takes care of the above issue by flattening each list into a string, but now np.nan becomes the string 'nan', so na = False no longer applies to it.
3)
temp['dummy'] = np.where(all([temp.x.astype(str).str.contains('a', case = False, na = False) , temp.x.astype(str) != 'nan']), 1, 0)
I think my second condition should take care of the 'nan' issue, but now I get an error: ValueError: The truth value of a Series is ambiguous. (Python's all() tries to collapse each boolean Series into a single True/False.)
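A minimal sketch of what attempt 3 was reaching for: combining the two masks element-wise with & instead of all() avoids the ambiguity error (though np.where(..., 1, 0) still cannot produce NaN for the missing row, so this only fixes the crash, not the requirement):

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({'x': [['ab', 'bc'], ['hg'], np.nan]})

# all([...]) calls bool() on each Series, which is ambiguous;
# the element-wise & operator combines the two boolean masks instead.
s = temp['x'].astype(str)
mask = s.str.contains('a', case=False) & (s != 'nan')
temp['dummy'] = np.where(mask, 1, 0)  # NaN row still becomes 0, not NaN
```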
4)
temp['dummy'] = [1 if all(['a' in y , y != np.nan]) else 0 for y in temp.x ]
Error: TypeError: argument of type 'float' is not iterable (the NaN row is a float, so 'a' in y fails before the NaN check ever runs).
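Two pitfalls in attempt 4 are easy to demonstrate in isolation: NaN is a float, so membership testing on it raises, and NaN compares unequal to everything, including itself, so a y != np.nan guard is always True and never filters anything:

```python
import numpy as np

# NaN never equals anything, not even itself
print(np.nan != np.nan)  # True

# NaN is a float, not a container, so 'in' fails on it
try:
    'a' in np.nan
except TypeError as e:
    print(e)  # argument of type 'float' is not iterable
```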
5)
The only thing that will work is:
temp['dummy'] = np.nan  # placeholder
mask = temp.x.notnull()
# .loc avoids the chained assignment (SettingWithCopyWarning) of temp['dummy'][mask] = ...
temp.loc[mask, 'dummy'] = np.where(
    temp.loc[mask, 'x'].astype(str).str.contains('a', case=False), 1, 0)
temp
But it takes several lines and is ugly.
Solution:
Since pandas is not designed to work with nested data such as lists in cells, use a list comprehension:
from collections.abc import Iterable
temp['dummy'] = [int(any('a' in item for item in x))
if isinstance(x, Iterable)
else x for x in temp['x']]
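As a quick sanity check on the sample frame (the test works because NaN is a float, hence not Iterable, so it falls through to the else branch unchanged):

```python
from collections.abc import Iterable

import numpy as np
import pandas as pd

temp = pd.DataFrame({'x': [['ab', 'bc'], ['hg'], np.nan]})

# lists are Iterable -> 0/1 from the any() test; NaN (float) passes through
temp['dummy'] = [int(any('a' in item for item in x))
                 if isinstance(x, Iterable)
                 else x for x in temp['x']]
```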
Variant (which might not necessarily be faster):
m = temp['x'].notna()
temp.loc[m, 'dummy'] = [int(any('a' in item for item in x))
for x in temp.loc[m, 'x']]
Or a pure pandas approach with explode, str.contains and aggregation with groupby.any:
temp['dummy'] = (temp['x'].explode()
.dropna()
.str.contains('a')
.groupby(level=0).any()
.astype(float)
)
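A step-by-step view of the same pipeline, showing how index alignment restores the NaN row: explode repeats the original index per list element, dropna removes the NaN before the string operation, and assigning the aggregated result back to the frame leaves the missing index 2 as NaN.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({'x': [['ab', 'bc'], ['hg'], np.nan]})

exploded = temp['x'].explode()              # index 0,0,1,2 -> 'ab','bc','hg',NaN
hits = exploded.dropna().str.contains('a')  # index 0,0,1   -> True,False,False
agg = hits.groupby(level=0).any()           # index 0,1     -> True,False
temp['dummy'] = agg.astype(float)           # index 2 absent -> aligns to NaN
```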
Output:
          x  dummy
0  [ab, bc]    1.0
1      [hg]    0.0
2       NaN    NaN
timings
With random items among [['ab', 'bc'], ['hg'], np.nan]: (timing plot omitted)
With random lists of 100 items and NaNs: (timing plot omitted)
In this case the pure pandas approach is slower, as it does not benefit from short-circuiting the way any does.
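The short-circuiting advantage is easy to see in isolation: any() over a generator stops at the first match, while the explode pipeline always materializes and tests every element.

```python
checked = []

def probe(item):
    checked.append(item)      # record every element actually inspected
    return 'a' in item

items = ['ab'] + ['zz'] * 99  # the match sits at the front of a 100-item list

any(probe(i) for i in items)  # generator: any() stops at the first True
print(len(checked))           # 1 -- only the first element was inspected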

