Filtering a pandas series consisting of lists and NaN values if the elements contain a string

Let’s consider this dataframe:

import numpy as np
import pandas as pd
temp = pd.DataFrame({'x': [['ab', 'bc'], ['hg'], np.nan]})
temp

    x
0   [ab, bc]
1   [hg]
2   NaN

I’d like to create a new column called dummy that takes the value of 1 if a row contains letter ‘a’ in any of its elements, value of 0 if it does not, and value of NaN if it’s NaN.

Expected outcome:

    x         dummy
0   [ab, bc]   1
1   [hg]       0
2   NaN        NaN

Sounds simple but I’m stuck. What I’ve tried:

1)

temp['dummy'] = np.where(temp.x.str.contains('a', case = False, na = False), 1, 0)

This assigns 0 to every row: .str.contains only tests string elements, so each list yields NaN, which na = False then turns into False.
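To make that behaviour visible, here is a minimal sketch that runs the same call with and without na = False. The variable names (raw, filled, all_missing, none_matched) are made up for illustration.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({'x': [['ab', 'bc'], ['hg'], np.nan]})

# Without na=False, the result is NaN for every element that is not a
# string, which here means both lists as well as the real NaN.
raw = temp['x'].str.contains('a', case=False)
all_missing = raw.isna().all()

# na=False then replaces each of those NaN results with False,
# so np.where(...) would produce 0 for every row.
filled = temp['x'].str.contains('a', case=False, na=False)
none_matched = not filled.any()
```

Both flags come out true for this frame, which matches the all-zero column described above.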

2)

temp['dummy'] = np.where(temp.x.astype(str).str.contains('a', case = False, na = False), 1, 0)

astype(str) takes care of the above issue by flattening each list into a string, but now np.nan becomes the string ‘nan’, so na = False no longer applies to it (and ‘nan’ itself contains an ‘a’).
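A small sketch of the problem, under one assumption: it uses map(str) as an explicit stand-in for the string conversion, since how astype(str) treats missing values may differ between pandas versions. The names as_text and matches are illustrative only.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({'x': [['ab', 'bc'], ['hg'], np.nan]})

# Convert every element with Python's str(); the missing value
# becomes the ordinary three-letter string 'nan'.
as_text = temp['x'].map(str)
last_value = as_text.iloc[2]

# 'nan' is now a normal string, so na=False has nothing to act on,
# and the string itself contains the letter 'a'.
matches = as_text.str.contains('a', case=False, na=False)
```

Here the missing row is reported as a match, which is the opposite of the NaN that was wanted.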

3)

temp['dummy'] = np.where(all([temp.x.astype(str).str.contains('a', case = False, na = False) , temp.x.astype(str) != 'nan']), 1, 0)

I think my second condition should take care of the above issue, but now I get an error: ValueError: The truth value of a Series is ambiguous.
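The error comes from Python's built-in all(), which asks each Series for a single truth value. A hedged sketch of the difference, combining the two conditions element-wise with & instead (has_a, not_null and combined are illustrative names, and map(str) stands in for the string conversion):

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({'x': [['ab', 'bc'], ['hg'], np.nan]})

has_a = temp['x'].map(str).str.contains('a', case=False)
not_null = temp['x'].notna()

# all() calls bool() on each Series, which pandas refuses to do.
try:
    all([has_a, not_null])
    error_message = ''
except ValueError as exc:
    error_message = str(exc)

# & combines the two boolean Series row by row.
combined = has_a & not_null
dummy = np.where(combined, 1, 0)
```

Note that this only removes the exception: the missing row still ends up as 0 rather than NaN, so it does not produce the expected outcome by itself.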

4)

temp['dummy'] = [1 if all(['a' in y , y != np.nan]) else 0 for y in temp.x ]

Error: TypeError: argument of type 'float' is not iterable
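Three separate things go wrong in that comprehension, and each can be checked in isolation. The sketch below uses only plain Python and NumPy; the variable names are illustrative.

```python
import numpy as np

# 1) `in` on a float raises TypeError, which is the reported error.
try:
    'a' in np.nan
    raised_type_error = False
except TypeError:
    raised_type_error = True

# 2) NaN never compares equal to anything, itself included,
#    so `y != np.nan` is True for every y and filters nothing.
nan_check_is_always_true = (np.nan != np.nan)

# 3) `in` on a list tests whole elements, not substrings.
element_match = 'a' in ['ab', 'bc']                          # whole-element test
substring_match = any('a' in item for item in ['ab', 'bc'])  # per-item test
```

So even without the exception, the test would need a per-item substring check and a real missing-value test such as pd.isna.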

5)

The only thing that will work is:

temp['dummy'] = np.nan # placeholder
temp.loc[temp.x.notnull(), 'dummy'] = np.where(temp[temp.x.notnull()].x.astype(str).str.contains('a', case = False, na = False), 1, 0)
temp

But it’s two lines and ugly.

Solution:

Since pandas is not designed to work with nested objects such as lists stored in cells, use a list comprehension:

from collections.abc import Iterable

temp['dummy'] = [int(any('a' in item for item in x))
                 if isinstance(x, Iterable)
                 else x for x in temp['x']]

Variant (which might not necessarily be faster):

m = temp['x'].notna()

temp.loc[m, 'dummy'] = [int(any('a' in item for item in x))
                        for x in temp.loc[m, 'x']]

Or a pure pandas approach with explode, str.contains and aggregation with groupby.any:

temp['dummy'] = (temp['x'].explode()
                 .dropna()
                 .str.contains('a')
                 .groupby(level=0).any()
                 .astype(float)
                )

Output:

          x  dummy
0  [ab, bc]    1.0
1      [hg]    0.0
2       NaN    NaN
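For readers who want to see why the missing row ends up as NaN in the explode version, here is a step-by-step sketch of the same pipeline. The intermediate names (exploded, per_item, per_row, aligned) are introduced only for this illustration, and the explicit reindex stands in for the index alignment that happens on column assignment.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({'x': [['ab', 'bc'], ['hg'], np.nan]})

# One row per list item; the original row label is repeated.
exploded = temp['x'].explode()

# Drop the missing row, then test each remaining item.
per_item = exploded.dropna().str.contains('a')

# Collapse back to one value per original row label.
per_row = per_item.groupby(level=0).any()

# Row 2 has no entry in per_row, so aligning on the original
# index leaves it as NaN.
aligned = per_row.astype(float).reindex(temp.index)
```

The dropna step is what removes row 2 from the grouped result, and alignment is what brings it back as NaN.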

Timings

With random items among [['ab', 'bc'], ['hg'], np.nan]:

[Figure: timing comparison of the approaches; image not available]

With random items with list of 100 items and NaNs:

[Figure: timing comparison of the approaches; image not available]

In this case the pure pandas approach is slower because, unlike Python’s built-in any, it cannot short-circuit: it tests every item of every list, whereas any stops at the first match.
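The short-circuiting can be observed directly by counting how many items are actually inspected. This is a rough sketch on made-up data (one list of 100 items whose first item matches), not a benchmark; contains_a and checked are hypothetical helpers.

```python
import pandas as pd

# One long list whose very first item already matches.
items = ['ab'] + ['zz'] * 99

checked = []  # records every item the generator actually inspects

def contains_a(item):
    checked.append(item)
    return 'a' in item

# any() stops consuming the generator at the first True.
found = any(contains_a(item) for item in items)
n_checked_by_any = len(checked)

# The vectorised test has no early exit: one result per item.
n_checked_by_pandas = len(pd.Series(items).str.contains('a'))
```

With a match in first position, any inspects a single item while the vectorised call evaluates all 100; how much that matters in practice depends on where matches tend to occur in the real data.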
