Filtering a pandas series consisting of lists and NaN values if the elements contain a string

July 19, 2024

Let’s consider this dataframe:

import numpy as np
import pandas as pd
temp = pd.DataFrame({'x': [['ab', 'bc'], ['hg'], np.nan]})
temp

    x
0   [ab, bc]
1   [hg]
2   NaN

I’d like to create a new column called dummy that takes the value of 1 if a row contains letter ‘a’ in any of its elements, value of 0 if it does not, and value of NaN if it’s NaN.

Expected outcome:

    x         dummy
0   [ab, bc]   1
1   [hg]       0
2   NaN        NaN

Sounds simple but I’m stuck. What I’ve tried:

temp['dummy'] = np.where(temp.x.str.contains('a', case = False, na = False), 1, 0)

This assigns 0 to every row: .str.contains only tests string elements, so each list yields a missing result, which na = False then turns into False.
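A quick check, as a minimal sketch on the same data, suggests what happens in this first attempt: elements that are not strings (the lists as well as the NaN) come back as missing, and na = False then fills them with False.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({'x': [['ab', 'bc'], ['hg'], np.nan]})

# Without na=..., non-string elements (the lists and the NaN) come back as missing
raw = temp['x'].str.contains('a', case=False)
print(raw.isna().tolist())

# na=False replaces those missing results with False, hence a column of 0s
filled = temp['x'].str.contains('a', case=False, na=False)
print(filled.tolist())
```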

temp['dummy'] = np.where(temp.x.astype(str).str.contains('a', case = False, na = False), 1, 0)

astype(str) gets past that by converting each list to its string representation, but now np.nan becomes the string 'nan' (which itself contains an 'a'), so na = False no longer catches it.

temp['dummy'] = np.where(all([temp.x.astype(str).str.contains('a', case = False, na = False) , temp.x.astype(str) != 'nan']), 1, 0)

I think my second condition should take care of the above issue, but now I get an error: ValueError: The truth value of a Series is ambiguous.
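The ValueError comes from the built-in all(), which asks each Series for a single truth value. As a rough sketch (the has_a mask below is an illustrative stand-in built with map, not the astype(str) expression from the attempt), the element-wise operator & combines two boolean Series row by row instead:

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({'x': [['ab', 'bc'], ['hg'], np.nan]})

# Illustrative per-row mask: True only for lists with an item containing 'a'
has_a = temp['x'].map(lambda v: isinstance(v, list) and any('a' in s for s in v))
not_missing = temp['x'].notna()

# Built-in all() calls bool() on each Series, which raises
try:
    all([has_a, not_missing])
except ValueError as err:
    print('ValueError:', err)

# The element-wise operator & combines the two masks row by row
combined = has_a & not_missing
print(combined.tolist())
```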

temp['dummy'] = [1 if all(['a' in y , y != np.nan]) else 0 for y in temp.x ]

Error: TypeError: argument of type 'float' is not iterable
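Two things seem to go wrong in this attempt, sketched below for a single NaN value: the list given to all() is built eagerly, so 'a' in y runs on the float before anything else, and y != np.nan is True for every value because NaN never compares equal to anything.

```python
import numpy as np

y = np.nan

# NaN never compares equal to anything, itself included,
# so `y != np.nan` is True for every value and filters nothing.
nan_check_is_useless = (y != np.nan)

# The list passed to all() is built eagerly, so 'a' in y is evaluated
# (and fails on the float) before all() sees the second condition.
try:
    all(['a' in y, y != np.nan])
    error_message = None
except TypeError as err:
    error_message = str(err)

# Checking the type first, with a short-circuiting `and`, avoids the error.
safe = isinstance(y, list) and any('a' in s for s in y)
print(nan_check_is_useless, error_message, safe)
```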

The only thing that will work is:

temp['dummy'] = np.nan # placeholder
temp['dummy'][temp.x.notnull()] = np.where(temp[temp.x.notnull()].x.astype(str).str.contains('a', case = False, na = False), 1, 0)
temp

But it’s two lines and ugly.

Solution:

Since pandas is not designed to work with nested components, use a list comprehension:

from collections.abc import Iterable

temp['dummy'] = [int(any('a' in item for item in x))
                 if isinstance(x, Iterable)
                 else x for x in temp['x']]
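Note that 'a' in item is case-sensitive, whereas the attempts in the question passed case = False. A possible case-insensitive variant, assuming every list item is a string (the capitalised 'Ab' is sample data added here for illustration), lower-cases each item first:

```python
from collections.abc import Iterable

import numpy as np
import pandas as pd

# 'Ab' is capitalised on purpose to exercise the case-insensitive test
temp = pd.DataFrame({'x': [['Ab', 'bc'], ['hg'], np.nan]})

# Lower-casing each item before the membership test mimics case=False
temp['dummy'] = [int(any('a' in item.lower() for item in x))
                 if isinstance(x, Iterable)
                 else x for x in temp['x']]
print(temp)
```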

Variant (which might not necessarily be faster):

m = temp['x'].notna()

temp.loc[m, 'dummy'] = [int(any('a' in item for item in x))
                        for x in temp.loc[m, 'x']]

Or a pure pandas approach with explode, str.contains and aggregation with groupby.any:

temp['dummy'] = (temp['x'].explode()
                 .dropna()
                 .str.contains('a')
                 .groupby(level=0).any()
                 .astype(float)
                )
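To see why the NaN row survives, it may help to split the chain into its intermediate steps. The sketch below uses the same data; the variable names exploded, flags and per_row are only illustrative.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({'x': [['ab', 'bc'], ['hg'], np.nan]})

# One row per list item; the original index label is repeated for each item
exploded = temp['x'].explode()
print(exploded)

# Per-item boolean test, with the NaN row removed beforehand
flags = exploded.dropna().str.contains('a')

# Collapse back to one value per original index label
per_row = flags.groupby(level=0).any()
print(per_row)

# Label 2 is absent from per_row, so index-aligned assignment leaves it as NaN
temp['dummy'] = per_row.astype(float)
print(temp)
```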

Output:

          x  dummy
0  [ab, bc]    1.0
1      [hg]    0.0
2       NaN    NaN

Timings

With random items among [['ab', 'bc'], ['hg'], np.nan]: (timing plot not available)

With random items made of 100-item lists and NaNs: (timing plot not available)

In this second case the pure pandas approach is slower, as it does not benefit from short-circuiting the way any() does.
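A small harness along these lines could be used to compare the two approaches locally. It is only a sketch: the data is a hypothetical best case for short-circuiting (the match is always the first item) and not the random data used for the original timings, and the function names are made up for illustration.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data: 100-item lists whose first item matches, plus NaNs
rows = [['ab'] + ['hg'] * 99, np.nan] * 500
df = pd.DataFrame({'x': rows})

def with_comprehension(s):
    # any() stops at the first item containing 'a'
    return [int(any('a' in item for item in x)) if isinstance(x, list) else x
            for x in s]

def with_explode(s):
    # every item of every list is tested before the groupby
    return (s.explode().dropna().str.contains('a')
             .groupby(level=0).any().astype(float))

# Both approaches should agree before their speed is compared
a = pd.Series(with_comprehension(df['x']), index=df.index)
b = with_explode(df['x']).reindex(df.index)

print(timeit.timeit(lambda: with_comprehension(df['x']), number=5))
print(timeit.timeit(lambda: with_explode(df['x']), number=5))
```

Absolute numbers will depend on the machine and pandas version, so only the relative difference is meaningful.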

pandas

by MR

Published July 19, 2024
