Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Python pandas use fillna() to avoid error splitting on NaN value

my_df = pd.DataFrame(data = { 'col0': ['1%2%', '3%4%'], 'col1': [np.nan, np.nan], 'col2': ['+1-2', '+2-4'], 'col3': [np.nan, '+2-4'], 'col4': [np.nan, np.nan] })

my_df[['col0a', 'col0b']] = my_df['col0'].fillna('%').str.split('%', expand = True).iloc[:, 0:2]
my_df[['col1a', 'col1b']] = my_df['col1'].fillna('%').str.split('%', expand = True).iloc[:, 0:2]
my_df[['col2a', 'col2b']] = my_df['col2'].str.split('(?<=[\d]|K)(?=[+-])', expand=True)
my_df[['col3a', 'col3b']] = my_df['col3'].str.split('(?<=[\d]|K)(?=[+-])', expand=True)
my_df[['col4a', 'col4b']] = my_df['col4'].str.split('(?<=[\d]|K)(?=[+-])', expand=True)

my_df

This throws an error on the line my_df[['col4a', 'col4b']] since the entire column is NaN values. For col1, there is no error thrown since we fillna with %, and the split then splits on the % resulting in two columns with empty strings, which is exactly what we want.

What fillna() can we add to col4 such that it won’t throw an error? The split for col4 is more complex (looking for the pattern of +#-#) than the split for col1. We’ve tried fillna('+-') but this does not work. In general, we are looking for a solution that outputs empty strings for col4a and col4b when col4 has NaN as its value.

Edit: worded differently, I need a string for fillna that, when passed to .split('(?<=[\d]|K)(?=[+-])', expand=True), would result in 2 empty strings, although perhaps I can replace with something like fillna(+9999-9999) and then replace 9999 with empty strings…

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

A (probably) better solution is to use .str.extract instead of .str.split, which is better-suited for your purposes:

my_df = pd.DataFrame(data = { 'col0': ['1%2%', '3%4%'], 'col1': [np.nan, np.nan], 'col2': ['+1-2', '+2-4'], 'col3': [np.nan, '+2-4'], 'col4': [np.nan, np.nan] })
my_df = my_df.fillna('')

pat1 = r'(\d+%)(\d+%)'
pat2 = r'([+-](?:[\d]+|K))([+-](?:[\d]+|K))'

my_df[['col0a', 'col0b']] = my_df['col0'].str.extract(pat1)
my_df[['col1a', 'col1b']] = my_df['col1'].str.extract(pat1)
my_df[['col2a', 'col2b']] = my_df['col2'].str.extract(pat2)
my_df[['col3a', 'col3b']] = my_df['col3'].str.extract(pat2)
my_df[['col4a', 'col4b']] = my_df['col4'].str.extract(pat2)
my_df = my_df.fillna('')

Output:

>>> my_df
   col0 col1  col2  col3 col4 col0a col0b col1a col1b col2a col2b col3a col3b col4a col4b
0  1%2%       +1-2               1%    2%                +1    -2                        
1  3%4%       +2-4  +2-4         3%    4%                +2    -4    +2    -4
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading