Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas filter rows based on certain number of certain columns being NaN

I have a data set like this:

seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
A   NaN       4       5       NaN     4       7       NaN       6       8
B   7         2       9       2       1       9       2         1       1 
C   NaN       4       4       2       4       NaN     2         6       8
D   5         NaN     NaN     2       5       9       NaN       1       1 

I want to remove the rows where at least three of the columns marked ‘T01’ are NaN

So the output would be:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
B   7         2       9       2       1       9       2         1       1 
C   NaN       4       4       2       4       NaN     2         6       8
D   5         NaN     NaN     2       5       9       NaN       1       1 

Because the A row there is NaN is S01-T01, S02-T02, S03-T01. Row D also has three NaNs, but it is kept in because I am only interested in removing the rows if specifically there is >=3 NaN in the column names that have a T01 in them.

I know this could be simple to do, I wrote:

import sys
import pandas as pd

df = pd.read_csv('data.csv',sep=',')
print(df.columns.str.contains['T01'])

To first get all of the cells with T01 in them, and then I was going to count them.

I got the error:

    print(df.columns.str.contains['T01'])
TypeError: 'method' object is not subscriptable

Then I thought about iterating through the rows and counting instead e.g.:

for index,row in df.iterrows():
        if 'T01' in row:
                print(row)

This runs without error but prints nothing to screen. Could someone demonstrate a better way to do this?

>Solution :

If you select only the ‘T01’ columns, you can take the rowwise sum of nulls and keep only rows that are less than 3.

df.loc[df[[x for x in df if 'T01' in x]].isnull().sum(1).lt(3)]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading