Looping over text files and getting rows with NaN values in a column in Python

I have multiple text files, each with several tab-delimited columns. From these I would like to extract the rows that have a NaN value in the column PTA, and add the filename as an additional column to those extracted rows.

For example:

File1

i        B        C        D        E        F        G        H        I        J      PTA    K    L
0  0.24055  0.31092  0.03447  0.00015  0.93464  0.08232  0.52609  0.00560  0.44018  0.06337  236  770
1  0.43976  0.45359  0.01220  0.93317  0.05711  0.06316  0.49310  0.05882  0.51825  0.18522  433  573
2  0.48067  0.17356  0.96903  0.02968  0.08864  0.05567  0.30423  0.02337  0.01981  0.56240  481  525
3  0.41872  0.18580  0.00191  0.08048  0.90871  0.02035  0.23598  0.01610  0.19815      NaN  422  584

File2

i        B        C        D        E        F        G        H        I        J      PTA    K    L
0   0.1234  0.31092    0.356  0.00015  0.93464  0.08232  0.52609   0.5873   0.0034  0.06337  367  985
1    0.975    0.367  0.01220    0.875  0.05711   0.0365  0.49310  0.05882  0.51825      NaN  635  784
2  0.48067  0.17356  0.96903  0.02968  0.08864  0.05567  0.30423  0.02337  0.01981    0.823  956  213
3  0.41872  0.18580  0.00191  0.08048  0.90871  0.02035  0.23598  0.01610  0.19815  1.30621  678  943

Expected output df:

i        B        C        D        E        F        G        H        I        J      PTA    K    L  filename
3  0.41872  0.18580  0.00191  0.08048  0.90871  0.02035  0.23598  0.01610  0.19815      NaN  422  584  File1
1    0.975    0.367  0.01220    0.875  0.05711   0.0365  0.49310  0.05882  0.51825      NaN  635  784  File2

I would like to do this across multiple files using Python. So far I have tried the following code, but I am not sure how to turn it into a proper loop:

# import required modules
import os
import pandas as pd

# assign directory
directory = 'files'

for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    df = pd.read_csv(f, sep='\t', comment='#')
    print(df)
    rows = df[df['PTA'].isna()]
    print(rows)

What I am missing is the step that collects those rows into a new data frame.

Solution:

While iterating over your files, add a new column holding the filename to the filtered rows, append() each filtered data frame to a list, and then pd.concat() all data frames in that list:

...
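na_data = []

for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    df = pd.read_csv(f, sep='\t', comment='#')
    # .copy() avoids a SettingWithCopyWarning when the new column is added
    rows = df[df['PTA'].isna()].copy()
    rows['filename'] = filename
    na_data.append(rows)

# combine the per-file results into a single data frame
result = pd.concat(na_data)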
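
For reference, the same approach can be wrapped in a small reusable function. This is only a sketch under a few assumptions that are not in the question: the function name collect_nan_rows is made up for illustration, the files are assumed to end in .txt, and the filename column stores the name without its extension (so File1.txt becomes File1, matching the expected output).

```python
from pathlib import Path

import pandas as pd


def collect_nan_rows(directory, column="PTA", pattern="*.txt"):
    """Return the rows where `column` is NaN across all matching files.

    Hypothetical helper: the name, the "*.txt" pattern and the use of the
    file stem as the label are assumptions, not part of the original answer.
    """
    frames = []
    for path in sorted(Path(directory).glob(pattern)):
        df = pd.read_csv(path, sep="\t", comment="#")
        # assign() returns a new DataFrame, so the filtered slice is never
        # modified in place and no SettingWithCopyWarning is raised
        rows = df[df[column].isna()].assign(filename=path.stem)
        frames.append(rows)

    # pd.concat raises a ValueError on an empty list, so handle the
    # "no matching files" case explicitly
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames)
```

Calling result = collect_nan_rows('files') would then give one data frame with the NaN rows of every file. The original row indices are kept, as in the expected output; pass ignore_index=True to pd.concat if a fresh 0..n-1 index is preferred.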