Looping over text files and getting rows with NaN values in a column in Python

August 16, 2022

I have multiple text files, each with multiple tab-delimited columns. From these I would like to extract the rows with NaN values in column PTA and add the filename as an additional column to the extracted rows.

So for example:

File1

             i       B       C  D  E  F  G       H       I       J      PTA  K  L
0  0.24055  0.31092   0.03447   0.00015   0.93464   0.08232  0.52609  0.00560  0.44018  0.06337   236   770
1  0.43976  0.45359   0.01220   0.93317   0.05711   0.06316  0.49310  0.05882  0.51825  0.18522   433   573
2  0.48067  0.17356   0.96903   0.02968   0.08864   0.05567  0.30423  0.02337  0.01981  0.56240   481   525
3  0.41872  0.18580   0.00191   0.08048   0.90871   0.02035  0.23598  0.01610  0.19815  NaN   422   584

File2

             i       B       C  D  E  F  G       H       I       J      PTA  K  L
0  0.1234  0.31092   0.356   0.00015   0.93464   0.08232  0.52609  0.5873  0.0034  0.06337   367   985
1  0.975  0.367   0.01220   0.875   0.05711   0.0365  0.49310  0.05882  0.51825  NaN   635   784
2  0.48067  0.17356   0.96903   0.02968   0.08864   0.05567  0.30423  0.02337  0.01981  0.823   956   213
3  0.41872  0.18580   0.00191   0.08048   0.90871   0.02035  0.23598  0.01610  0.19815  1.30621   678   943

Expected output df:

      i       B       C  D  E  F  G       H       I       J      PTA  K  L  filename
3  0.41872  0.18580   0.00191   0.08048   0.90871   0.02035  0.23598  0.01610  0.19815  NaN   422   584   File1
1  0.975  0.367   0.01220   0.875   0.05711   0.0365  0.49310  0.05882  0.51825  NaN   635   784   File2

I would like to do this over multiple files using Python. So far I have tried the code below, but I am not sure how to turn it into a proper loop:

# import required module
import os
import pandas as pd

# assign directory
directory = 'files'

for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    df = pd.read_csv(f, sep='\t', comment='#')
    print(df)
    rows = df[df['PTA'].isna()]
    print(rows)

What I am missing is the part that adds those rows to a new data frame.

>Solution :

While iterating over the files, add a column holding the filename to the filtered rows, append() that data frame to a list, and after the loop pd.concat() all data frames in the list:

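...
na_data = []

for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    df = pd.read_csv(f, sep='\t', comment='#')
    # copy the filtered rows so the new column is not set on a slice of df
    rows = df[df['PTA'].isna()].copy()
    rows['filename'] = filename
    na_data.append(rows)

# combine the per-file results into one data frame
result = pd.concat(na_data)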
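
For reference, the same idea can be wrapped into a self-contained sketch. The function name collect_nan_rows, the "*.txt" pattern, and the two tiny demo files are assumptions made for illustration, not part of the original question. It uses DataFrame.assign() to add the filename column, which returns a new data frame instead of modifying the filtered slice, and it returns an empty data frame when no files match, since pd.concat() raises a ValueError on an empty list.

```python
import tempfile
from pathlib import Path

import pandas as pd


def collect_nan_rows(directory, column="PTA", pattern="*.txt"):
    """Collect rows where `column` is NaN from every matching file.

    Hypothetical helper: the name and the default pattern are assumptions.
    """
    frames = []
    for path in sorted(Path(directory).glob(pattern)):
        df = pd.read_csv(path, sep="\t", comment="#")
        # assign() returns a new DataFrame, so the filtered slice is not modified in place
        frames.append(df[df[column].isna()].assign(filename=path.name))
    if not frames:
        # pd.concat() raises ValueError when given an empty list
        return pd.DataFrame()
    return pd.concat(frames)


# Usage with two small made-up files (only three columns, for brevity)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "File1.txt").write_text("J\tPTA\tK\n0.44018\t0.06337\t236\n0.19815\tNaN\t422\n")
    Path(tmp, "File2.txt").write_text("J\tPTA\tK\n0.51825\tNaN\t635\n0.01981\t0.823\t956\n")
    result = collect_nan_rows(tmp)

print(result)
```

The rows keep their original per-file index; passing ignore_index=True to pd.concat() would renumber them if that is preferred.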