Refactoring pandas using an iterator via chunksize

I am looking for advice on using a pandas iterator.

I was parsing a file with Python pandas, but the size of the input (output from a bioinformatics program called eggNOG) is causing a RAM bottleneck: pandas simply cannot process the file.

The obvious solution is to switch to an iterator, which in pandas means the chunksize option.


import pandas as pd
import numpy as np

df = pd.read_csv('myinfile.csv', sep="\t", chunksize=100)

What's changed from the original code is the chunksize=100 argument, which makes read_csv return an iterator rather than a DataFrame.

The next step is a simple operation: drop a few columns, replace all ‘-‘ characters with np.nan, and then write the whole file.

df.drop(['score', 'evalue', 'Description', 'EC', 'PFAMs'],axis=1).replace('-', np.nan)
df.to_csv('my.csv',sep='\t',index=False)

How is this done under a pandas iterator?

Solution:

IIUC, you can do:

cols_to_drop = ['score', 'evalue', 'Description', 'EC', 'PFAMs']
data = []
for chunk in pd.read_csv('myinfile.csv', sep='\t', na_values='-', chunksize=100):
    chunk = chunk.drop(columns=cols_to_drop)
    data.append(chunk)
pd.concat(data).to_csv('my.csv', sep='\t', index=False)
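If even the concatenated result is too big for RAM, you can skip pd.concat entirely and append each processed chunk to the output file as you go, so memory use stays bounded by the chunk size. This is a minimal sketch of that approach; the tiny synthetic input at the top merely stands in for the real eggNOG file, and the column names are taken from the question.

```python
import pandas as pd

# Tiny synthetic input standing in for the real eggNOG file (illustration only)
with open('myinfile.csv', 'w') as f:
    f.write('query\tscore\tevalue\tDescription\tEC\tPFAMs\ttarget\n')
    f.write('q1\t10\t0.1\tdesc\t-\tPF1\t-\n')
    f.write('q2\t20\t0.2\t-\tEC1\tPF2\tt2\n')

cols_to_drop = ['score', 'evalue', 'Description', 'EC', 'PFAMs']

first = True
for chunk in pd.read_csv('myinfile.csv', sep='\t', na_values='-', chunksize=100):
    chunk = chunk.drop(columns=cols_to_drop)
    # overwrite on the first chunk, append afterwards; emit the header only once
    chunk.to_csv('my.csv', sep='\t', index=False,
                 mode='w' if first else 'a', header=first)
    first = False
```

Each chunk is written as soon as it is processed, so only one chunk of 100 rows is ever held in memory at a time.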

If you know the columns you want to keep instead of which ones you want to drop, use:

cols_to_keep = ['col1', 'col2', 'col3']
data = []
for chunk in pd.read_csv('myinfile.csv', usecols=cols_to_keep, sep='\t', na_values='-', chunksize=100):
    data.append(chunk)
pd.concat(data).to_csv('my.csv', sep='\t', index=False)
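If you only know the columns to drop, you can still build cols_to_keep cheaply by reading just the header row with nrows=0. A minimal sketch, again using a tiny synthetic file in place of the real input:

```python
import pandas as pd

# Tiny synthetic input for illustration
with open('myinfile.csv', 'w') as f:
    f.write('query\tscore\tevalue\tDescription\tEC\tPFAMs\ttarget\n')
    f.write('q1\t10\t0.1\tdesc\t-\tPF1\tt1\n')

cols_to_drop = ['score', 'evalue', 'Description', 'EC', 'PFAMs']

# nrows=0 parses only the header, so this is cheap even for a huge file
all_cols = pd.read_csv('myinfile.csv', sep='\t', nrows=0).columns
cols_to_keep = [c for c in all_cols if c not in cols_to_drop]
```

The resulting cols_to_keep list can then be passed straight to usecols in the chunked read.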