pandas.read_excel() is not sped up by "nrows=10"

With CSV files I sometimes use the nrows= parameter for debugging purposes and to "speed up" reading of the file.

I tested the same parameter with pandas.read_excel() on an Excel file with over 400k rows. Reading that file takes roughly 3 minutes and 20 seconds, whether I pass nrows=10 or omit nrows entirely.

The result is of course only 10 rows.

I assume this is because of the Excel file format, where it is not possible to physically skip or ignore lines while reading?

>Solution :

Parsing an XLSX file involves opening a ZIP (OOXML documents are zips of XML files), parsing some XML to find out what sheets there are, then parsing the particular sheet’s XML and interpreting the contents to figure out the contents of each cell, etc.
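To see the ZIP-of-XML structure for yourself, a minimal sketch using only the standard library's zipfile module could look like this. The helper name list_xlsx_parts is made up for illustration; it only lists the archive members and does not parse any XML.

```python
import zipfile


def list_xlsx_parts(path):
    """Return (name, uncompressed size in bytes) for each part in the archive.

    Hypothetical helper for illustration; `path` may be a file name or a
    file-like object, as accepted by zipfile.ZipFile.
    """
    with zipfile.ZipFile(path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]
```

Calling it on a workbook (say, a hypothetical "data.xlsx") should show entries such as xl/workbook.xml and xl/worksheets/sheet1.xml. On a large file, the sheet XML is usually by far the biggest part.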

That’s not quite as straightforward as opening a text file and only reading ten lines.

I might recommend reading the XLS(X) file once into a dataframe, and then e.g. pickling that dataframe for subsequent use. If you’re feeling fancy, you could write a function that invisibly does that for you (tries to look for a "cached" pickled version of your document(s)).
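A rough sketch of such a caching function might look like the following. The name read_excel_cached and the ".pkl" suffix next to the workbook are my own assumptions, not anything pandas provides; it relies only on pd.read_excel, DataFrame.to_pickle and pd.read_pickle.

```python
from pathlib import Path

import pandas as pd


def read_excel_cached(path, **kwargs):
    """Read an Excel file, reusing a pickled copy when it is still fresh.

    Hypothetical helper: the name and the ".pkl" naming scheme are
    illustrative choices, not part of pandas.
    """
    source = Path(path)
    cache = source.with_name(source.name + ".pkl")

    # Reuse the cache only if it is at least as new as the workbook.
    if cache.exists() and cache.stat().st_mtime >= source.stat().st_mtime:
        return pd.read_pickle(cache)

    df = pd.read_excel(source, **kwargs)  # slow path: full XLSX parse
    df.to_pickle(cache)
    return df
```

The first call still pays the full parsing cost; later calls load the pickle instead. Note that this sketch does not track the keyword arguments, so a cache written with one sheet_name or nrows would be returned for a later call with different ones. Delete the .pkl file when you change them, and only unpickle files you created yourself.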