Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

zipfile.BadZipFile: File is not a zip file when using "openpyxl" engine

I have created a script which dumps the excel sheets stored in S3 into my local postgres database. I’ve used pandas read_excel and ExcelFile method to read the excel sheets.
Code for the same can be found here.

import boto3
import pandas as pd
import io
import os
from sqlalchemy import create_engine
import xlrd

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxxxxxxxxxxxx"
s3 = boto3.client('s3')

obj = s3.get_object(Bucket='bucket-name', Key='file.xlsx')
data = pd.ExcelFile(io.BytesIO(obj['Body'].read()))
print(data.sheet_names)
a = len(data.sheet_names)

engine1 = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')
for i in range(a):
    df = pd.read_excel(io.BytesIO(obj['Body'].read()),sheet_name=data.sheet_names[i], engine='openpyxl')
    df.to_sql("test"+str(i), engine1, index=False)

Basically, code parses the S3 bucket and runs in a loop. For each sheet, it creates a table
and dumps the data from sheet in that table.

Where I’m having trouble is, when I run this code, I get this error.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df = pd.read_excel(io.BytesIO(obj['Body'].read()),sheet_name=data.sheet_names[i-1], engine='openpyxl')
zipfile.BadZipFile: File is not a zip file

This is coming after I added ‘openpyxl’ engine in read_excel method. When I remove the engine, I get this error.

raise ValueError(
ValueError: Excel file format cannot be determined, you must specify an engine manually.

Please note that I can print the connection to database, so there is no problem in connectivity, and I’m using latest version of python and pandas. Also, I can get all the sheet_names in the excel file so I’m able to reach to that file as well.

Many Thanks!

>Solution :

You are reading the obj twice, fully:

  1. data = pd.ExcelFile(io.BytesIO(obj['Body'].read()))
  2. pd.read_excel(io.BytesIO(obj['Body'].read()), ...)

Your object can only be .read() once, second read produce nothing, an empty b"".

In order to avoid re-reading the S3 stream many times, you could store it once in a BytesIO, and rewind that BytesIO with seek.

buf = io.BytesIO(obj["Body"].read())

pd.ExcelFile(buf)

buf.seek(0)

pd.read_excel(buf, ...)

# repeat
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading