How to scrape zip files into a single dataframe in Python

I am very new to web scraping and am trying to understand how to scrape all the zip files and regular files on this website. The end goal is to scrape all the data. I was originally thinking I could use pd.read_html, feed in a list of links, and loop through each zip file.

Any help at all would be very useful. I have tried a few examples so far; please see the code below:

import pandas as pd
pd.read_html("https://www.omie.es/en/file-access-list?parents%5B0%5D=/&parents%5B1%5D=Day-ahead%20Market&parents%5B2%5D=1.%20Prices&dir=%20Day-ahead%20market%20hourly%20prices%20in%20Spain&realdir=marginalpdbc",match="marginalpdbc_2017.zip")

So this is what I would like the output to look like, except that each zip file would need to be its own dataframe to work with and loop through. Currently, all it seems to do is return the names of the zip files, not the actual data.


Thank you

>Solution :

To open a zip file and read the files it contains into a dataframe, you can use the following example:

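import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile

zip_url = "https://www.omie.es/es/file-download?parents%5B0%5D=marginalpdbc&filename=marginalpdbc_2017.zip"

# Download the archive and fail early on an HTTP error
response = requests.get(zip_url)
response.raise_for_status()

dfs = []
# Open the archive in memory, without writing it to disk
with ZipFile(BytesIO(response.content)) as zf:
    for file in zf.namelist():
        # Each member is a ";"-separated text file; skip its first and last lines
        df = pd.read_csv(
            zf.open(file),
            sep=";",
            skiprows=1,
            skipfooter=1,
            engine="python",  # skipfooter is only supported by the python engine
            header=None,
        )
        dfs.append(df)

# Stack the per-file frames into one dataframe
final_df = pd.concat(dfs)

# print first 10 rows (to_markdown requires the tabulate package):
print(final_df.head(10).to_markdown(index=False))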

Prints:

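|    0 |   1 |   2 |   3 |     4 |     5 |   6 |
|-----:|----:|----:|----:|------:|------:|----:|
| 2017 |   1 |   1 |   1 | 58.82 | 58.82 | nan |
| 2017 |   1 |   1 |   2 | 58.23 | 58.23 | nan |
| 2017 |   1 |   1 |   3 | 51.95 | 51.95 | nan |
| 2017 |   1 |   1 |   4 | 47.27 | 47.27 | nan |
| 2017 |   1 |   1 |   5 | 46.9  | 45.49 | nan |
| 2017 |   1 |   1 |   6 | 46.6  | 44.5  | nan |
| 2017 |   1 |   1 |   7 | 46.25 | 44.5  | nan |
| 2017 |   1 |   1 |   8 | 46.1  | 44.72 | nan |
| 2017 |   1 |   1 |   9 | 46.1  | 44.22 | nan |
| 2017 |   1 |   1 |  10 | 45.13 | 45.13 | nan |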
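
The question also asks for each zip file to become its own dataframe. A minimal sketch of one way to do that is below. It reuses the same read_csv settings as the example above and stores one dataframe per archive in a dict. The helper names read_price_zip and download_all, the URL_TEMPLATE constant, and the idea that other years follow the same URL pattern as the 2017 link are all assumptions for illustration, not something confirmed by the site.

```python
from io import BytesIO
from zipfile import ZipFile

import pandas as pd

# Assumption: archives for other years use the same URL pattern as the 2017 link.
URL_TEMPLATE = (
    "https://www.omie.es/es/file-download"
    "?parents%5B0%5D=marginalpdbc&filename=marginalpdbc_{year}.zip"
)


def read_price_zip(zip_bytes):
    """Parse every file inside one zip archive into a single dataframe.

    Hypothetical helper; it assumes each member has the same layout as the
    2017 files: ';'-separated, one header line, one footer line.
    """
    frames = []
    with ZipFile(BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            with zf.open(name) as fh:
                frames.append(
                    pd.read_csv(
                        fh,
                        sep=";",
                        skiprows=1,
                        skipfooter=1,
                        engine="python",
                        header=None,
                    )
                )
    return pd.concat(frames, ignore_index=True)


def download_all(years, url_template=URL_TEMPLATE):
    """Return {year: dataframe}, one dataframe per downloaded zip (hypothetical helper)."""
    import requests  # imported here so read_price_zip can be used offline

    frames_by_year = {}
    for year in years:
        response = requests.get(url_template.format(year=year), timeout=30)
        response.raise_for_status()  # stop on a missing or blocked archive
        frames_by_year[year] = read_price_zip(response.content)
    return frames_by_year
```

With this in place, something like frames = download_all(range(2017, 2020)) would give frames[2017], frames[2018], and so on as separate dataframes to loop through, and pd.concat(frames.values()) would combine them into one if that is what you need in the end.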