Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Read and tabulate table-like data from website

I want to tabulate and store into Pandas this linked data from the U.S. Weather Service.

Here are the first four lines of the webpage.

Historic Crests

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

(1) 43.28 ft on 04/17/1979

(2) 39.58 ft on 05/25/1983

(3) 36.67 ft on 02/17/2020

You can access the data from an IDE or notebook using the following code.

import bs4
import urllib.request
link = "https://water.weather.gov/ahps2/crests.php?wfo=jan&gage=jacm6&crest_type=historic"

webpage=str(urllib.request.urlopen(link).read())
soup = bs4.BeautifulSoup(webpage)

print(soup.get_text())

The data is already in a table-like structure, and I think you could tabulate it by parsing it into a dictionary of lists and then uploading it into a Pandas Dataframe. However, I imagine that there is a more simple pythonic approach.

Here’s a snippet of the desired table structure.

No. Crest Date
1 43.28 04/17/1979
2 39.58 05/25/1983
3 36.67 02/17/2020

>Solution :

You could create a regex pattern and feed it to Series.str.extractall. E.g.:

tbl = soup.find('div', class_='water_information')
vals = tbl.get_text().split(r'\n')

df = pd.Series(vals).str.extractall(r'\((?P<No>\d+)\)\s(?P<Crest>\d+.\d+)\sft\son\s(?P<Date>\d{2}\/\d{2}\/\d{4})')\
    .reset_index(drop=True)

print(df)

    No  Crest        Date
0    1  43.28  04/17/1979
1    2  39.58  05/25/1983
2    3  36.67  02/17/2020
3    4  36.30  03/31/1902
4    5  36.30  12/05/1880
..  ..    ...         ...
86  87  30.30  04/20/1955
87  88  30.26  01/25/1983
88  89  30.26  04/07/1981
89  90  30.25  01/24/2020
90  91  30.23  12/23/1967

And maybe change the dtypes at this point:

df['No'] = df.No.astype(int)
df['Date'] = pd.to_datetime(df.Date)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading