Trying to web scrape text from a table on a website

I am a novice at this, but I’ve been trying to scrape data on a website (https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA) but I keep coming up empty. I’ve tried BeautifulSoup and Scrapy but I can’t get the text out.

Eventually I want to get the row of each individual wine in the table into a dataframe/csv (from all pages) but currently I can’t even get the first wine producer name.

If you inspect the webpage, all the details are in <td> tags with no id or class.

My BeautifulSoup attempt

import requests
from bs4 import BeautifulSoup

URL = 'https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) \
Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.52"}

page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, "html.parser")
soup2 = soup.prettify()

producer = soup2.find_all('td').get_text()

print(producer)

Which is throwing the error:

producer = soup2.find_all('td').get_text()
AttributeError: 'str' object has no attribute 'find_all'

My Scrapy attempt

import pandas as pd
import scrapy
from scrapy import Selector

winedf = pd.DataFrame()

class WineSpider(scrapy.Spider):
    name = 'wine_spider'

    def start_requests(self):
        dwwa_url = "https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA"
        yield scrapy.Request(url=dwwa_url, callback=self.parse_front)

    def parse_front(self, response):
        table = response.xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table')
        page_links = table.xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/div[2]/div[1]/ul/li[3]/a(@class,\
        "dwwa-page-link") @href')
        links_to_follow = page_links.extract()
        for url in links_to_follow:
            yield response.follow(url=url, callback=self.parse_pages)

    def parse_pages(self, response):
        wine_name = Selector(response=response).xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table/tbody/\
        tr[1]/td[1]/text()').get()
        wine_name_ext = wine_name.extract().strip()
        winedf.append(wine_name_ext)
        medal = Selector(response=response).xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table/tbody/tr[1]/\
        td[4]/text()').get()
        medal_ext = medal.extract().strip()
        winedf.append(medal_ext)

Which produces an empty df.

Any help would be greatly appreciated.

Thank you!

Solution:

Try:

import pandas as pd

url = "https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA"
df = pd.read_json(url)

# print last items in df:
print(df.tail().to_markdown())

Prints:

|       | producer            | name                          |     id | competition | award | score | country | region         | subRegion           | vintage | color | style                                                 | priceBandLetter | competitionYear | competitionType |
|------:|:--------------------|:------------------------------|-------:|:------------|------:|------:|:--------|:---------------|:--------------------|--------:|:------|:------------------------------------------------------|:----------------|----------------:|:----------------|
| 14853 | Telavi Wine Cellar  | Marani                        | 718257 | DWWA 2022   |     7 |    86 | Georgia | Kakheti        | Kindzmarauli        |    2021 | Red   | Still – Medium (between 19 and 44 g/L residual sugar) | B               |            2022 | DWWA            |
| 14854 | Štrigova            | Muškat Žuti                   | 716526 | DWWA 2022   |     7 |    87 | Croatia | Continental    | Zagorje – Međimurje |    2021 | White | Still – Medium (between 19 and 44 g/L residual sugar) | C               |            2022 | DWWA            |
| 14855 | Kopjar              | Muscat žUti                   | 717754 | DWWA 2022   |     7 |    86 | Croatia | Continental    | Zagorje – Međimurje |    2021 | White | Still – Medium (between 19 and 44 g/L residual sugar) | C               |            2022 | DWWA            |
| 14856 | Cleebronn-Güglingen | Blanc De Noir Fein & Fruchtig | 719836 | DWWA 2022   |     7 |    87 | Germany | Württemberg    | Not Applicable      |    2021 | White | Still – Medium (between 19 and 44 g/L residual sugar) | B               |            2022 | DWWA            |
| 14857 | Winnice Czajkowski  | Thoma 8 Grand Selection       | 719891 | DWWA 2022   |     6 |    90 | Poland  | Not Applicable | Not Applicable      |    2021 | White | Still – Medium (between 19 and 44 g/L residual sugar) | D               |            2022 | DWWA            |
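Since the API endpoint appears to return the full result set in one response (the tail above shows 14,000+ rows), getting everything into the CSV the question asked for is one more line. A minimal sketch of that step; the stand-in rows below are hypothetical so it runs offline, whereas in the real script df would come from pd.read_json on the API URL above:

```python
import pandas as pd

# in the real script:
# df = pd.read_json("https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA")
# hypothetical stand-in rows with a few of the same columns, to show the CSV step offline
df = pd.DataFrame(
    {"producer": ["Telavi Wine Cellar"], "name": ["Marani"], "score": [86]}
)

# one file covering every wine -- no per-page scraping needed, since the API
# is not paginated the way the rendered site is
df.to_csv("dwwa_2022_wines.csv", index=False)
```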