Scraping content from what appear to be identical HTML elements

Python: Python 3.11.2
Python Editor: PyCharm 2022.3.3 (Community Edition) – Build PC-223.8836.43
OS: Windows 11 Pro, 22H2, 22621.1413
Browser: Chrome 111.0.5563.65 (Official Build) (64-bit)


I’m trying to scrape data from the page at https://dockets.justia.com/docket/puerto-rico/prdce/3:2023cv01127/175963. The table cells I need seem to have no attributes that distinguish them from one another. Would I have to match on the text strings instead? Is that even possible?

In the example provided here, I’m trying to scrape both the "case_plaintiff" and "case_defendant" but they have identical attributes.

See:

<tbody>
                    <tr class="zebra -zb table-bordered table-padding-10">
            <th align="left" width="140" class="-hide-tablet width-20">Plaintiff:</th>
            <td data-th="Plaintiff" class="has-no-border">Government of Puerto Rico, ELI LILLY EXPORT S. A. and Sanofi-Aventis Puerto Rico Inc.</td>
        </tr>
                <tr class="zebra -zb table-bordered table-padding-10">
            <th align="left" width="140" class="-hide-tablet width-20">Defendant:</th>
            <td data-th="Defendant" class="has-no-border">Eli Lilly and Company, Novo Nordisk, Inc., Sanofi-Aventis U. S. LLC, Express Scripts, Inc., CAREMARKPCS HEALTH LLC, Caremark Puerto Rico and OPTUMRX INC.</td>
        </tr>
                    <tr class="zebra -zb table-bordered table-padding-10">
            <th align="left" width="140" class="-hide-tablet width-20">Case Number:</th>
            <td data-th="Case Number" class="has-no-border">3:2023cv01127</td>
        </tr>
            <tr class="zebra -zb table-bordered table-padding-10">
            <th align="left" width="140" class="-hide-tablet width-20">Filed:</th>
            <td data-th="Filed" class="has-no-border">March 17, 2023</td>
        </tr>
                <tr class="zebra -zb table-bordered table-padding-10">
            <th align="left" width="140" class="-hide-tablet width-20">Court:</th>
            <td data-th="Court" class="has-no-border">US District Court for the District of Puerto Rico</td>
        </tr>
            
                <tr class="zebra -zb table-bordered table-padding-10">
            <th align="left" width="140" class="-hide-tablet width-20">Nature of Suit:</th>
            <td data-th="Nature of Suit" class="has-no-border">Other Statutory Actions</td>
        </tr>
    
            <tr class="zebra -zb table-bordered table-padding-10">
            <th align="left" width="140" class="-hide-tablet width-20">Cause of Action:</th>
            <td data-th="Cause of Action" class="has-no-border">28 U.S.C. § 1441 Notice of Removal</td>
        </tr>
    
            <tr class="zebra -zb table-bordered table-padding-10">
            <th align="left" width="140" class="-hide-tablet width-20">Jury Demanded By:</th>
            <td data-th="Jury Demanded By" class="has-no-border">None</td>
        </tr>
        </tbody>

What does one do in such cases? I don’t really see anything about that issue in any of the books I have.

This is the script that I have that highlights the issue I’m having:

from bs4 import BeautifulSoup
import requests

html_text = requests.get("https://dockets.justia.com/docket/puerto-rico/prdce/3:2023cv01127/175963").text
soup = BeautifulSoup(html_text, "lxml")
cases = soup.find_all("div", class_ = "wrapper jcard has-padding-30 blocks has-no-bottom-padding")

for case in cases:
    case_title = case.find("div", class_ = "title-wrapper").text.replace(" "," ")
    case_plaintiff = case.find("td", class_ = "has-no-border").text.replace(" "," ")
    # this is the line that's causing me a problem
    case_defendant = case.find("td", class_ = "has-no-border").text.replace(" "," ")


    print(f"Case Title: {case_title.strip()}")
    print(f"Plaintiffs: {case_plaintiff.strip()}")
    # and here
    print(f"Defendants: {case_defendant.strip()}")

Solution:

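find() returns only the first matching element, and every cell in the table shares the class has-no-border, so both lookups in your script return the plaintiff cell. The cells do differ in their data-th attribute, so you can select on that instead: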

case_plaintiff = case.find("td", {"data-th": "Plaintiff"}).text
case_defendant = case.find("td", {"data-th": "Defendant"}).text
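To try this without hitting the site, here is a minimal sketch that runs against a trimmed copy of the table from the question (party names shortened, only three rows kept). It uses the built-in "html.parser" rather than "lxml" purely to avoid an extra dependency. The names SAMPLE_HTML, fields and value_for_label are made up for this illustration. It also shows a string-based fallback, since the question asks whether matching on text is possible: find the th by its label and step to the td next to it.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (hypothetical sample data,
# party names shortened) so that no network request is needed.
SAMPLE_HTML = """
<table><tbody>
  <tr class="zebra -zb table-bordered table-padding-10">
    <th align="left" width="140" class="-hide-tablet width-20">Plaintiff:</th>
    <td data-th="Plaintiff" class="has-no-border">Government of Puerto Rico</td>
  </tr>
  <tr class="zebra -zb table-bordered table-padding-10">
    <th align="left" width="140" class="-hide-tablet width-20">Defendant:</th>
    <td data-th="Defendant" class="has-no-border">Eli Lilly and Company</td>
  </tr>
  <tr class="zebra -zb table-bordered table-padding-10">
    <th align="left" width="140" class="-hide-tablet width-20">Case Number:</th>
    <td data-th="Case Number" class="has-no-border">3:2023cv01127</td>
  </tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() stops at the first match, so selecting by class alone always
# yields the plaintiff cell, which is the problem in the question.
first_cell = soup.find("td", class_="has-no-border").get_text(strip=True)

# Option 1: collect every labelled cell into a dict keyed by data-th.
# attrs={"data-th": True} matches any <td> that has the attribute at all.
fields = {
    td["data-th"]: td.get_text(strip=True)
    for td in soup.find_all("td", attrs={"data-th": True})
}


def value_for_label(soup, label):
    """Option 2: match the <th> by its exact text, then read the next <td>."""
    th = soup.find("th", string=label)
    if th is None:
        return None
    td = th.find_next_sibling("td")
    return td.get_text(strip=True) if td is not None else None


print(f"Plaintiffs: {fields['Plaintiff']}")
print(f"Defendants: {fields['Defendant']}")
print(f"Case Number: {value_for_label(soup, 'Case Number:')}")
```

The dict approach is probably the more convenient of the two if you want several fields from the same table, since one pass picks up all of them. The label-based lookup is only worth reaching for when a page has no distinguishing attribute at all, and it requires the label text to match exactly, trailing colon included.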
