Parse HTML table for specific content in one column and print resulting table to file with python

I have a file test_input.htm with a table:

    <table>
          <thead>
               <tr>
                    <th>Acronym</th>
                    <th>Full Term</th>
                    <th>Definition</th>
                    <th>Product </th>
                </tr>
         </thead>
         <tbody>
                <tr>
                    <td>a1</td>
                    <td>term</td>
                    <td>
                        <p>texttext.</p>
                        <p>Source: PRISMA-GLO</p>
                    </td>
                    <td>
                        <p>PRISMA</p>
                        <p>SDDS-NG</p>
                    </td>
                </tr>
                <tr>
                    <td>a2</td>
                    <td>term</td>
                    <td>
                        <p>texttext.</p>
                        <p>Source: PRISMA-GLO</p>
                    </td>
                    <td>
                        <p>PRISMA</p>
                    </td>
                </tr>
                <tr>
                    <td>a3</td>
                    <td>term</td>
                    <td>
                        <p>texttext.</p>
                        <p>Source: PRISMA-GLO</p>
                    </td>
                    <td>
                        <p>SDDS-NG</p>
                    </td>
                </tr> 
                <tr>
                    <td>a4</td>
                    <td>term</td>
                    <td>
                        <p>texttext.</p>
                        <p>Source: SD-GLO</p>
                    </td>
                    <td>
                        <p>SDDS-NG</p>
                    </td>
                </tr>         
           </tbody>
    </table>

I would like to write to test_output.htm only the table rows that contain the keyword PRISMA in column 4 (Product).

The following script gives me all table rows that contain the keyword PRISMA in any of the 4 columns:

from bs4 import BeautifulSoup

# Read and parse the input file; the with-block closes the file afterwards
with open('test_input.htm') as file_input:
    results = BeautifulSoup(file_input.read(), 'html.parser')
inhalte = results.find_all('tr')

with open('test_output.htm', 'a') as f:
    data = [[td.findChildren(text=True) for td in inhalte]]  # not used below
    for line in inhalte:  # for every row in the table
        if 'PRISMA' in line.get_text():  # keyword found in any column of the row
            f.write("%s\n" % str(line))

I tried hard but could not figure out how to restrict the search to column 4.
The following did not work:

data = [[td.findChildren(text=True) for td in tr.findAll('td')[4]] for tr in inhalte]  

I would really appreciate it if someone could help me find the solution.

>Solution :

Use a more specific selector so that you get only the elements you expect, for example a CSS selector. The following line selects only those tr elements whose fourth td contains PRISMA (nth-of-type counts from 1, so 4 is the Product column):

soup.select('table tr:has(td:nth-of-type(4):-soup-contains("PRISMA"))')

Example

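from bs4 import BeautifulSoup

# Read and parse the input file; the with-block closes the file afterwards
with open('test_input.htm') as file_input:
    soup = BeautifulSoup(file_input.read(), 'html.parser')

# Mode 'a' appends, so running the script twice writes the matching rows twice
with open('test_output.htm', 'a') as f:
    # Keep only rows whose fourth cell contains the keyword
    for line in soup.select('table tr:has(td:nth-of-type(4):-soup-contains("PRISMA"))'):
        f.write("%s\n" % str(line))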
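As for why the attempt in the question failed: Python lists are indexed from 0, so the fourth cell is findAll('td')[3], and [4] raises an IndexError on a row with four cells. The header row also holds th cells rather than td, so it has no td cells to index at all. If you prefer to avoid CSS selectors, the same filter can be sketched with plain indexing. This is only a minimal sketch: the helper name rows_with_keyword, its column parameter, and the shortened sample table are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup


def rows_with_keyword(html, keyword, column=3):
    """Return the <tr> elements whose cell at zero-based index `column` contains `keyword`.

    Hypothetical helper for illustration; column=3 is the fourth column (Product).
    """
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")  # header rows use <th>, so they yield no cells
        # Skip rows that are too short, then do a substring check on the cell text
        if len(cells) > column and keyword in cells[column].get_text():
            matches.append(row)
    return matches


# Shortened version of the table from the question (assumed sample data)
sample = """
<table>
  <thead><tr><th>Acronym</th><th>Full Term</th><th>Definition</th><th>Product</th></tr></thead>
  <tbody>
    <tr><td>a1</td><td>term</td><td><p>Source: PRISMA-GLO</p></td><td><p>PRISMA</p><p>SDDS-NG</p></td></tr>
    <tr><td>a3</td><td>term</td><td><p>Source: PRISMA-GLO</p></td><td><p>SDDS-NG</p></td></tr>
  </tbody>
</table>
"""

matched = rows_with_keyword(sample, "PRISMA")
acronyms = [row.find("td").get_text() for row in matched]
print(acronyms)
```

Row a3 mentions PRISMA only in its Definition column, so it is skipped. Like -soup-contains, this is a substring check, so a product named PRISMA-X would match as well.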