Python BeautifulSoup find_all() method returns unnecessary elements

I’m having trouble scraping elements with the find_all() method.

I am looking for the <li class='list-row'>.....</li> tag, but after scraping, find_all() also returns <li class='list-row reach-list'> tags, which carry additional classes.

I tried the select() method too.

Here’s the Python code:

from bs4 import BeautifulSoup

# Read the saved page and parse it
with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'html.parser')
main_block = soup.find('ul', class_='list')
# class_='list-row' also matches <li class="list-row reach-list">
for li in main_block.find_all('li', class_='list-row'):
    print(li.prettify())

Here’s the HTML file (index.html):

<ul class="list">
 <li class="list-row">
  <h2>
   <a href="/praca/emis/O4533184" id="offer4533184">
    <span class="title">
     Senior Developer (HTML, React, VUE.js, C#, SQL)
    </span>
   </a>
  </h2>
 </li>
 <li class="list-row reach-list">
  <ul class="list">
    <span class="employer">
     IT lions consulting a.s.
    </span>
   </li>
  </ul>
 </li>
</ul>

Solution:

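class_='list-row' matches every tag that has list-row among its classes, so <li class="list-row reach-list"> matches too. To narrow the result, you can select only the <li> tags that contain an <h2> element, for example: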

from bs4 import BeautifulSoup

html_doc = '''\
<ul class="list">
 <li class="list-row">
  <h2>
   <a href="/praca/emis/O4533184" id="offer4533184">
    <span class="title">
     Senior Developer (HTML, React, VUE.js, C#, SQL)
    </span>
   </a>
  </h2>
 </li>
 <li class="list-row reach-list">
  <ul class="list">
    <span class="employer">
     IT lions consulting a.s.
    </span>
   </li>
  </ul>
 </li>
</ul>'''

soup = BeautifulSoup(html_doc, 'html.parser')

for li in soup.select('.list-row:has(h2)'):
    print(li)

Prints:

<li class="list-row">
<h2>
<a href="/praca/emis/O4533184" id="offer4533184">
<span class="title">
     Senior Developer (HTML, React, VUE.js, C#, SQL)
    </span>
</a>
</h2>
</li>

Or, to select only the <li> tags that contain a title, use the selector '.list-row:has(.title)'.
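
If the goal is to match on the class attribute itself rather than on the child elements, the following is a minimal sketch of two other options. It assumes a simplified version of the HTML above (the shortened markup and the helper name has_only_list_row are made up for illustration): a filter function that accepts only tags whose class list is exactly ['list-row'], and a CSS selector that excludes the unwanted class with :not().

```python
from bs4 import BeautifulSoup

# Simplified stand-in for index.html (assumption, not the original file)
html_doc = '''\
<ul class="list">
 <li class="list-row"><h2><span class="title">Senior Developer</span></h2></li>
 <li class="list-row reach-list"><span class="employer">IT lions consulting a.s.</span></li>
</ul>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# class_='list-row' matches any <li> that has list-row among its classes
loose = soup.find_all('li', class_='list-row')


def has_only_list_row(tag):
    # Hypothetical helper: accept <li> tags whose class list is exactly ['list-row']
    return tag.name == 'li' and tag.get('class') == ['list-row']


exact = soup.find_all(has_only_list_row)

# CSS alternative: keep list-row items that do not also have reach-list
excluded = soup.select('li.list-row:not(.reach-list)')

print(len(loose), len(exact), len(excluded))  # prints: 2 1 1
```

The filter function is the stricter of the two, since it rejects any extra class, while the :not() selector only rejects the one class that is named.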
