Dynamically find distance between a text and a nearest div tag using python and BeautifulSoup

I want to parse many HTML pages and remove the div that contains the text "Message", using Python and BeautifulSoup with html.parser. The div has no name or id, so I cannot select it directly. I can do this for one HTML page. The code below chains six .parent calls because, in that page, five tags (p, i, b, span, a) sit between the div and the text "Message", and the sixth parent is the div itself.

soup = BeautifulSoup(html_page,"html.parser")
scores = soup.find_all(text=re.compile('Message'))
divs = [score.parent.parent.parent.parent.parent.parent for score in scores]
divs.decompose()

The problem is that the number of tags between the div and "Message" is not always the same. In some HTML pages it is 3, in others 7. Is there a way to dynamically find the number of tags (n) between the text "Message" and the nearest enclosing div, and then apply n+1 .parent steps to score (in the code above) using Python and BeautifulSoup?

Thanks in Advance.


Solution:

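Assuming, as your question describes, that no other <div> sits between the text and the div you want to remove, you can use .find_parent(), which walks up the tree and returns the nearest matching ancestor: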

soup.find(text=re.compile('Message')).find_parent('div').decompose()

Be aware that find_all() returns a ResultSet, so you have to iterate over it and call .find_parent() on each result:

for r in soup.find_all(text=re.compile('Message')):
    r.find_parent('div').decompose()

The same applies to divs.decompose() in your example: divs is a list, so you have to iterate over it and call decompose() on each element.
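The loop above can fail when a match has no enclosing <div> (find_parent() then returns None) or when two matches share the same div. The following is only a sketch of one way to guard against both cases; the function name remove_divs_containing is hypothetical, and it uses string=, the current name of the text= argument used above.

```python
import re
from bs4 import BeautifulSoup


def remove_divs_containing(html, pattern):
    """Hypothetical helper: drop the nearest <div> around each matching text."""
    soup = BeautifulSoup(html, "html.parser")

    # Collect the target divs first, so the tree is not modified while searching.
    targets = []
    for node in soup.find_all(string=re.compile(pattern)):
        div = node.find_parent("div")  # None if the text has no <div> ancestor
        # Compare by identity: two matches may live in the same div.
        if div is not None and all(div is not t for t in targets):
            targets.append(div)

    # Keep only divs that are not nested inside another target div,
    # because removing the outer div already removes the inner one.
    outermost = [
        d for d in targets
        if not any(p is t for p in d.parents for t in targets)
    ]
    for div in outermost:
        div.decompose()
    return str(soup)


page = '<div id="a"><p><b>Message</b></p></div><div id="b"><span>Keep</span></div>'
cleaned = remove_divs_containing(page, "Message")
```

Collecting the divs before removing any of them keeps the search and the modification separate, so a text node is never inspected after its div has been destroyed.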

Example

from bs4 import BeautifulSoup
import re
html='''
<div>
    <span>
        <i>
            <x>Message</x>
        </i>
    </span>
</div>
'''
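soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly, as in the question

# Prints the nearest enclosing <div>, however many tags lie in between
print(soup.find(text=re.compile('Message')).find_parent('div'))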
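If the actual number of .parent steps is of interest, as the question asks, it can be counted by walking the .parents generator. This is a minimal sketch on the same sample markup; the helper name levels_to_nearest_div is hypothetical and not part of BeautifulSoup.

```python
import re
from bs4 import BeautifulSoup

html = """
<div>
    <span>
        <i>
            <x>Message</x>
        </i>
    </span>
</div>
"""


def levels_to_nearest_div(node):
    """Return how many .parent steps lead from node to its nearest <div>, or None."""
    for steps, ancestor in enumerate(node.parents, start=1):
        if ancestor.name == "div":
            return steps
    return None  # the text has no <div> ancestor


soup = BeautifulSoup(html, "html.parser")
node = soup.find(string=re.compile("Message"))
steps = levels_to_nearest_div(node)  # x, i, span, then div
print(steps)
```

In practice the count is rarely needed, since find_parent('div') reaches the same element directly.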