BeautifulSoup extract base text

I have a div that looks somewhat like

<div>
    " Base Text "
    <span> 
        " Inner Text "
    </span>
    " Outer Base Text "
</div>

I want to extract only the text that is not inside the div's children (the immediate text). In this example, the immediate text is " Base Text " and " Outer Base Text ".

Is there a direct way (such as a BeautifulSoup function) to get only the div's own text and ignore the contents of its children?


Solution:

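There is no dedicated BeautifulSoup method that returns only a tag's immediate text. One approach is to iterate over all of the div's strings and use a list comprehension to keep only those whose parent is the div itself: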

from bs4 import BeautifulSoup

html_content = '''
<div>
    Base Text
    <span>
        Inner Text
    </span>
    Outer Base Text
</div>
'''

soup = BeautifulSoup(html_content, 'html.parser')

div = soup.find('div')

# Keep only the strings whose direct parent is the div, skipping text nested in children
text = ''.join(str(s) for s in div.strings if s.parent is div)
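The joined result above still contains the newlines and indentation from the original markup. As an alternative sketch (not part of the original answer), `find_all` can be limited to the div's direct children with `recursive=False`, and `string=True` makes it match text nodes instead of tags. The name `immediate_text` is only illustrative, and the `string` argument assumes a reasonably recent Beautiful Soup 4 release (older versions call it `text`).

```python
from bs4 import BeautifulSoup

html_content = """
<div>
    Base Text
    <span>
        Inner Text
    </span>
    Outer Base Text
</div>
"""

soup = BeautifulSoup(html_content, "html.parser")
div = soup.find("div")

# recursive=False restricts the search to direct children of the div;
# string=True matches text nodes rather than tags.
direct_strings = div.find_all(string=True, recursive=False)

# Strip surrounding whitespace and drop nodes that are whitespace only.
immediate_text = [s.strip() for s in direct_strings if s.strip()]

print(immediate_text)  # ['Base Text', 'Outer Base Text']
```

Keeping the pieces as a list preserves the boundary between the text before and after the span; join them with a separator of your choice if a single string is needed.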