BeautifulSoup extract base text

May 23, 2024

I have a div that looks something like this:

<div>
    " Base Text "
    <span> 
        " Inner Text "
    </span>
    " Outer Base Text "
</div>

I want to extract only the text that is not inside the div's children (the immediate text). In this example, the immediate text is " Base Text " and " Outer Base Text ".

Is there a direct way (such as a BeautifulSoup function) to get only the div's own text and ignore the contents of its child tags?

>Solution :

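BeautifulSoup has no dedicated method that returns only a tag's immediate text. One approach is to iterate over all of the tag's strings and use a list comprehension to keep only those whose parent is the div itself: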

from bs4 import BeautifulSoup

html_content = '''
<div>
    Base Text
    <span> 
        Inner Text
    </span>
    Outer Base Text
</div>
'''

soup = BeautifulSoup(html_content, 'html.parser')

div = soup.find('div')

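# Keep only the strings whose parent is the div itself, skipping text nested in child tags.
# The result still contains the original whitespace and newlines; call .strip() if needed.
text = ''.join(s for s in div.strings if s.parent is div)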
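A shorter variant of the same idea, as a minimal sketch: find_all accepts string=True (match text nodes) and recursive=False (look at direct children only), so the two together should return just the div's immediate text nodes. The helper name immediate_text is made up for illustration and is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html_content = """
<div>
    Base Text
    <span>
        Inner Text
    </span>
    Outer Base Text
</div>
"""


def immediate_text(tag):
    """Return the stripped, non-empty text nodes that are direct children of tag.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    # string=True matches text nodes instead of tags;
    # recursive=False restricts the search to direct children.
    direct_strings = tag.find_all(string=True, recursive=False)
    return [s.strip() for s in direct_strings if s.strip()]


soup = BeautifulSoup(html_content, "html.parser")
div = soup.find("div")
parts = immediate_text(div)
print(parts)
```

This returns the pieces as a list, so you can join them with whatever separator you need, for example ' '.join(parts).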