BeautifulSoup – HTML parsing not working as expected

February 24, 2022

I’m using Python 3 and the BeautifulSoup module, version 4.9.3. I’m trying to use this package to practise parsing some simple HTML.

The string I have is the following:

text = '''<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>'''

I use BeautifulSoup as follows:

from bs4 import BeautifulSoup

x = BeautifulSoup(text, "html.parser")

I then experiment with Beautiful Soup’s functionality with the following script:

for li in x.find_all('li'):
    print(li)
    print(li.string)
    print(li.next_element)
    print(li.next_element.string)
    print("\n")

The results (at least for the first iteration) are unexpected:

<li><p>Some text</p>is put here</li>
None
<p>Some text</p>
Some text


<li><p>And other text is put here</p></li>
And other text is put here
<p>And other text is put here</p>
And other text is put here

Why is the string attribute of the first li tag None, whereas the string attribute of the inner p tag is not None?

Similarly, if I do:

import re

x.find_all('li', string=re.compile('text'))

I get only one result (the second li tag).

But if I do:

for li in x.find_all('li'):
    print(li.find_all(string=re.compile('text')))

I get 2 results (both tags).

Solution:

Paraphrasing the Beautiful Soup documentation on .string:

If a tag has only one child, and that child is a NavigableString, the child is made available as .string.

If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child.

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None.
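The three rules can be checked with a short script. This is only a sketch: it reuses the HTML from the question, and the variable names (one_string, nested, mixed and so on) are made up for illustration. It also shows that passing string= together with a tag name appears to compare the pattern against each tag's .string, which would explain why only the second li matches.

```python
import re
from bs4 import BeautifulSoup

text = '''<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>'''
soup = BeautifulSoup(text, "html.parser")
first_li, second_li = soup.find_all("li")

# Rule 1: the only child is a NavigableString, so .string is that string.
one_string = first_li.p.string
print(one_string)

# Rule 2: the only child is a tag that has a .string, so the parent inherits it.
nested = second_li.string
print(nested)

# Rule 3: more than one child (a <p> tag plus loose text), so .string is None.
mixed = first_li.string
print(mixed)

# To read all the text of a tag with several children, iterate over
# .strings or call get_text() instead of relying on .string.
all_strings = list(first_li.strings)
joined = first_li.get_text(" ")
print(all_strings)
print(joined)

# With a tag name plus string=, the pattern is tested against tag.string,
# so the first <li> (whose .string is None) is skipped.
by_tag_string = soup.find_all("li", string=re.compile("text"))
print(len(by_tag_string))

# With string= alone, the search returns matching text nodes among the
# descendants, so both <li> tags produce a hit.
per_li = [li.find_all(string=re.compile("text")) for li in (first_li, second_li)]
print(per_li)
```

If the goal is to select every li whose text contains a word, filtering on get_text() (for example in a list comprehension) avoids the .string rules entirely.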

Let’s apply these rules to your question: