BeautifulSoup – HTML parsing not working as expected

I’m using Python 3 and the BeautifulSoup module, version 4.9.3. I’m trying to use this package to practise parsing some simple HTML.

The string I have is the following:

text = '''<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>'''

I use BeautifulSoup as follows:

from bs4 import BeautifulSoup
x = BeautifulSoup(text, "html.parser")

I then experiment with Beautiful Soup’s functionality with the following script:

for li in x.find_all('li'):
    print(li)
    print(li.string)
    print(li.next_element)
    print(li.next_element.string)
    print("\n")

The results (at least for the first iteration) are unexpected:

<li><p>Some text</p>is put here</li>
None
<p>Some text</p>
Some text


<li><p>And other text is put here</p></li>
And other text is put here
<p>And other text is put here</p>
And other text is put here

Why is the string attribute of the first li tag None, whereas the string attribute of the inner p tag is not None?

Similarly, if I do:

import re
x.find_all('li', string=re.compile('text'))

I only get one result (the 2nd tag).

But if I do:

for li in x.find_all('li'):
    print(li.find_all(string=re.compile('text')))

I get 2 results (both tags).

Solution:

Paraphrasing the Beautiful Soup documentation for .string:

  1. If a tag has only one child, and that child is a NavigableString, the child is made available as .string.
  2. If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child.
  3. If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None.
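
The three rules can be checked directly against the markup from the question. This is a minimal sketch; the variable names (html, soup, first_li, second_li and so on) are illustrative and not part of the original post.

```python
from bs4 import BeautifulSoup

# Same markup as in the question.
html = "<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>"
soup = BeautifulSoup(html, "html.parser")
first_li, second_li = soup.find_all("li")

# Rule 1: the <p> has exactly one child, a NavigableString.
p_string = first_li.p.string

# Rule 2: the second <li> has exactly one child, a <p> that itself has a .string.
second_string = second_li.string

# Rule 3: the first <li> has two children (a <p> and a bare string),
# so .string is None.
first_string = first_li.string
first_children = len(first_li.contents)

print(repr(p_string))       # 'Some text'
print(repr(second_string))  # 'And other text is put here'
print(first_string)         # None
print(first_children)       # 2
```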

Let’s apply these rules to your question:

Why is the string attribute of the first li tag None, whereas the string attribute of the inner p tag is not None?

The inner p tag satisfies rule #1; it has exactly one child, and that child is a NavigableString, so .string returns that child.

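The first li satisfies rule #3; it has two children (the p tag and the bare string "is put here"), so .string would be ambiguous and is None.

The second li satisfies rule #2; its only child is a p tag that has a .string, so the li reports the same .string as that p.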
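
The same rules account for the two find_all results in the question. According to the Beautiful Soup documentation, combining a tag name with string= finds tags whose .string matches, while string= on its own finds matching text nodes at any depth. The sketch below assumes the markup from the question; by_string, text_nodes and by_text are illustrative names, and the get_text() filter is one possible workaround rather than the only one.

```python
import re
from bs4 import BeautifulSoup

html = "<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>"
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("text")

# Tag name plus string=: a tag matches only if its own .string matches.
# The first <li> has .string None, so only the second <li> is returned.
by_string = soup.find_all("li", string=pattern)

# string= alone: returns the matching text nodes themselves,
# searching all descendants of each <li>.
text_nodes = [li.find_all(string=pattern) for li in soup.find_all("li")]

# Possible workaround: test the combined text of each <li> instead of .string.
by_text = [li for li in soup.find_all("li") if pattern.search(li.get_text())]

print(len(by_string))  # 1
print(text_nodes)      # [['Some text'], ['And other text is put here']]
print(len(by_text))    # 2
```

Filtering on get_text() treats all text inside the tag as one string, so it matches both li tags here even though the first one has more than one child.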
