I’m using Python 3 and the BeautifulSoup module, version 4.9.3. I’m trying to use this package to practise parsing some simple HTML.
The string I have is the following:
text = '''<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>'''
I use BeautifulSoup as follows:
from bs4 import BeautifulSoup
x = BeautifulSoup(text, "html.parser")
I then experiment with Beautiful Soup’s functionality with the following script:
for li in x.find_all('li'):
    print(li)
    print(li.string)
    print(li.next_element)
    print(li.next_element.string)
    print("\n")
The results (at least for the first iteration) are unexpected:
<li><p>Some text</p>is put here</li>
None
<p>Some text</p>
Some text
<li><p>And other text is put here</p></li>
And other text is put here
<p>And other text is put here</p>
And other text is put here
Why is the string attribute of the first li tag None, whereas the string attribute of the inner p tag is not None?
Similarly, if I do:
import re
x.find_all('li', string=re.compile('text'))
I only get one result (the 2nd tag).
But if I do:
for li in x.find_all('li'):
    print(li.find_all(string=re.compile('text')))
I get 2 results (both tags).
>Solution :
Paraphrasing the doc:
- If a tag has only one child, and that child is a NavigableString, the child is made available as .string.
- If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child.
- If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None.
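As a quick check, here is a minimal sketch with one small fragment per rule. The variable names (one_string, nested, mixed) are only illustrative and the fragments are trimmed-down versions of the HTML in the question; it assumes bs4 is installed and uses the same "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Rule 1: the tag's only child is a NavigableString.
one_string = BeautifulSoup("<p>Some text</p>", "html.parser").p

# Rule 2: the tag's only child is another tag that itself has a .string.
nested = BeautifulSoup("<li><p>And other text</p></li>", "html.parser").li

# Rule 3: the tag has more than one child (a <p> tag plus a bare string).
mixed = BeautifulSoup("<li><p>Some text</p>is put here</li>", "html.parser").li

print(one_string.string)    # Some text
print(nested.string)        # And other text
print(mixed.string)         # None
print(len(mixed.contents))  # 2
```

Looking at .contents is a handy way to see which rule applies: a single entry means rule 1 or 2, more than one means rule 3.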
Let’s apply these rules to your question:
Why is the string attribute of the first li tag None, whereas the string attribute of the inner p tag is not None?
The inner p tag satisfies rule #1; it has exactly one child, and that child is a NavigableString, so .string returns that child.
The first li satisfies rule #3; it has more than one child (the p tag and the bare text "is put here"), so .string would be ambiguous and is defined to be None.
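The same rules also seem to account for the find_all difference in the question. As far as I can tell, when a tag name is given, string= is tested against each matching tag's .string, so the first li (whose .string is None) is skipped; without a tag name, the search runs over the text nodes themselves. The sketch below reproduces both calls on the HTML from the question. The names soup, pattern, by_tag_string, per_li and by_full_text are mine, and the get_text() filter at the end is just one possible workaround, not something taken from the docs.

```python
import re
from bs4 import BeautifulSoup

text = "<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>"
soup = BeautifulSoup(text, "html.parser")
pattern = re.compile("text")

# Tag name + string=: each <li> is kept only if its .string matches.
# The first <li> has .string == None (rule 3), so only the second survives.
by_tag_string = soup.find_all("li", string=pattern)

# string= alone: searches the text nodes under each <li>,
# so the result is a list of strings, not tags.
per_li = [li.find_all(string=pattern) for li in soup.find_all("li")]

# Possible workaround: match against the full text of each <li> instead.
by_full_text = [li for li in soup.find_all("li") if pattern.search(li.get_text())]

print(len(by_tag_string))  # 1
print(per_li)              # [['Some text'], ['And other text is put here']]
print(len(by_full_text))   # 2
```

Note that the second form returns NavigableString objects rather than li tags, which is why it appears to find "both tags".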