I have an issue with BeautifulSoup. Whenever I parse HTML input, it closes tags that weren't closed in the source (e.g. <input>, or tags that were left open by mistake).
For example:
from bs4 import BeautifulSoup
tags = BeautifulSoup('<span id="100" class="test">', "html.parser")
print(str(tags))
Prints:
<span id="100" class="test"></span>
My main goal here is to preserve the original shape of the HTML input after parsing it.
I found that it’s possible by using "XML" parser instead of "html.parser", but I am looking to solve this for "html.parser".
Solution:
You can reach into bs4 internals and change which tags html.parser treats as empty (void) elements, then serialize with a custom formatter that omits the closing slash. This works for my version, bs4==4.12.2:
from bs4 import BeautifulSoup
from bs4.builder import builder_registry
from bs4.formatter import HTMLFormatter
class UnsortedAttributes(HTMLFormatter):
    def __init__(self):
        super().__init__(
            void_element_close_prefix=""
        )  # <-- use void_element_close_prefix="" here
    def attributes(self, tag):
        yield from tag.attrs.items()
html_text = """\
<closed_tag>
<my_tag id="xxx">
<my_other_tag id="zzz">
</closed_tag>"""
builder_registry.lookup("html.parser").empty_element_tags = {"my_tag", "my_other_tag"}
soup = BeautifulSoup(html_text, "html.parser")
print(soup.encode(formatter=UnsortedAttributes()).decode())
Prints:
<closed_tag>
<my_tag id="xxx">
<my_other_tag id="zzz">
</closed_tag>
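The same idea can be applied to the <span> example from the question. The sketch below is only an illustration under a few assumptions: it relies on the same bs4 internals as above (tested reasoning is against bs4==4.12.2, other versions may differ), and the names NoSlashFormatter and render_unclosed are hypothetical helpers, not part of bs4. Because assigning to empty_element_tags changes the builder class for the whole process, the helper extends the default set instead of replacing it and restores the original value afterwards.

```python
from bs4 import BeautifulSoup
from bs4.builder import builder_registry
from bs4.formatter import HTMLFormatter


class NoSlashFormatter(HTMLFormatter):
    """Hypothetical formatter: no '/' on void tags, attributes in source order."""

    def __init__(self):
        super().__init__(void_element_close_prefix="")

    def attributes(self, tag):
        # Yield attributes in their original order instead of sorting them.
        yield from tag.attrs.items()


def render_unclosed(html_text, unclosed_tags):
    """Parse html_text with unclosed_tags treated as void elements.

    Hypothetical helper: it temporarily patches the html.parser builder
    class (a bs4 internal) and restores it when done.
    """
    builder_cls = builder_registry.lookup("html.parser")
    original = builder_cls.empty_element_tags
    # Extend, rather than replace, the default void elements (br, input, ...).
    builder_cls.empty_element_tags = set(original) | set(unclosed_tags)
    try:
        soup = BeautifulSoup(html_text, "html.parser")
        return soup.encode(formatter=NoSlashFormatter()).decode()
    finally:
        # Undo the global change so later parses behave normally.
        builder_cls.empty_element_tags = original


result = render_unclosed('<span id="100" class="test">', {"span"})
print(result)
```

Every tag name passed in is treated as void for that parse, so an input that also contains a properly closed <span>...</span> would probably not round-trip unchanged with this approach.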