XML parser in BeautifulSoup only scrapes the first symbol out of two

I want to read symbols from some XML content stored in a text file. When I use xml as the parser, I get the first symbol only. However, I get both symbols when I use the lxml parser. Here is the XML content.

<?xml version="1.0" encoding="utf-8"?>
<lookupdata symbolstring="WDS">
    <key>
        <symbol>WDS</symbol>
        <exchange>NYE</exchange>
        <openfigi>BBG001S5WCY6</openfigi>
        <qmidentifier>USI79Z473117AAG</qmidentifier>
    </key>
    <equityinfo>
        <longname>
        Woodside Energy Group Limited American Depositary Shares each representing one
        </longname>
        <shortname>Woodside Energy </shortname>
        2
        <instrumenttype>equity</instrumenttype>
        <sectype>DR</sectype>
        <isocfi>EDSXFR</isocfi>
        <issuetype>AD</issuetype>
        <proprietaryquoteeligible>false</proprietaryquoteeligible>
    </equityinfo>
</lookupdata>
<lookupdata symbolstring="PAM">
    <key>
        <symbol>PAM</symbol>
        <exchange>NYE</exchange>
        <openfigi>BBG001T5K0S1</openfigi>
        <qmidentifier>USI68Z3Z75887AS</qmidentifier>
    </key>
    <equityinfo>
        <longname>Pampa Energia S.A.</longname>
        <shortname>PAM</shortname>
        <instrumenttype>equity</instrumenttype>
        <sectype>DR</sectype>
        <isocfi>EDSXFR</isocfi>
        <issuetype>AD</issuetype>
    </equityinfo>
</lookupdata>

When I read the xml content from a text file and parse the symbols, I get only the first symbol.

from bs4 import BeautifulSoup

with open("input_xml.txt") as infile:
    item = infile.read()

soup = BeautifulSoup(item,"xml")
for item in soup.select("lookupdata symbol"):
    print(item.text)

Current output:

WDS

If I replace xml with lxml in BeautifulSoup(item,"xml"), I get both symbols. When I use lxml, a warning pops up, though.

As the content is XML, I would like to stick to the xml parser instead of lxml.

Expected output:

WDS
PAM

>Solution :

The issue seems to be that the parser only loads the first item: it stops as soon as the first lookupdata element ends. A well-formed XML document must have exactly one top-level (root) element, and your file has two, so I'm assuming the parser drops everything after the first root closes (I'm not sure of the internals, this is just an assumption). Note that the "xml" feature in BeautifulSoup is backed by lxml's XML parser, not Python's builtin xml library. You can add a print(soup) after loading the file to see what it actually parsed.
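A quick way to check the assumption that only one top-level element is accepted is to feed the same shape of input to a strict parser. The sketch below uses Python's standard-library xml.etree.ElementTree in place of BeautifulSoup (a swapped-in parser, chosen only because it raises an error instead of silently dropping data). The name count_symbols and the shortened sample string are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two top-level elements in a row: not a well-formed XML document.
two_roots = (
    "<lookupdata><symbol>WDS</symbol></lookupdata>"
    "<lookupdata><symbol>PAM</symbol></lookupdata>"
)


def count_symbols(text):
    """Return the number of <symbol> elements, or None if text is not well-formed."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        # The strict parser refuses input with more than one root element.
        print(f"not well-formed: {exc}")
        return None
    return len(root.findall(".//symbol"))


print(count_symbols(two_roots))                    # rejected by the strict parser
print(count_symbols(f"<root>{two_roots}</root>"))  # single root, both symbols found
```

A lenient parser may recover from the same input by keeping only the first root, which would match the single WDS seen in the question.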

You could use BeautifulSoup(item, "html.parser"), which uses Python's builtin HTML parser. It is lenient about multiple top-level elements, so it finds both symbols.

Or, to keep using the xml parser, wrap the content in a dummy top-level element, like:

from bs4 import BeautifulSoup

with open("input_xml.txt") as infile:
    item = infile.read()

# Wrap everything in one dummy root element so the document has a single top-level element
patched = f"<root>{item}</root>"

soup = BeautifulSoup(patched, "xml")
for found in soup.select("lookupdata symbol"):
    print(found.text)

Output:

WDS
PAM
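
One caveat with wrapping: the file starts with an XML declaration, which the XML specification only allows at the very start of a document. Wrapping the text as-is puts the declaration inside the root element. A lenient parser may tolerate that, but a strict one will not. Here is a minimal sketch that drops the declaration before wrapping. The names wrap_fragments and XML_DECL and the shortened sample are hypothetical, and the standard-library ElementTree stands in for BeautifulSoup so that any well-formedness error would surface instead of being silently recovered.

```python
import re
import xml.etree.ElementTree as ET

# Matches a leading XML declaration such as <?xml version="1.0" encoding="utf-8"?>
XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>")


def wrap_fragments(text, root_tag="root"):
    """Drop a leading XML declaration and wrap the rest in one root element."""
    body = XML_DECL.sub("", text, count=1)
    return f"<{root_tag}>{body}</{root_tag}>"


# Shortened stand-in for the contents of input_xml.txt.
sample = """<?xml version="1.0" encoding="utf-8"?>
<lookupdata symbolstring="WDS"><key><symbol>WDS</symbol></key></lookupdata>
<lookupdata symbolstring="PAM"><key><symbol>PAM</symbol></key></lookupdata>
"""

patched = wrap_fragments(sample)
symbols = [el.text for el in ET.fromstring(patched).iter("symbol")]
print(symbols)
```

The patched string can be passed to BeautifulSoup(patched, "xml") in the same way.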