
Python lxml does not read XML properly

I am using Python 2.7 (unfortunately I cannot upgrade to a newer version) and I am trying to parse two XML files with lxml. Something is not right, and I am not sure what I am doing wrong:

CODE:

from lxml import etree as ET

def string_to_lxml(string):
    xml_file = bytes(bytearray(string, encoding='utf-8'))
    return ET.XML(xml_file)


def find_all(tag, atr):
    return tag.xpath("//%s" % atr)

xml_str_1 = """<?xml version="1.0" encoding="UTF-8"?>
<A xmlns="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.0">
    <B name="SOME_NAME_0">
        <C/>
        <D>SOME NAME</D>
        <AA>
            <dir name="include" filters="*.h *.hpp *.tpp *.i"/>
        </AA>
        <H>
            <TAG_1 name="main" default="true"/>
        </H>
    </B>
    <TT>
        <GG>
            <FF configs="main">
                <TAG_2 name="NAME_1"/>
                <TAG_2 name="NAME_2"/>
                <TAG_3 name="NAME_3"/>
                <TAG_3 name="NAME_4"/>
                <TAG_3 name="NAME_5"/>
            </FF>
        </GG>
    </TT>
</A>"""

xml_str_2 = """<?xml version='1.0' encoding='UTF-8'?>
<A xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://obe.nce.amadeus.net/bms/metadata/1-0/">
    <B name="NAME" version="VERSION">
        <AA>SOME NAME</AA>
        <CC>SOME OTHER NAME</CC>
    </B>
    <C>
        <TAG_3 name="NAME_1" path="path_1"/>
        <TAG_3 name="NAME_2" path="path_2"/>
        <TAG_3 name="NAME_3" path="path_3"/>
    </C>
    <D>
        <TAG_3 type="type" name="NAME_1" version="version_1"/>
        <TAG_3 type="type" name="NAME_2" version="version_2"/>
        <TAG_3 type="type" name="NAME_3" version="version_3"/>
    </D>
</A>
"""
root = string_to_lxml(xml_str_1)
print(find_all(root, "TAG_3"))

root = string_to_lxml(xml_str_2)
print(find_all(root, "TAG_3"))

Output:

[]
[<Element TAG_3 at 0x7f257c126640>, <Element TAG_3 at 0x7f257c126be0>, <Element TAG_3 at 0x7f257c126b90>, <Element TAG_3 at 0x7f257c126e10>, <Element TAG_3 at 0x7f257c128730>, <Element TAG_3 at 0x7f257c128640>]

Did I parse the XML in a wrong way?

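Solution:

The first XML document declares a default (unprefixed) namespace, which must be taken into account:
xmlns="http://www.w3.org/2001/XMLSchema-instance"
Every element in that document therefore belongs to this namespace, while the XPath //TAG_3 only matches elements that are in no namespace. One way around this is to match on the local name only, ignoring the namespace: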

def find_all(tag, atr):
    return tag.xpath("//*[local-name()= '%s']" % atr)

Result:

[<Element {http://www.w3.org/2001/XMLSchema-instance}TAG_3 at 0x7f39cf73de88>, <Element {http://www.w3.org/2001/XMLSchema-instance}TAG_3 at 0x7f39cf73df88>, <Element {http://www.w3.org/2001/XMLSchema-instance}TAG_3 at 0x7f39cf73dfc8>]
[<Element TAG_3 at 0x7f39cf73df88>, <Element TAG_3 at 0x7f39cf73dfc8>, <Element TAG_3 at 0x7f39cf73dec8>, <Element TAG_3 at 0x7f39cf762048>, <Element TAG_3 at 0x7f39cf762088>, <Element TAG_3 at 0x7f39cf762108>]
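
If you would rather match the namespace explicitly instead of ignoring it, a prefix can be bound to the namespace URI at query time. The following is only a minimal sketch under a few assumptions: the helper name `find_all_ns`, the prefix `p`, and the two shortened sample documents are made up for illustration, and the fallback import exists only so the sketch also runs where lxml is not installed.

```python
try:
    from lxml import etree as ET  # the library used in the question
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs.
    import xml.etree.ElementTree as ET

# Namespace URI declared as the default namespace in the first document.
NS = "http://www.w3.org/2001/XMLSchema-instance"


def find_all_ns(root, name, namespace=None):
    """Return descendants of `root` called `name` (hypothetical helper).

    If `namespace` is given, only elements in that namespace match;
    otherwise only elements in no namespace match.
    """
    if namespace:
        # "p" is an arbitrary prefix that exists only inside this query.
        return root.findall(".//p:%s" % name, namespaces={"p": namespace})
    return root.findall(".//%s" % name)


# Shortened stand-ins for the two documents in the question.
doc_with_ns = (
    b'<A xmlns="http://www.w3.org/2001/XMLSchema-instance">'
    b'<FF><TAG_2 name="NAME_1"/><TAG_3 name="NAME_3"/><TAG_3 name="NAME_4"/></FF>'
    b'</A>'
)
doc_without_ns = b'<A><C><TAG_3 name="NAME_1"/></C></A>'

root_ns = ET.fromstring(doc_with_ns)
root_plain = ET.fromstring(doc_without_ns)

print([e.get("name") for e in find_all_ns(root_ns, "TAG_3", NS)])
print([e.get("name") for e in find_all_ns(root_ns, "TAG_3")])
print([e.get("name") for e in find_all_ns(root_plain, "TAG_3")])
```

The second call returns an empty list, which is the same effect as in the question: an unprefixed name does not match elements that live in a default namespace. With lxml, the `xpath()` method accepts a `namespaces` mapping in the same way, for example `root.xpath("//p:TAG_3", namespaces={"p": NS})`.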