xml.etree.ElementTree in Python (3.x) is not parsing one particular attribute value

The data file testfile.xml is this:

<?xml version="1.0" encoding="utf-8"?>
<body>
  <body.head>
    <hedline>
      <hl1 style="header">All the things we lost that summer</hl1>
      <hl2 style="standfirst">It was the promise of seals that sold Virginia on this mission.</hl2>
      <hl2 style="dropcap-large"><em class="dropcap">W</em>e are always calling each other names.</hl2>
    </hedline>
  </body.head>
</body>

The script to parse this file is this:

import xml.etree.ElementTree as ET
tree = ET.parse('testfile.xml')
root = tree.getroot()
if root.find('body.head') is not None:
    if root.find('body.head').find('hedline') is not None:
        for child1 in root.find('body.head').find('hedline'):
            print("Tag    level 1:" + child1.tag)
            print("Attrib level 1:" + str(child1.attrib))
            print("Text   level 1:" + str(child1.text) + "\n")
            for child2 in child1:
                print("Tag    level 2:" + child2.tag)
                print("Attrib level 2:" + str(child2.attrib))
                print("Text   level 2:" + str(child2.text))

And this is the result:

Tag    level 1:hl1
Attrib level 1:{'style': 'header'}
Text   level 1:All the things we lost that summer

Tag    level 1:hl2
Attrib level 1:{'style': 'standfirst'}
Text   level 1:It was the promise of seals that sold Virginia on this mission.

Tag    level 1:hl2
Attrib level 1:{'style': 'dropcap-large'}
Text   level 1:None  <-- THIS IS THE PROBLEM

Tag    level 2:em
Attrib level 2:{'class': 'dropcap'}
Text   level 2:W

I would expect the line "Text   level 1:" for the third element to report the value "e are always calling each other names." from the data file, but instead it reports None.
How can I parse it correctly?
This is Python 3.12 on Windows.

Thanks, Martijn

Solution:

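That’s because in ElementTree (and lxml), "e are always calling each other names." is the .tail of the em element, not part of the .text of hl2.

The .text property only holds the text that comes directly after the element's start tag, before its first child element. In this hl2 there is no such text, because the em child begins immediately after the start tag, so .text is None.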
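
To make the text/tail split concrete, here is a minimal sketch that parses only the problematic hl2 element. The XML is inlined from the question so the snippet is self-contained; the variable names are just for illustration.

```python
import xml.etree.ElementTree as ET

# The problematic element from the question, inlined for a self-contained sketch.
XML = (
    '<hl2 style="dropcap-large">'
    '<em class="dropcap">W</em>e are always calling each other names.'
    '</hl2>'
)

hl2 = ET.fromstring(XML)
em = hl2.find("em")

print(repr(hl2.text))  # text between <hl2 ...> and <em ...>: there is none
print(repr(em.text))   # text inside <em>...</em>
print(repr(em.tail))   # text after </em>, up to the next tag
```

The text that follows the closing em tag is stored on em itself, in em.tail, rather than on its parent.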

See the Element.tail entry in the xml.etree.ElementTree documentation for more info.
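
If what you want is the complete text of each headline regardless of inline markup, one possible approach (a sketch, not the only way) is to join the strings yielded by Element.itertext(), which walks an element's own text plus the text and tail of its descendants in document order. The XML below is the content of testfile.xml inlined so the example runs on its own; the texts list is only there for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: same content as testfile.xml, inlined so the sketch is self-contained.
XML = """<body>
  <body.head>
    <hedline>
      <hl1 style="header">All the things we lost that summer</hl1>
      <hl2 style="standfirst">It was the promise of seals that sold Virginia on this mission.</hl2>
      <hl2 style="dropcap-large"><em class="dropcap">W</em>e are always calling each other names.</hl2>
    </hedline>
  </body.head>
</body>"""

root = ET.fromstring(XML)
head = root.find("body.head")
hedline = head.find("hedline") if head is not None else None

texts = []  # collected full text of each headline (illustrative)
if hedline is not None:
    for child in hedline:
        # itertext() yields child.text, then each descendant's text and tail,
        # so inline elements such as <em> no longer split the sentence.
        full_text = "".join(child.itertext())
        texts.append(full_text)
        print("Tag   :" + child.tag)
        print("Attrib:" + str(child.attrib))
        print("Text  :" + full_text + "\n")
```

This flattens the markup, so the information that "W" was wrapped in an em element is lost; if you need that, keep iterating over the children and read child.text and child.tail separately.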
