The data file testfile.xml is this:
<?xml version="1.0" encoding="utf-8"?>
<body>
<body.head>
<hedline>
<hl1 style="header">All the things we lost that summer</hl1>
<hl2 style="standfirst">It was the promise of seals that sold Virginia on this mission.</hl2>
<hl2 style="dropcap-large"><em class="dropcap">W</em>e are always calling each other names.</hl2>
</hedline>
</body.head>
</body>
The script to parse this file is this:
import xml.etree.ElementTree as ET
tree = ET.parse('testfile.xml')
root = tree.getroot()
if root.find('body.head') is not None:
    if root.find('body.head').find('hedline') is not None:
        for child1 in root.find('body.head').find('hedline'):
            print("Tag level 1:" + child1.tag)
            print("Attrib level 1:" + str(child1.attrib))
            print("Text level 1:" + str(child1.text) + "\n")
            for child2 in child1:
                print("Tag level 2:" + child2.tag)
                print("Attrib level 2:" + str(child2.attrib))
                print("Text level 2:" + str(child2.text))
And this is the result:
Tag level 1:hl1
Attrib level 1:{'style': 'header'}
Text level 1:All the things we lost that summer
Tag level 1:hl2
Attrib level 1:{'style': 'standfirst'}
Text level 1:It was the promise of seals that sold Virginia on this mission.
Tag level 1:hl2
Attrib level 1:{'style': 'dropcap-large'}
Text level 1:None <-- THIS IS THE PROBLEM
Tag level 2:em
Attrib level 2:{'class': 'dropcap'}
Text level 2:W
I would expect the "Text level 1:" line for the third headline to print "e are always calling each other names." from the data file, but it prints None instead.
How can I parse this correctly?
This is Python 3.12 on Windows.
Thanks, Martijn
>Solution :
That’s because in ElementTree (and lxml), "e are always calling each other names." is the .tail of the em element.
The .text property only holds the text that comes before the element's first child element. In this case, the third hl2 starts directly with its em child, so there is no such text and .text is None.
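A minimal sketch of the difference, using only the third headline from the question, inlined as a string so it runs without testfile.xml (the variable names are just for illustration):

```python
import xml.etree.ElementTree as ET

# The third headline from the question, inlined for a self-contained example.
xml_data = (
    '<hl2 style="dropcap-large">'
    '<em class="dropcap">W</em>e are always calling each other names.'
    '</hl2>'
)

hl2 = ET.fromstring(xml_data)
em = hl2.find('em')

print(hl2.text)  # None: nothing precedes the <em> child
print(em.text)   # W
print(em.tail)   # e are always calling each other names.
```

So the "missing" text is still in the tree; it is attached to the em element as its tail rather than to hl2 as its text.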
See the Element.tail entry in the xml.etree.ElementTree documentation for more info.
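If what you actually want is the complete visible text of each headline, one option is Element.itertext(), which walks an element's own text plus the text and tail of everything nested inside it. The sketch below assumes the same structure as testfile.xml but inlines the data so it is self-contained; the headlines list is just an illustrative name:

```python
import xml.etree.ElementTree as ET

# Same structure as testfile.xml, inlined so the example runs on its own.
xml_data = """<body>
<body.head>
<hedline>
<hl1 style="header">All the things we lost that summer</hl1>
<hl2 style="standfirst">It was the promise of seals that sold Virginia on this mission.</hl2>
<hl2 style="dropcap-large"><em class="dropcap">W</em>e are always calling each other names.</hl2>
</hedline>
</body.head>
</body>"""

root = ET.fromstring(xml_data)

headlines = []
head = root.find('body.head')
hedline = head.find('hedline') if head is not None else None
if hedline is not None:
    for child in hedline:
        # itertext() yields child.text, then each descendant's text and tail, in document order.
        headlines.append(''.join(child.itertext()))

for line in headlines:
    print(line)
```

With your real file you would keep ET.parse('testfile.xml') and tree.getroot() instead of ET.fromstring(). If you need to treat the drop cap separately (for example to keep its class attribute), read em.text and em.tail directly instead of joining everything.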