How to parse and stack XML nodes and children correctly?

January 26, 2022

I am currently trying to analyze voting behavior in the European Parliament using the Parliament's XML interface. I can import and manipulate the data, but I am not able to build a meaningful pandas DataFrame from it.

For example, I try to set up two data frames, one with "for" votes and one with "against" votes. However, both data frames come out with the same size and the same order. Can someone please help?

Thanks!

import xml.etree.ElementTree as ET
from urllib.request import urlopen

import pandas as pd  # needed for pd.DataFrame below

var_url = urlopen('https://www.europarl.europa.eu/doceo/document/PV-9-2020-12-18-RCV_FR.xml')
xmldoc = ET.parse(var_url)
xmlroot = xmldoc.getroot()

vote_items = []
all_vote_items = []
for avote in xmlroot.iter('RollCallVote.Result'):
    vote_Nr = avote.attrib.get('Identifier')
    for anitem in avote.iter('Result.For'):
            for amep in avote.iter('PoliticalGroup.Member.Name'):
                mep_id = amep.get('MepId')
                vote_items = [vote_Nr, mep_id]
                all_vote_items.append(vote_items)
for_meps = pd.DataFrame(all_vote_items,columns=['VOTE_NUMBER','vmep_id'])      


vote_items = []
all_vote_items = []
for avote in xmlroot.iter('RollCallVote.Result'):
    vote_Nr = avote.attrib.get('Identifier')
    for anitem in avote.iter('Result.Against'):
            for amep in avote.iter('PoliticalGroup.Member.Name'):
                mep_id = amep.get('MepId')
                vote_items = [vote_Nr, mep_id]
                all_vote_items.append(vote_items)
against_meps = pd.DataFrame(all_vote_items,columns=['VOTE_NUMBER','vmep_id'])

UPDATE:

I have now tried to combine all three vote types, but I end up with a (39, 4) data frame. How can I stack the results correctly?

vote_items = []
all_vote_items = []

for avote in xmlroot.iter('RollCallVote.Result'):
    vote_Nr = avote.attrib.get('Identifier')
    
    for anitem in avote.iter('Result.For'):
        for agroup in anitem.iter('Result.PoliticalGroup.List'):
            for amep in agroup.iter('PoliticalGroup.Member.Name'):
                mep_id_for = amep.get('MepId')
    
    for anitem in avote.iter('Result.Against'):
        for agroup in anitem.iter('Result.PoliticalGroup.List'):
            for amep in agroup.iter('PoliticalGroup.Member.Name'):
                mep_id_against = amep.get('MepId')
                
    for anitem in avote.iter('Result.Abstention'):
        for agroup in anitem.iter('Result.PoliticalGroup.List'):
            for amep in agroup.iter('PoliticalGroup.Member.Name'):
                mep_id_abstention = amep.get('MepId')
    
    
    vote_items = [vote_Nr, mep_id_for, mep_id_against, mep_id_abstention]
    all_vote_items.append(vote_items)
            
all_meps = pd.DataFrame(all_vote_items,columns=['vote_nr','vote_for','vote_against','vote_abstention'])

Solution:

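I believe the bug is in this line, which appears in both loops:

for amep in avote.iter('PoliticalGroup.Member.Name'):

Calling iter on avote walks the whole vote element, so it returns every MEP listed under that vote, whether they voted for or against. That is why both data frames end up with the same size and order. You should iterate over the anitem objects instead of avote, so that only the members inside the current Result.For or Result.Against element are collected. There are two places where you need to fix it. I've just checked, and this results in different all_vote_items lists.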
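The fix can also be sketched in a way that covers the UPDATE. The code there appends only one row per vote and keeps just the last MepId seen in each loop, which is why it collapses to one row per roll-call vote. A minimal sketch, assuming a "long" layout with one row per MEP per vote is acceptable: the sample XML below is made up for illustration (only the tag and attribute names are taken from the question), and the names SAMPLE_XML, VOTE_TAGS and collect_votes are hypothetical helpers, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hand-made stand-in for the real document (hypothetical data; only the
# tag and attribute names are taken from the question).
SAMPLE_XML = """
<PV.RollCallVoteResults>
  <RollCallVote.Result Identifier="1">
    <Result.For>
      <Result.PoliticalGroup.List Identifier="A">
        <PoliticalGroup.Member.Name MepId="101">Alpha</PoliticalGroup.Member.Name>
        <PoliticalGroup.Member.Name MepId="102">Beta</PoliticalGroup.Member.Name>
      </Result.PoliticalGroup.List>
    </Result.For>
    <Result.Against>
      <Result.PoliticalGroup.List Identifier="B">
        <PoliticalGroup.Member.Name MepId="103">Gamma</PoliticalGroup.Member.Name>
      </Result.PoliticalGroup.List>
    </Result.Against>
  </RollCallVote.Result>
  <RollCallVote.Result Identifier="2">
    <Result.For>
      <Result.PoliticalGroup.List Identifier="A">
        <PoliticalGroup.Member.Name MepId="101">Alpha</PoliticalGroup.Member.Name>
      </Result.PoliticalGroup.List>
    </Result.For>
    <Result.Abstention>
      <Result.PoliticalGroup.List Identifier="A">
        <PoliticalGroup.Member.Name MepId="102">Beta</PoliticalGroup.Member.Name>
      </Result.PoliticalGroup.List>
    </Result.Abstention>
  </RollCallVote.Result>
</PV.RollCallVoteResults>
"""

# Maps each result element to the label stored in the output rows.
VOTE_TAGS = {
    'Result.For': 'for',
    'Result.Against': 'against',
    'Result.Abstention': 'abstention',
}


def collect_votes(root):
    """Return one (vote_nr, mep_id, vote) tuple per MEP per roll-call vote."""
    rows = []
    for avote in root.iter('RollCallVote.Result'):
        vote_nr = avote.get('Identifier')
        for tag, label in VOTE_TAGS.items():
            for anitem in avote.iter(tag):
                # Iterate over anitem, not avote, so only the members
                # inside this result element are collected.
                for amep in anitem.iter('PoliticalGroup.Member.Name'):
                    rows.append((vote_nr, amep.get('MepId'), label))
    return rows


rows = collect_votes(ET.fromstring(SAMPLE_XML.strip()))
for_rows = [r for r in rows if r[2] == 'for']
against_rows = [r for r in rows if r[2] == 'against']
print(len(rows), len(for_rows), len(against_rows))
```

With the real document, passing xmlroot to collect_votes should work the same way, assuming the file follows this structure. The rows can then be turned into a single data frame with pd.DataFrame(rows, columns=['vote_nr', 'mep_id', 'vote']) and filtered on the vote column instead of building one frame per vote type.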