I am very new to this and trying to do a basic task.
From XML files like this, I am trying to extract the PMID and the publication type. Three sample files are here
<PMID Version="1">144418</PMID>
<PublicationType UI="D016428">Journal Article</PublicationType>
Ideally I want to end up with a pandas DataFrame:
Expected output:
PMID Publication_type
1 Journal article
2 Journal article
3 Journal article
But if somebody can tell me how to extract these for even one file, I would greatly appreciate that too! I will figure out how to put it into a DataFrame.
>Solution :
- Use glob to iterate through all XML files
- Use BeautifulSoup to parse the XML content
- Use soup.find() and soup.find_all() to find elements in the XML
- Use .text (an attribute, not a method call) to get the string from the text node under an element
- Store the content as a dict and append it to a list
- Use pd.DataFrame(<list>) to create a DataFrame from that list
- Note that each PMID might have multiple Publication_type values, so use explode() to split each list of Publication_type into separate rows, each keeping its PMID
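To see the explode() step in isolation, here is a minimal sketch on a hand-made list of dicts. The two records are copied from the output shown further down; the variable names (records, exploded) are only illustrative.

```python
import pandas as pd

# One dict per article. Publication_type holds a list because a single
# PMID can carry more than one publication type. (Illustrative records.)
records = [
    {"PMID": "144418", "Publication_type": ["Journal Article"]},
    {"PMID": "272056", "Publication_type": ["English Abstract", "Journal Article"]},
]

df = pd.DataFrame(records)

# explode() gives every list element its own row and repeats the PMID;
# ignore_index=True renumbers the resulting rows from 0.
exploded = df.explode("Publication_type", ignore_index=True)
print(exploded)
```

Without explode(), the second row would keep both publication types together in one cell as a Python list.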
Code:
import pandas as pd
from glob import glob
from bs4 import BeautifulSoup

l = list()
for f in glob('*.xml'):
    pub = dict()
    with open(f, 'r') as xml_file:
        xml = xml_file.read()
    # The "lxml" parser treats the input as HTML and lowercases tag names,
    # which is why the lookups below use 'pmid', 'publicationtype', etc.
    soup = BeautifulSoup(xml, "lxml")
    pub['PMID'] = soup.find('pmid').text
    pub_list = soup.find('publicationtypelist')
    pub['Publication_type'] = list()
    # Collect every publication type listed for this article
    for pub_type in pub_list.find_all('publicationtype'):
        pub['Publication_type'].append(pub_type.text)
    l.append(pub)

df = pd.DataFrame(l)
# One row per (PMID, Publication_type) pair
df = df.explode('Publication_type', ignore_index=True)
Output:
   PMID    Publication_type
0  144418  Journal Article
1  272056  English Abstract
2  272056  Journal Article
3  349115  Editorial
4  349115  Historical Article
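For the single-file case asked about in the question, the same extraction can also be sketched with the standard library's xml.etree.ElementTree instead of BeautifulSoup. This is an alternative, not the method used above. The sample XML is a hypothetical, trimmed-down record: only the PMID and PublicationType elements come from the question, while the wrapper elements and the function name extract_pub_info are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down record. Only <PMID> and <PublicationType>
# are taken from the question; the surrounding elements are assumed.
SAMPLE_XML = """<PubmedArticle>
  <MedlineCitation>
    <PMID Version="1">144418</PMID>
    <Article>
      <PublicationTypeList>
        <PublicationType UI="D016428">Journal Article</PublicationType>
      </PublicationTypeList>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def extract_pub_info(xml_text):
    """Return the first PMID and all publication types found in xml_text."""
    root = ET.fromstring(xml_text)
    # ElementTree is case-sensitive, so tag names are written as in the file.
    pmid = root.findtext(".//PMID")
    pub_types = [el.text for el in root.iter("PublicationType")]
    return {"PMID": pmid, "Publication_type": pub_types}


info = extract_pub_info(SAMPLE_XML)
print(info)
```

To apply this to a file on disk, read the file's contents and pass the text to extract_pub_info; a list of the returned dicts can be handed to pd.DataFrame() and exploded in the same way as above.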