Extracting pmid and publication type from PubMed xml in Python

December 21, 2021

I am very new to this and am trying to do a basic task.
From XML files like this, I am trying to extract the PMID and the publication type. Three sample files are here

<PMID Version="1">144418</PMID>
<PublicationType UI="D016428">Journal Article</PublicationType>

Ideally I want to end up with a pandas DataFrame:

Expected output:

PMID    Publication_type
1       Journal article
2       Journal article
3       Journal article

But if somebody can at least tell me how to extract these for one file, I would greatly appreciate that too! I will figure out how to put it into a dataframe.

>Solution :

Use glob to iterate through all XML files
Use BeautifulSoup to parse XML content
Use soup.find() and soup.find_all() to find elements in the XML
Use .text (an attribute, not a method) to get the string from the text node under the element
Store content as a dict and append to a list
Use pd.DataFrame(<list>) to create dataframe from given list
Note that each PMID can have several Publication_type values, so use explode() to split the list of Publication_type into multiple rows, one per type, each repeating the PMID
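
To show the explode() step on its own, here is a minimal sketch that skips the XML parsing and starts from a hand-written list of dicts. The records list is an assumption for illustration; its values are taken from the output further down.

```python
import pandas as pd

# Hand-written stand-in for the parsed records (values taken from the output below).
records = [
    {"PMID": "144418", "Publication_type": ["Journal Article"]},
    {"PMID": "272056", "Publication_type": ["English Abstract", "Journal Article"]},
]

df = pd.DataFrame(records)

# explode() turns each list element into its own row and repeats the PMID;
# ignore_index=True renumbers the rows 0..n-1 (needs pandas 1.1 or newer).
df = df.explode("Publication_type", ignore_index=True)
print(df)
```

The second record has two publication types, so it ends up as two rows sharing PMID 272056.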

Code:

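import pandas as pd
from glob import glob
from bs4 import BeautifulSoup

records = []  # one dict per XML file

for path in sorted(glob('*.xml')):
    with open(path, 'r', encoding='utf-8') as xml_file:
        xml = xml_file.read()

    # The "lxml" (HTML) parser lowercases tag names, hence 'pmid' rather than 'PMID'.
    soup = BeautifulSoup(xml, "lxml")

    # find() returns the first match only, so this assumes one article per file.
    pub = {'PMID': soup.find('pmid').text}

    # Collect every publication type of the article (empty list if none is found).
    pub_list = soup.find('publicationtypelist')
    pub_types = pub_list.find_all('publicationtype') if pub_list is not None else []
    pub['Publication_type'] = [pub_type.text for pub_type in pub_types]
    records.append(pub)

df = pd.DataFrame(records)
# One row per (PMID, publication type) pair.
df = df.explode('Publication_type', ignore_index=True)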

Output:

    PMID    Publication_type
0   144418  Journal Article
1   272056  English Abstract
2   272056  Journal Article
3   349115  Editorial
4   349115  Historical Article
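
If you only need a single file, or if one file holds several articles, a variant using only the standard library might look like the sketch below. Note that this swaps BeautifulSoup for xml.etree.ElementTree, so it is an alternative and not the approach above. SAMPLE_XML and extract_rows are hypothetical names, and the sample is heavily trimmed (attributes and most elements removed); it assumes the usual PubmedArticle > MedlineCitation > PMID nesting.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample; real PubMed records carry many more elements.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>144418</PMID>
      <Article>
        <PublicationTypeList>
          <PublicationType>Journal Article</PublicationType>
        </PublicationTypeList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>272056</PMID>
      <Article>
        <PublicationTypeList>
          <PublicationType>English Abstract</PublicationType>
          <PublicationType>Journal Article</PublicationType>
        </PublicationTypeList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_rows(xml_text):
    """Return one (pmid, publication_type) tuple per publication type."""
    root = ET.fromstring(xml_text)
    rows = []
    # iter() also matches the root itself, so this works whether the root is
    # a PubmedArticleSet or a single PubmedArticle.
    for article in root.iter("PubmedArticle"):
        # ElementTree is case-sensitive, so tag names keep their original case.
        pmid = article.findtext("MedlineCitation/PMID")
        for pub_type in article.iter("PublicationType"):
            rows.append((pmid, pub_type.text))
    return rows


rows = extract_rows(SAMPLE_XML)
print(rows)
```

For a file on disk, reading it and passing the text to extract_rows should work the same way; the resulting tuples can go straight into pd.DataFrame(rows, columns=['PMID', 'Publication_type']) without needing explode().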