Extracting pmid and publication type from PubMed xml in Python

I am very new to this and am trying to do a basic task.
From an XML file like this, I am trying to extract the PMID and the publication type. Three sample files are here

<PMID Version="1">144418</PMID>
<PublicationType UI="D016428">Journal Article</PublicationType>

Ideally, I want to end up with a pandas DataFrame:

Expected output:

PMID    Publication_type
1       Journal article
2       Journal article
3       Journal article

But if somebody can tell me how to extract these for at least one file, I would greatly appreciate that too! I will figure out how to put it into a dataframe.

Solution:

  • Use glob to iterate through all XML files

  • Use BeautifulSoup to parse XML content

  • Use soup.find() and soup.find_all() to find elements in the XML

  • Use .text (an attribute, not a method) to get the string from the text node under the element

  • Store content as a dict and append to a list

  • Use pd.DataFrame(<list>) to create dataframe from given list

  • Note that each PMID may have several publication types, so use explode() to split the list in Publication_type into one row per type, each row keeping its PMID
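To see what explode() does in isolation, here is a minimal sketch. It builds two rows by hand (the PMIDs and publication types are taken from the output shown further down; the variable names are only illustrative) and expands the list column into one row per publication type.

```python
import pandas as pd

# Hand-built records: one dict per article, with a list of publication types
records = [
    {"PMID": "144418", "Publication_type": ["Journal Article"]},
    {"PMID": "272056", "Publication_type": ["English Abstract", "Journal Article"]},
]

df = pd.DataFrame(records)

# Each list element becomes its own row; the PMID is repeated alongside it.
# ignore_index=True renumbers the rows 0..n-1 (requires pandas 1.1 or newer).
exploded = df.explode("Publication_type", ignore_index=True)
print(exploded)
```

Without ignore_index=True, the rows that came from the same article would share the original index value.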

Code:

import pandas as pd
from glob import glob
from bs4 import BeautifulSoup

records = []  # one dict per XML file

for path in glob('*.xml'):
    with open(path, 'r') as xml_file:
        xml = xml_file.read()

    # The "lxml" parser treats the input as HTML and lowercases tag names,
    # which is why the lookups below use lowercase names
    soup = BeautifulSoup(xml, "lxml")
    pub = {'PMID': soup.find('pmid').text}

    # Collect every publication type listed for this article
    pub_list = soup.find('publicationtypelist')
    pub['Publication_type'] = [
        pub_type.text for pub_type in pub_list.find_all('publicationtype')
    ]
    records.append(pub)

df = pd.DataFrame(records)
# One row per (PMID, publication type) pair
df = df.explode('Publication_type', ignore_index=True)

Output:

       PMID    Publication_type
   0   144418  Journal Article
   1   272056  English Abstract
   2   272056  Journal Article
   3   349115  Editorial
   4   349115  Historical Article
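
For the single-file case the question asks about, the same extraction can also be sketched with only the standard library. This is an alternative to the BeautifulSoup approach, not part of the answer above. The sample XML is a hand-written, heavily trimmed stand-in for a real PubMed record, and extract_rows is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample; a real PubMed file contains many more elements
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">272056</PMID>
      <Article>
        <PublicationTypeList>
          <PublicationType>English Abstract</PublicationType>
          <PublicationType UI="D016428">Journal Article</PublicationType>
        </PublicationTypeList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_rows(xml_text):
    """Return (pmid, publication_type) tuples for one single-article document."""
    root = ET.fromstring(xml_text)
    # First PMID in document order; assumed here to be the article's own PMID
    pmid = root.findtext(".//PMID")
    # ElementTree keeps the original tag case, unlike the "lxml" HTML parser
    return [(pmid, pub_type.text) for pub_type in root.iter("PublicationType")]


rows = extract_rows(SAMPLE_XML)
print(rows)
```

To read from disk instead, ET.parse(path).getroot() can replace ET.fromstring(xml_text), and the resulting tuples can be passed to pd.DataFrame(rows, columns=['PMID', 'Publication_type']) with no explode() step, since each tuple is already one row.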