Extracting a specific tag from XML in Python using BeautifulSoup

I have a metadata file that looks like this:

<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="uuid_id" version="2.0">
    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
        <dc:title>Princeton Review Digital SAT Premium Prep, 2024: 4 Practice Tests + Online Flashcards + Review &amp; Tools</dc:title>
        <dc:creator opf:file-as="Princeton Review, The" opf:role="aut">The Princeton Review</dc:creator>
        <dc:identifier opf:scheme="ISBN">9780593516874</dc:identifier>
        <dc:identifier opf:scheme="AMAZON">0593516877</dc:identifier>
        <dc:identifier opf:scheme="GOODREADS">63139948</dc:identifier>
        <dc:identifier opf:scheme="GOOGLE">o6i4EAAAQBAJ</dc:identifier>
    </metadata>
</package>

I know how to use BeautifulSoup to extract fields like <dc:title>. I'm struggling to extract only the ISBN field (<dc:identifier opf:scheme="ISBN">).

from bs4 import BeautifulSoup

with open('metadata.opf', 'r', encoding='utf-8') as f:
    file = f.read()

metadata = BeautifulSoup(file, 'xml')
title = metadata.find('dc:title')
print(title.text)

author = metadata.find('dc:creator')
print(author.text)

# isbn = metadata.find_all('dc:identifier')  # This finds 4 fields, as expected.

How do I limit it? I can’t depend on the order of the fields, and the ISBN length can vary.

Solution:

According to the documentation, the find method accepts an attrs argument: a dictionary of attribute names and values that a tag must match. Using it, you can select only the identifier whose opf:scheme is ISBN:

isbn = metadata.find('dc:identifier', attrs={"opf:scheme": "ISBN"})

So the code could be written like this:

from bs4 import BeautifulSoup

with open('metadata.opf', 'r', encoding='utf-8') as f:
    file = f.read()

metadata = BeautifulSoup(file, 'xml')
title = metadata.find('dc:title')
print(title.text)

author = metadata.find('dc:creator')
print(author.text)

isbn = metadata.find('dc:identifier', attrs={"opf:scheme": "ISBN"})  # Matches only the identifier whose opf:scheme is ISBN.
print(isbn.text)

It should print:

Princeton Review Digital SAT Premium Prep, 2024: 4 Practice Tests + Online Flashcards + Review & Tools
The Princeton Review
9780593516874

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find
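
Since the order of the identifiers can't be relied on, another option is to collect every dc:identifier into a dictionary keyed by its opf:scheme value and look up the one you need. The following is only a sketch: the helper name identifiers_by_scheme is made up for illustration, the XML is a trimmed copy of the metadata from the question with the identifiers deliberately reordered, and it assumes lxml is installed, which the 'xml' parser needs.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's metadata, inlined so the sketch runs on its own.
# The identifiers are deliberately reordered to show that order does not matter.
OPF = """<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="uuid_id" version="2.0">
    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
        <dc:identifier opf:scheme="GOODREADS">63139948</dc:identifier>
        <dc:identifier opf:scheme="ISBN">9780593516874</dc:identifier>
        <dc:identifier opf:scheme="AMAZON">0593516877</dc:identifier>
    </metadata>
</package>"""


def identifiers_by_scheme(soup):
    """Hypothetical helper: map each opf:scheme value to its identifier text."""
    return {
        tag["opf:scheme"]: tag.get_text(strip=True)
        for tag in soup.find_all("dc:identifier")
        if tag.has_attr("opf:scheme")  # skip identifiers that carry no scheme
    }


soup = BeautifulSoup(OPF, "xml")
ids = identifiers_by_scheme(soup)

isbn = ids.get("ISBN")     # None if the file has no ISBN identifier
missing = ids.get("DOI")   # a scheme that is absent from this file
print(isbn)
```

Using dict.get here avoids the AttributeError that isbn.text would raise in the code above when find returns None, that is, when the file has no ISBN entry.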
