Beautiful Soup extract string before tag

I have an XML file that has ref tags nested inside para tags:

<para>here be text<ref> REF 1 </ref>and here be some more text</para>

Is there a way to use Beautiful Soup to extract the string between the opening para tag and the opening ref tag, i.e.:

here be text

I’ve tried various things to no avail, including find_previous:

from bs4 import BeautifulSoup

soup = BeautifulSoup(file, 'xml')

ref = soup.find('ref')
ref_before = ref.find_previous('para')

But (obviously) ref_before is the entire para tag, so its text is everything inside it, i.e.:

here be text REF 1 and here be some more text

I think this ought to be really simple but I don’t have much experience and just can’t crack it. Any help much appreciated.

Solution:

You can use the para tag's contents list and select the first element, which is the text node that comes before the ref tag:

soup.find('para').contents[0]

Output:

'here be text'
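
Note that contents[0] only works when the text is the very first child of para. As a rough sketch of an alternative, you could start from the ref tag and look at its previous_sibling, which should also work if other tags come earlier in the same para. This sketch assumes the sample markup from the question as an inline string and uses the built-in 'html.parser' so it runs without lxml; the names xml and text_before are just illustrative. For this snippet the 'xml' parser should behave the same way.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup from the question, inlined here as an assumption
# (the original code reads it from a file).
xml = "<para>here be text<ref> REF 1 </ref>and here be some more text</para>"

# 'html.parser' is used only so the sketch needs no extra dependency.
soup = BeautifulSoup(xml, "html.parser")

ref = soup.find("ref")

# The node immediately before <ref> within the same parent.
before = ref.previous_sibling

# Guard against the case where <ref> is preceded by another tag or by nothing.
text_before = str(before) if isinstance(before, NavigableString) else ""

print(text_before)  # here be text
```

If the ref tag might be preceded by another tag rather than by text, the isinstance check returns an empty string instead of raising an error.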