Scrape Text and save File with Bold Text Intact?

I am very new to Python and web scraping. I have searched for an answer but cannot find one, possibly because I don’t know the terminology to ask the right question.

I am trying to scrape a website with Python and Beautiful Soup to extract the English transliterations from the verb tables on a site that conjugates modern Hebrew verbs (https://www.pealim.com/dict/28-lavo/). I then want to save the text to a .txt file. The sticking point is keeping the bold formatting tags intact during scraping and saving, because they show where the stress falls in each word.

Here is an example of what I am getting:
ba’im

And here is what I would like:
ba’<b>i</b>m

I’m including an image because when I post the HTML code, it is rendered automatically:

[Image: What I’m looking to do]

By looking around the forums, I have come up with code that gets me close to what I need, but I cannot figure out how to keep the bold tags as well.

import requests
from bs4 import BeautifulSoup as bs

# Load webpage content
r = requests.get("https://www.pealim.com/dict/28-lavo/")

# Convert to a soup object
soup = bs(r.content)

# Find the transliterations from the verb tables with the stress bolded
mine = [element.text for element in soup.find_all("div", "transcription")]

# Save to file
with open("lavo.txt", "w") as output:
    for i in mine:
        output.write('%s\n' % i)

Solution:

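You can use the .contents property: convert each child node to a string and join the results. Calling str() on a tag returns its HTML markup, so the <b> tags survive, whereas .text keeps only the text and drops the tags. For example: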

import requests
from bs4 import BeautifulSoup as bs

# load webpage content
r = requests.get("https://www.pealim.com/dict/28-lavo/")

# Convert to a soup object
soup = bs(r.content, "html.parser")

# Find the transliterations from the verb tables with the stress bolded
mine = [
    "".join(map(str, element.contents))
    for element in soup.find_all("div", "transcription")
]

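# Save to file, one transliteration per line (explicit encoding avoids platform defaults)
with open("lavo.txt", "w", encoding="utf-8") as output:
    for i in mine:
        output.write("%s\n" % i)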

This writes the following to lavo.txt:

b<b>a</b>
ba'<b>a</b>
ba'<b>i</b>m
ba'<b>o</b>t
b<b>a</b>ti
b<b>a</b>nu
b<b>a</b>ta
b<b>a</b>t
bat<b>e</b>m

...
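
To see the difference without hitting the website, here is a minimal offline sketch. The HTML string is a hypothetical sample that I made up to mimic the structure of the transcription divs; it is not copied from the page. It compares .text, the .contents join used above, and Beautiful Soup's decode_contents() method, which should give the same inner markup for simple content like this.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the live page (assumed structure,
# not copied from the site), so the example runs without network access.
html = """
<div class="transcription">b<b>a</b></div>
<div class="transcription">ba'<b>i</b>m</div>
"""

soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", "transcription")

# .text keeps only the text nodes, so the <b> tags are lost.
plain = [div.text for div in divs]

# str() on each child returns its markup; joining rebuilds the inner HTML.
joined = ["".join(map(str, div.contents)) for div in divs]

# decode_contents() renders the inner HTML of a tag in one call.
inner = [div.decode_contents() for div in divs]

print(plain)
print(joined)
print(inner)
```

Either of the last two approaches can be dropped into the list comprehension in the answer; which one reads better is mostly a matter of taste.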