Getting non-English text from an HTML document

I’m trying to get the title of an HTML document in Python, but I’m getting garbled characters instead. I suspect an encoding problem, but the HTML document is saved as UTF-8.
Is there any way to get the proper letters?

Here is the code and the output I’m getting:

from bs4 import BeautifulSoup

with open("index.html") as file:
    src = file.read()

soup = BeautifulSoup(src, "lxml")

title = soup.title.text

print(title)

Главная страница

Solution:

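You need to specify the encoding explicitly when opening the file. Without the encoding argument, open() decodes the file using the platform's default locale encoding, which is not necessarily UTF-8: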

with open("index.html", encoding='utf-8') as file:
    src = file.read()
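
To see why the encoding argument matters, here is a minimal, standard-library-only sketch. It is an illustration under assumptions: the sample page and temporary file are hypothetical, and latin-1 is used only as a stand-in for a non-UTF-8 default, since the question does not say which default encoding was actually in effect.

```python
import tempfile
from pathlib import Path

# Hypothetical sample page; its title is "Главная страница" ("Main page").
html = "<html><head><title>Главная страница</title></head></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.html"
    # Save the page as UTF-8 bytes, as in the question.
    path.write_bytes(html.encode("utf-8"))

    # Explicit UTF-8: the bytes are decoded back to the original text.
    with open(path, encoding="utf-8") as file:
        good = file.read()

    # Assumed stand-in for a non-UTF-8 default: every byte of each
    # two-byte Cyrillic character becomes its own unrelated character.
    with open(path, encoding="latin-1") as file:
        bad = file.read()

print(good == html)  # the title survives the round trip
print(bad == html)   # the title is garbled
```

The garbled string still holds the original bytes, so nothing is lost in the file itself. Only the decoding step was wrong, which is why passing encoding='utf-8' to open() is enough to fix it.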