
How to decode strange symbols from parser (bs4) into Cyrillic?

I tried importing lxml and identifying the encoding, but with no success. Online decoding tools can't convert the text back to Cyrillic; only Windows-1250 and ISO-8859-1 manage to decode some of the symbols.

import requests
from bs4 import BeautifulSoup

gismeteo = 'https://www.gismeteo.ua/ua/weather-novomoskovsk-10961/weekly/'

headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0' }
req1 = requests.get(gismeteo, headers=headers)

data1 = BeautifulSoup(req1.text, 'html.parser')

day_a1 = data1.find('div', class_='widget-row widget-row-days-date')
day_b1 = str([da1.text.replace('\n', '').strip() for da1 in day_a1])

print(day_b1)

Sometimes the output is correct:

['Нд11 вер', 'Пн12', 'Вт13', 'Ср14', 'Чт15', 'Пт16', 'Сб17'] 

And sometimes it looks like this:


['Ð\x9dд11 веÑ\x80', 'Ð\x9fн12', 'Ð\x92Ñ\x8213', 'СÑ\x8014', 'ЧÑ\x8215', 'Ð\x9fÑ\x8216', 'Сб17']
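Incidentally, that garble is a recognizable pattern: UTF-8 bytes that were decoded with a single-byte encoding such as Latin-1 (mojibake). A minimal sketch of how the damage happens, and how it can be reversed when the text survived intact:

```python
# 'Н' (U+041D) is two bytes in UTF-8 (D0 9D); decoding those bytes as
# Latin-1 turns them into the two characters 'Ð' and '\x9d', the same
# 'Ð\x9d' prefix seen in the broken output above.
text = 'Нд'
garbled = text.encode('utf-8').decode('latin-1')
assert garbled == 'Ð\x9dÐ\xb4'

# Because Latin-1 maps every byte to a character, the damage is reversible:
restored = garbled.encode('latin-1').decode('utf-8')
assert restored == 'Нд'
```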

Solution:

I don't really know why requests sometimes fails to use the right encoding (I got it right on the first run and wrong afterwards), but the likely culprit is that requests guesses the encoding from the Content-Type header and falls back to ISO-8859-1 when the server omits the charset. You can set it manually before accessing the text:

req1.encoding = 'utf8'
data1 = BeautifulSoup(req1.text, 'html.parser')

and this reliably gives:

['Нд11 вер', 'Пн12', 'Вт13', 'Ср14', 'Чт15', 'Пт16', 'Сб17']
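An alternative that avoids guessing at all: hand BeautifulSoup the raw bytes (`req1.content`) and let it detect the encoding itself from the `<meta charset>` declaration, or set `req1.encoding = req1.apparent_encoding` so requests sniffs the body instead of trusting the header. A minimal offline sketch of the bytes-based approach, where the HTML snippet stands in for the real response body:

```python
from bs4 import BeautifulSoup

# Simulated response body: UTF-8 bytes, as requests exposes via resp.content.
raw = ('<html><head><meta charset="utf-8"></head>'
       '<body><div class="widget-row">Нд11 вер</div></body></html>'
       ).encode('utf-8')

# Passing bytes (not text) lets BeautifulSoup run its own encoding
# detection instead of relying on requests' header-based guess.
soup = BeautifulSoup(raw, 'html.parser')
print(soup.div.text)  # Нд11 вер
```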