Pandas read_csv – Error tokenizing data after modifying Excel .csv file

November 20, 2021

The problem: I have a .csv dataset for an ML classifier. It has 2 columns and prints like this:

But this dataset is very dirty, so I decided to open it in Excel, remove the "dirty" words, save it as a .csv file, and train my ML classifier on the cleaned dataset. After I saved it in Excel (I tried both the "," separator option and the ", UTF-8" option), read_csv fails with the error Error tokenizing data. C error: Expected 3 fields in line 4, saw 5.
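A tokenizing error like this usually means the parser is splitting on a different delimiter than the one the file actually uses, so rows that contain commas inside the text end up with a varying number of fields. One way to check is to try a few candidate delimiters and see which one gives the same field count on every row. The sketch below uses only the standard library; the helper name guess_delimiter and the sample rows are made up for illustration and are not taken from the real dataset.

```python
import csv
import io


def guess_delimiter(sample, candidates=(",", ";", "\t")):
    """Return the first candidate that splits every row into the same
    number of fields (more than one), or None if none qualifies."""
    for delim in candidates:
        reader = csv.reader(io.StringIO(sample), delimiter=delim)
        rows = [row for row in reader if row]  # skip empty lines
        widths = {len(row) for row in rows}
        if len(widths) == 1 and widths.pop() > 1:
            return delim
    return None


# Hypothetical sample: two columns separated by ';', with commas in the text.
sample = "text;label\nhello, world;1\nfoo, bar, baz;0\n"
delimiter = guess_delimiter(sample)
print(repr(delimiter))
```

If the check points to a semicolon, the same character can be passed to read_csv through its sep parameter.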

Then I tried sep=';' in read_csv. That worked, but now all Russian characters are replaced with strange symbols:

The question: can somebody please explain how to restore the Russian characters that now show up as "question mark" symbols? Passing encoding='UTF-8' gives the error 'utf-8' codec can't decode byte 0xe6 in position 22: invalid continuation byte.
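The UTF-8 decode error suggests the file was saved in a single-byte legacy encoding. On a Windows machine with a Russian locale that is often Windows-1251 (cp1251), where byte 0xe6 is the letter "ж", but that is an assumption about this particular machine and not something the error message confirms. A minimal sketch for testing the idea: try strict UTF-8 first, then fall back to cp1251. The helper name first_working_encoding and the sample bytes are hypothetical.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1251")):
    """Return the first encoding that decodes `raw` without an error.

    UTF-8 goes first because it is strict; cp1251 accepts almost any
    byte sequence, so it is only useful as a fallback."""
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None


# Hypothetical bytes standing in for a file saved as cp1251.
raw = "текст;метка\nхороший пример;1\n".encode("cp1251")
encoding = first_working_encoding(raw)
text = raw.decode(encoding)
print(encoding)
print(text)
```

With a real file, the bytes would come from open(path, "rb").read(), where path is a placeholder, and the detected name can then be passed to read_csv together with the separator, for example pd.read_csv(path, sep=";", encoding="cp1251").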

When I open the first .csv file (the one not modified in Excel):