The problem: I have a .csv dataset for an ML classifier. It has 2 columns and prints like this:
But this dataset is very dirty, so I decided to open it in Excel, remove the "dirty" words, save it as a .csv file, and train my ML classifier on the cleaned dataset. But after I saved it in Excel (using the "," separator; I also tried the ", UTF-8" option), read_csv fails with the error: Error tokenizing data. C error: Expected 3 fields in line 4, saw 5
Then I tried sep=';' in read_csv. That worked, but now all Russian characters are replaced with strange symbols:
The question: can somebody please explain how to recover the Russian characters from these "question mark" symbols? Passing encoding='UTF-8' gives the error: 'utf-8' codec can't decode byte 0xe6 in position 22: invalid continuation byte.
When I open the first file (the .csv not modified in Excel):
When I open the second file (modified in Excel):
>Solution :
Try opening the file with either the ptcp154 or the kz1048 encoding, passed as the encoding argument of read_csv. Both seem to work.
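
If it is unclear which encoding a file uses, one option is to try a short list of candidates on the raw bytes and keep those that decode into mostly Cyrillic text. The following is only a minimal sketch of that idea, not a definitive detector. The names working_encodings and is_cyrillic and the sample bytes are hypothetical. ptcp154 and kz1048 come from the answer above, while cp1251 is an added assumption (it is the usual Windows code page for Russian text).

```python
# Candidate encodings: ptcp154 and kz1048 are from the answer above;
# cp1251 is an assumption added for illustration.
CANDIDATES = ["utf-8", "ptcp154", "kz1048", "cp1251"]


def is_cyrillic(ch):
    """True if the character lies in the basic Cyrillic Unicode block."""
    return "\u0400" <= ch <= "\u04ff"


def working_encodings(raw, candidates=CANDIDATES):
    """Return the candidates that decode `raw` into mostly Cyrillic letters."""
    good = []
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            # This codec cannot decode the bytes at all, so skip it.
            continue
        letters = [c for c in text if c.isalpha()]
        if not letters:
            continue
        share = sum(is_cyrillic(c) for c in letters) / len(letters)
        if share > 0.5:
            good.append(enc)
    return good


# Hypothetical sample standing in for two rows of the Excel-saved file.
sample = "текст;метка\nхороший пример;1\n".encode("cp1251")
print(working_encodings(sample))
```

For a real file, the bytes could be read with open(path, "rb").read(), and one of the returned names then passed to pandas, for example pd.read_csv(path, sep=';', encoding='ptcp154'). Several single-byte Cyrillic code pages share the same byte values for the basic Russian letters, which is probably why more than one encoding appears to work on the same file.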



