Pandas read_csv – Error tokenizing data after modifying Excel .csv file

The problem: I have a .csv dataset for an ML classifier. It has 2 columns and prints like this:

enter image description here

The dataset is very dirty, so I decided to open it in Excel, remove the "dirty" words, save it as a .csv file, and train my ML classifier on the cleaned dataset. After saving it in Excel (I tried both the "," separator option and the ", UTF-8" option), read_csv fails with: Error tokenizing data. C error: Expected 3 fields in line 4, saw 5

Then I tried sep=';' in read_csv. That worked, but now all Russian characters are replaced with strange symbols:
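The tokenizing error is what you would expect if Excel wrote the file with ';' between fields, so that splitting on ',' cuts text cells at the commas inside them and yields a varying number of fields per row. A minimal sketch for checking which separator the file actually uses, assuming the first line is a header with no separator characters inside its cells (the helper name guess_delimiter is hypothetical, not part of pandas):

```python
def guess_delimiter(path, candidates=(b",", b";", b"\t")):
    """Return the candidate separator that occurs most often in the first line.

    The file is read as raw bytes, so the check works before the text
    encoding is known. This is a heuristic and not a full CSV parser.
    """
    with open(path, "rb") as f:
        first_line = f.readline()
    # On a tie, max() returns the first candidate in the tuple (",").
    return max(candidates, key=first_line.count).decode("ascii")
```

The result can then be passed on, for example read_csv(path, sep=guess_delimiter(path), encoding=...).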

enter image description here

The question: can somebody please explain how to turn these "question mark" symbols back into Russian characters? Passing encoding='UTF-8' raises the error: 'utf-8' codec can't decode byte 0xe6 in position 22: invalid continuation byte.

When I open the first file (the original, not modified in Excel):

enter image description here

When I open the second file (modified in Excel):

enter image description here

Solution:

Try opening the file with either the ptcp154 or the kz1048 encoding, for example by passing encoding='ptcp154' to read_csv. Both seem to work.
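If it is unclear which encoding Excel used, one option is to try a short list of candidates in order. The sketch below is only an illustration: the helper names are hypothetical, and cp1251 is included as an assumption (it is a common Windows encoding for Russian text, but the answer above does not mention it). A decode that succeeds does not prove the encoding is right, because single-byte codecs accept almost any byte, so check the resulting text by eye.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1251", "ptcp154", "kz1048")):
    """Return the first candidate encoding that decodes raw without an error.

    utf-8 goes first because it is strict: bytes written in a single-byte
    Cyrillic encoding will normally fail to decode as utf-8.
    Returns None if no candidate works.
    """
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


def read_csv_guessing_encoding(path, sep=";"):
    """Read a CSV with pandas using the first encoding that decodes the file."""
    import pandas as pd  # imported here so the helper above works without pandas

    with open(path, "rb") as f:
        raw = f.read()
    encoding = first_working_encoding(raw)
    if encoding is None:
        raise ValueError("none of the candidate encodings could decode the file")
    return pd.read_csv(path, sep=sep, encoding=encoding)
```

To force one of the encodings from the answer, pass it directly, for example read_csv(path, sep=';', encoding='ptcp154').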
