
pandas.read_csv(): "Error tokenizing data" because of a comma in the data

I am having trouble reading in a csv that contains a comma within a row value.

An example row including the data causing the issue (afaik) is as follows:

['true',47,'y','descriptive_evidence','n','true',66,[81,65]]


I think that the [81,65] entry is being parsed literally and thus treated as two entries (81 and 65). Is there any way to override this in pandas, or do I have to manually replace the comma before reading into a DataFrame?

From reading other answers, I am aware that bad rows can be skipped using something like error_bad_lines=False (on_bad_lines='skip' in newer pandas versions), but in this case I can't afford to skip these entries.

Best Wishes 🙂

> Solution:

You could pass sep a regular expression. Note that this forces the Python parser engine rather than the C engine, which can be slower and more memory-hungry. Here is the solution if you would like to go with it:

file_name.csv:

    1,2,3,4,5,6,7,8
    'true',47,'y','descriptive_evidence','n','true',66,[81,65]

    pd.read_csv("./file_name.csv", sep=r",(?![^[]*\])", engine="python")
|     | 1      | 2   | 3   | 4                      | 5   | 6      | 7   | 8       |
| --- | ------ | --- | --- | ---------------------- | --- | ------ | --- | ------- |
| 0   | 'true' | 47  | 'y' | 'descriptive_evidence' | 'n' | 'true' | 66  | [81,65] |
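The lookahead in that separator can be sanity-checked without pandas, using only Python's re module. The sketch below splits the example row from the question with the same pattern, and also shows one possible follow-up step (not part of the original answer): read_csv will leave [81,65] as a string, so ast.literal_eval can turn it into an actual Python list.

```python
import re
import ast

# Split on commas NOT followed by "]" without an intervening "[",
# i.e. commas that are not inside a bracketed list.
pattern = r",(?![^[]*\])"

row = "'true',47,'y','descriptive_evidence','n','true',66,[81,65]"
fields = re.split(pattern, row)
print(fields)  # 8 fields; the bracketed entry survives as "[81,65]"

# Optional follow-up: convert the surviving list-like string into a
# real Python list (e.g. via df["8"].apply(ast.literal_eval)).
parsed = ast.literal_eval(fields[-1])
print(parsed)  # [81, 65]
```

The comma inside [81,65] is kept because the negative lookahead sees a closing bracket ahead with no opening bracket in between; every other comma has an intervening "[" before the next "]", so the lookahead fails and the split happens.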
