
pandas.read_csv(): "Error tokenizing data" because of a comma in the data

I am having trouble reading in a csv that contains a comma within a row value.

An example row including the data causing the issue (afaik) is as follows:

['true',47,'y','descriptive_evidence','n','true',66,[81,65]]


I think that the [81,65] entry is being parsed literally and thus treated as two entries (81 and 65). Is there any way to override this in pandas, or do I have to manually replace the comma before reading into a DataFrame?

From reading other answers, I am aware that bad rows can be skipped using something like error_bad_lines=False (on_bad_lines='skip' in newer pandas versions), but in this case I can't afford to skip these entries.

Best Wishes 🙂

> Solution:

You could pass sep a regular expression. Note that this forces the Python parser engine rather than the C engine, which can be slower and more memory-hungry. Here is the solution if you would like to go with it:

file_name.csv:

    1,2,3,4,5,6,7,8
    'true',47,'y','descriptive_evidence','n','true',66,[81,65]

    pd.read_csv("./file_name.csv", sep=r",(?![^[]*\])", engine="python")
|     | 1      | 2   | 3   | 4                      | 5   | 6      | 7   | 8       |
| --- | ------ | --- | --- | ---------------------- | --- | ------ | --- | ------- |
| 0   | 'true' | 47  | 'y' | 'descriptive_evidence' | 'n' | 'true' | 66  | [81,65] |
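The lookahead in that separator can be sanity-checked without pandas, using only Python's re module. The sketch below splits the example row from the question with the same pattern, and also shows one possible follow-up step (not part of the original answer): read_csv will leave [81,65] as a string, so ast.literal_eval can turn it into an actual Python list.

```python
import re
import ast

# Split on commas NOT followed by "]" without an intervening "[",
# i.e. commas that are not inside a bracketed list.
pattern = r",(?![^[]*\])"

row = "'true',47,'y','descriptive_evidence','n','true',66,[81,65]"
fields = re.split(pattern, row)
print(fields)  # 8 fields; the bracketed entry survives as "[81,65]"

# Optional follow-up: convert the surviving list-like string into a
# real Python list (e.g. via df["8"].apply(ast.literal_eval)).
parsed = ast.literal_eval(fields[-1])
print(parsed)  # [81, 65]
```

The comma inside [81,65] is kept because the negative lookahead sees a closing bracket ahead with no opening bracket in between; every other comma has an intervening "[" before the next "]", so the lookahead fails and the split happens.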
