csv module splits within quotes with custom separator

May 1, 2023

I’d like the code below to avoid splitting inside double quotes, but it splits there anyway:

import csv
from io import StringIO

contents = """
gene "Tagln2"; note "putative; transgelin 2 (MGD|MGI:1312985 GB|BC049861, evidence: BLASTN, 99%, match=1379)"; product "transgelin-2"; protein_id "NP_848713.1"; tag "RefSeq Select"; exon_number "4";
"""

for l in csv.reader(StringIO(contents), delimiter=";", quotechar='"', skipinitialspace=True, quoting=csv.QUOTE_MINIMAL):
    print(l)

outputs:

['gene "Tagln2"', 'note "putative', 'transgelin 2 (MGD|MGI:1312985 GB|BC049861, evidence: BLASTN, 99%, match=1379)"', 'product "transgelin-2"', 'protein_id "NP_848713.1"', 'tag "RefSeq Select"', 'exon_number "4"', '']

You can see that it splits within the double quotes so that note "putative; transgelin 2" becomes ['note "putative', 'transgelin 2']. How do I fix this?

>Solution :

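The csv module only treats a quote character as quoting when it is the first character of a field. Here every quote comes after a key (gene, note, ...), so it is read as a literal character and each ; still splits. This looks like a good use case for shlex, which honors quotes anywhere in a token: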

import shlex

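# posix=True removes the quote characters from the returned tokens
lexer = shlex.shlex(contents, posix=True)
# build tokens from every character that is not a separator
lexer.whitespace_split = True
# treat only ";" as a separator, so spaces and newlines stay inside tokens
lexer.whitespace = ";"

# strip surrounding whitespace and drop the whitespace-only token left after the final ";"
out1 = [e.strip() for e in lexer if e.strip() != ""]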

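Or, if you want to preserve the quotes, you can use re.split and split only on a ; that is preceded by a double quote. This relies on every value being quoted and on no quoted value starting with ; :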

import re

out2 = [e.strip() for e in re.split(r'(?<=")\s*;\s*', contents) if e != ""]

Output:

print(out1)

['gene Tagln2',
 'note putative; transgelin 2 (MGD|MGI:1312985 GB|BC049861, evidence: BLASTN, 99%, match=1379)',
 'product transgelin-2',
 'protein_id NP_848713.1',
 'tag RefSeq Select',
 'exon_number 4']
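
If key/value pairs are more convenient than flat strings, each shlex token could be split once on its first space. The following is only a sketch: parse_attributes is a hypothetical helper name, the sample string is a shortened version of the one in the question, and the commenters/quotes settings are an extra precaution that the original snippet does not use.

```python
import shlex

# Shortened sample in the same "key "value";" layout as the question.
attributes = 'gene "Tagln2"; note "putative; transgelin 2"; exon_number "4";'


def parse_attributes(text):
    """Hypothetical helper: split on ';' outside quotes, then on the first space."""
    lexer = shlex.shlex(text, posix=True)
    lexer.whitespace_split = True
    lexer.whitespace = ";"
    # Precaution (assumption): keep '#' and apostrophes outside double
    # quotes as literal text instead of comment/quote characters.
    lexer.commenters = ""
    lexer.quotes = '"'

    pairs = {}
    for token in lexer:
        token = token.strip()
        if not token:
            continue  # skip whitespace-only tokens
        # Everything before the first space is the key, the rest is the value.
        key, _, value = token.partition(" ")
        pairs[key] = value
    return pairs


parsed = parse_attributes(attributes)
print(parsed)
```

Note that a dict keeps only the last value for a repeated key, so if a key such as tag can occur several times in one line, a list of (key, value) tuples would be the safer container.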

print(out2)

['gene "Tagln2"',
 'note "putative; transgelin 2 (MGD|MGI:1312985 GB|BC049861, evidence: BLASTN, 99%, match=1379)"',
 'product "transgelin-2"',
 'protein_id "NP_848713.1"',
 'tag "RefSeq Select"',
 'exon_number "4"']
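
To see the csv behaviour from the question in isolation, here is a minimal sketch with two made-up lines (not taken from the question). It compares a quote at the very start of a field with a quote that appears after a key, using the same delimiter and skipinitialspace settings:

```python
import csv
from io import StringIO


def first_row(line):
    # Parse a single line with ";" as delimiter, as in the question.
    return next(csv.reader(StringIO(line), delimiter=";", skipinitialspace=True))


# Quote is the first character of the field: csv treats the field as quoted.
at_start = first_row('"a; b"; c')

# Quote comes after other characters: csv keeps it as a literal character,
# so the ";" inside the quotes still ends the field.
mid_field = first_row('key "a; b"; c')

print(at_start)   # ['a; b', 'c']
print(mid_field)  # ['key "a', 'b"', 'c']
```

This is why no combination of quotechar and quoting options helps here: the key "value" layout never puts the quote at the start of a field.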