csv module splits within quotes with custom separator

I’d like the below code to avoid splitting within double quotes, but it does:

import csv
from io import StringIO

contents = """
gene "Tagln2"; note "putative; transgelin 2 (MGD|MGI:1312985 GB|BC049861, evidence: BLASTN, 99%, match=1379)"; product "transgelin-2"; protein_id "NP_848713.1"; tag "RefSeq Select"; exon_number "4";
"""

for l in csv.reader(StringIO(contents), delimiter=";", quotechar='"', skipinitialspace=True, quoting=csv.QUOTE_MINIMAL):
    print(l)

outputs:

['gene "Tagln2"', 'note "putative', 'transgelin 2 (MGD|MGI:1312985 GB|BC049861, evidence: BLASTN, 99%, match=1379)"', 'product "transgelin-2"', 'protein_id "NP_848713.1"', 'tag "RefSeq Select"', 'exon_number "4"', '']

You can see that it splits within the double quotes so that note "putative; transgelin 2" becomes ['note "putative', 'transgelin 2']. How do I fix this?

>Solution :
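
Some background on why the code in the question splits: as far as I can tell, csv.reader only treats the quotechar as a quote when it is the first character of a field (after any skipped initial space). In note "putative; ..." the quote comes after other text, so it is read as an ordinary character and the ; inside it still splits. A minimal sketch with made-up sample strings (the parse helper is hypothetical, only for illustration):

```python
import csv
from io import StringIO


def parse(line):
    # Same reader options as in the question; returns the first (only) row.
    reader = csv.reader(
        StringIO(line), delimiter=";", quotechar='"', skipinitialspace=True
    )
    return next(reader)


# The quote opens the field, so the embedded ';' is protected.
quoted_at_start = parse('"a; b"; c')

# The quote appears after other text, so it is kept literally and ';' splits.
quoted_mid_field = parse('key "a; b"; c')

print(quoted_at_start)   # ['a; b', 'c']
print(quoted_mid_field)  # ['key "a', 'b"', 'c']
```

That is why a different tokenizer is needed for key "value"; input like this.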

This looks like a good use case for shlex:

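import shlex

# posix=True strips the surrounding quotes and keeps quoted text in one token.
lexer = shlex.shlex(contents, posix=True)
lexer.whitespace_split = True  # split only on the characters in lexer.whitespace
lexer.whitespace = ";"         # treat ';' as the only separator

# Trim padding and drop the whitespace-only token left by the trailing newline.
out1 = [e.strip() for e in lexer if e.strip() != ""]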

Or, if you want to preserve the quotes, you can use re.split:

import re

# Split on ';' only when it follows a double quote (optionally with whitespace in between).
out2 = [e.strip() for e in re.split(r'(?<=")\s*;\s*', contents) if e != ""]

Output:

print(out1)

['gene Tagln2',
 'note putative; transgelin 2 (MGD|MGI:1312985 GB|BC049861, evidence: BLASTN, 99%, match=1379)',
 'product transgelin-2',
 'protein_id NP_848713.1',
 'tag RefSeq Select',
 'exon_number 4']
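
If key/value pairs are more convenient than flat strings, each shlex token can be split once more on its first space. This is only a sketch: the shortened contents string and the attributes name are assumptions for illustration, and a plain dict keeps only the last value if a key such as tag appears more than once.

```python
import shlex

# Shortened sample in the same key "value"; format as the question's data.
contents = 'gene "Tagln2"; note "putative; transgelin 2"; exon_number "4";'

lexer = shlex.shlex(contents, posix=True)
lexer.whitespace_split = True
lexer.whitespace = ";"

attributes = {}
for token in lexer:
    token = token.strip()
    if not token:
        continue
    # Everything before the first space is the key, the rest is the value.
    key, _, value = token.partition(" ")
    attributes[key] = value

print(attributes)
```

Note that shlex still applies shell-like rules outside double quotes (for example, # starts a comment and ' opens a quote), so input that uses those characters unquoted may need a different approach.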

print(out2)

['gene "Tagln2"',
 'note "putative; transgelin 2 (MGD|MGI:1312985 GB|BC049861, evidence: BLASTN, 99%, match=1379)"',
 'product "transgelin-2"',
 'protein_id "NP_848713.1"',
 'tag "RefSeq Select"',
 'exon_number "4"']
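
Another option, assuming every attribute has the exact shape key "value" and values never contain a double quote, is to extract the pairs directly with re.findall instead of splitting. The shortened contents string and the pairs name below are illustrative, not part of the original post.

```python
import re

# Shortened sample in the same key "value"; format as the question's data.
contents = 'gene "Tagln2"; note "putative; transgelin 2"; exon_number "4";'

# Capture a word-like key, then everything between the next pair of quotes.
pairs = re.findall(r'(\w+)\s+"([^"]*)"', contents)

print(pairs)
# [('gene', 'Tagln2'), ('note', 'putative; transgelin 2'), ('exon_number', '4')]
```

Keeping a list of tuples rather than a dict preserves repeated keys and their order.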