I'd like the code below to avoid splitting inside double quotes, but it splits there anyway:
import csv
from io import StringIO
contents = """
gene "Tagln2"; note "putative; transgelin 2 (MGD|MGI:1312985 GB|BC049861, evidence: BLASTN, 99%, match=1379)"; product "transgelin-2"; protein_id "NP_848713.1"; tag "RefSeq Select"; exon_number "4";
"""
for l in csv.reader(StringIO(contents), delimiter=";", quotechar='"', skipinitialspace=True, quoting=csv.QUOTE_MINIMAL):
    print(l)
This outputs:
['gene "Tagln2"', 'note "putative', 'transgelin 2 (MGD|MGI:1312985 GB|BC049861, evidence: BLASTN, 99%, match=1379)"', 'product "transgelin-2"', 'protein_id "NP_848713.1"', 'tag "RefSeq Select"', 'exon_number "4"', '']
You can see that it splits within the double quotes so that note "putative; transgelin 2" becomes ['note "putative', 'transgelin 2']. How do I fix this?
>Solution :
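As background on why the original attempt fails: csv.reader only treats the quote character as special when it opens a field. In your data the quote comes after a key such as note, so it is read as an ordinary character and the ; inside it still splits. A minimal sketch of that behaviour, where split_fields and the two sample strings are made up for illustration:

```python
import csv
from io import StringIO

# Quote opens the field: the embedded ';' is protected.
quoted_at_start = '"a; b";c'
# Quote appears mid-field (after 'note '): it is treated as a literal character.
quoted_mid_field = 'note "a; b";c'

def split_fields(text):
    """Return the first row csv.reader produces for text, split on ';'."""
    return next(csv.reader(StringIO(text), delimiter=";", quotechar='"'))

print(split_fields(quoted_at_start))   # ['a; b', 'c']
print(split_fields(quoted_mid_field))  # ['note "a', ' b"', 'c']
```

The second call reproduces the split from the question in miniature, which is why a different tokenizer is needed here.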
This seems like a good use case for shlex:
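import shlex
# posix=True lets quotes be recognized inside a token and removes them from the result
lexer = shlex.shlex(contents, posix=True)
lexer.whitespace_split = True  # split only on the characters in lexer.whitespace
lexer.whitespace = ";"  # treat ';' as the only separator
out1 = [e.strip() for e in lexer if e.strip()]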
Or, if you want to preserve the quotes, you can use re.split:
import re
# split on ';' only when it follows a double quote (optional whitespace allowed in between)
out2 = [e.strip() for e in re.split(r'(?<=")\s*;\s*', contents) if e != ""]
Output:
print(out1)
['gene Tagln2',
'note putative; transgelin 2 (MGD|MGI:1312985 GB|BC049861, evidence: BLASTN, 99%, match=1379)',
'product transgelin-2',
'protein_id NP_848713.1',
'tag RefSeq Select',
'exon_number 4']
print(out2)
['gene "Tagln2"',
'note "putative; transgelin 2 (MGD|MGI:1312985 GB|BC049861, evidence: BLASTN, 99%, match=1379)"',
'product "transgelin-2"',
'protein_id "NP_848713.1"',
'tag "RefSeq Select"',
'exon_number "4"']
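If what you eventually want is key/value pairs rather than a list of strings, the same idea can be pushed one step further. This is only a sketch: parse_attributes and ATTR_RE are hypothetical names, and it assumes every attribute has the form key "value"; with no escaped quotes inside the value. Repeated keys (for example several tag entries) would overwrite each other in a plain dict.

```python
import re

# Hypothetical helper, not part of the answer above.
# Assumes each attribute looks like: key "value";  (no escaped quotes in value)
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"\s*;')

def parse_attributes(text):
    """Map each attribute key to its unquoted value."""
    # findall returns (key, value) tuples because the pattern has two groups
    return dict(ATTR_RE.findall(text))

sample = 'gene "Tagln2"; note "putative; transgelin 2"; exon_number "4";'
attrs = parse_attributes(sample)
print(attrs["note"])  # putative; transgelin 2
```

Because the value group is [^"]*, a ; inside the quotes is kept as part of the value instead of ending the field.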