Python: Parsing data containing both types of quotation as well as special characters

July 8, 2022

Hi all, I am working on a project where I need to parse some data containing both double (") and single (') quotation marks as well as special characters. While the data is confidential and therefore cannot be posted here, the text below replicates the issue.

"""
Brian: \n\t"I am not the messiah" \nArthur:\n\t "I say you are Lord and I should know I've followed a few"
"""

The end goal is to get the text in the form:

['Brian:', '"I am not the messiah"', 'Arthur:', '"I say you are Lord and I should know I\'ve followed a few"']

That is to say, all newline and tab characters should be removed, and the text should be split on newlines (it is read from a file, so .readlines() takes care of that) and on spaces, except for spaces inside double (") quotation marks.

The code

import shlex as sh
# `line` is one line returned by .readlines(); str.removesuffix() needs Python 3.9+
line_info = sh.split(line.removesuffix("\n").replace("\t", " "))

comes close to success but fails to retain the quotation marks (I don't need the quotation marks themselves, but I do need an indication that the text was quoted for further processing).

>Solution :

I think the problem lies in the shlex module, which strips too much: by default shlex.split() runs in POSIX mode and removes the quotation marks from quoted tokens. With the str.strip() method, everything seems to work as expected:

import io

text = """
Brian: \n\t"I am not the messiah" \nArthur:\n\t "I say you are Lord and I should know I've followed a few"
"""

result = [line.strip() for line in io.StringIO(text).readlines() if line.strip()]

expectation = [
    'Brian:',
    '"I am not the messiah"',
    'Arthur:',
    '"I say you are Lord and I should know I\'ve followed a few"'
]


assert result == expectation

Here I wrap the string in io.StringIO to fake a file object, so that the readlines method can be applied as in your code. After that, only some stripping is needed. FWIW: the code could be optimized, since each line is stripped twice (once in the filter and once for the result), but I wanted to stay as close as possible to the code you mentioned.
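If you would rather stay with shlex, one option is its non-POSIX mode (posix=False), which leaves the surrounding quotation marks on quoted tokens instead of removing them. The following is only a minimal sketch under some assumptions: the helper name tokenize_lines and the (token, was_quoted) tuple layout are made up for illustration, and quoted passages are assumed never to span several lines. It also strips each line only once.

```python
import io
import shlex


def tokenize_lines(lines):
    """Yield (token, was_quoted) pairs for every non-blank line.

    Hypothetical helper for illustration; it is not part of shlex.
    """
    for line in lines:
        stripped = line.strip()  # strip once; removes tabs, spaces and the newline
        if not stripped:
            continue  # skip blank lines
        # posix=False keeps the surrounding quotes on quoted tokens
        for token in shlex.split(stripped, posix=False):
            was_quoted = len(token) >= 2 and token[0] == token[-1] == '"'
            yield token, was_quoted


text = """
Brian: \n\t"I am not the messiah" \nArthur:\n\t "I say you are Lord and I should know I've followed a few"
"""

tokens = list(tokenize_lines(io.StringIO(text)))
```

The boolean flag gives the "was this quoted?" indication asked for, so the quotation marks can be dropped later with token.strip('"') if they are not needed. One caveat: an unquoted token that starts with a lone ' or " opens a quoted section, and shlex raises ValueError if it is never closed, so real data may need extra handling.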