Python: Parsing data containing both types of quotation as well as special characters

Hi all, I am working on a project where I need to parse some data containing both double (") and single (') quotation marks as well as special characters. The data is confidential and cannot be posted here, but the text below replicates the issue.

"""
Brian: \n\t"I am not the messiah" \nArthur:\n\t "I say you are Lord and I should know I've followed a few"
"""

The end goal is to get the text in the form:

['Brian:', '"I am not the messiah"', 'Arthur:', '"I say you are Lord and I should know I\'ve followed a few"']

That is, all newline and tab characters should be removed, and the text should be split on newlines (the data is read from a file, so .readlines() takes care of that) and on spaces, except for spaces inside double (") quotation marks.

The code

import shlex as sh
line_info = sh.split(line.removesuffix("\n").replace("\t", " "))

comes close to success but fails to retain the quotation marks. (I don't need the quotation marks themselves, but I do need an indication that the text was quoted for further processing.)
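For reference, the behaviour described above can be reproduced in a few lines. This is only a minimal sketch using a hypothetical sample line modelled on the text above: it shows that shlex.split() in its default POSIX mode drops the quotation marks, while passing posix=False keeps them as part of the token.

```python
import shlex

# Hypothetical sample line, modelled on the replicated text above.
line = '\t "I say you are Lord and I should know I\'ve followed a few"\n'
cleaned = line.removesuffix("\n").replace("\t", " ")

# Default POSIX mode: the surrounding double quotes are consumed.
posix_tokens = shlex.split(cleaned)

# Non-POSIX mode: the double quotes stay on the token.
kept_tokens = shlex.split(cleaned, posix=False)

print(posix_tokens)
print(kept_tokens)
```

In non-POSIX mode the quotation marks remain on the token, so a later step could check token.startswith('"') to tell quoted text from unquoted text.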

>Solution :

I think the problem lies in the shlex module, which strips too much: by default, shlex.split() works in POSIX mode and removes the quotation marks. With the str.strip() method, everything seems to work as expected:

import io

text = """
Brian: \n\t"I am not the messiah" \nArthur:\n\t "I say you are Lord and I should know I've followed a few"
"""

result = [line.strip() for line in io.StringIO(text).readlines() if line.strip()]

expectation = [
    'Brian:',
    '"I am not the messiah"',
    'Arthur:',
    '"I say you are Lord and I should know I\'ve followed a few"'
]


assert result == expectation

Here I wrap the string in a file-like object (io.StringIO) so that I can call readlines() on it as you did. After that, only some stripping is needed. FWIW: the code could be optimized, since each line is stripped twice, but I wanted to stay as close as possible to the code you mentioned.
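The double strip can be avoided, and the "was it quoted?" information asked for in the question can be kept as a flag instead of relying on the quotation marks themselves. The following is only a sketch under those assumptions: the helper name parse_lines and the (text, was_quoted) tuple layout are made up for illustration, and it assumes, as in the sample, that a quoted passage occupies a whole line.

```python
import io


def parse_lines(lines):
    """Return (text, was_quoted) pairs for every non-empty line.

    Hypothetical helper: the name and the tuple layout are illustrative.
    """
    parsed = []
    for raw in lines:
        stripped = raw.strip()  # strip once per line
        if not stripped:
            continue  # skip blank lines
        # Treat a line as quoted only if it is wrapped in double quotes.
        was_quoted = len(stripped) >= 2 and stripped[0] == stripped[-1] == '"'
        text = stripped[1:-1] if was_quoted else stripped
        parsed.append((text, was_quoted))
    return parsed


text = """
Brian: \n\t"I am not the messiah" \nArthur:\n\t "I say you are Lord and I should know I've followed a few"
"""

# A StringIO object can be iterated line by line, like an open file.
parsed = parse_lines(io.StringIO(text))
print(parsed)
```

Each line is stripped only once here, and the apostrophe in "I've" needs no special handling because only the outer double quotes are inspected.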
