I need to split an article into sentences at punctuation marks. I am using the following regular expression:
re.split(r'[,|.|?|!]', strContent)
It works, but there is a problem: it also splits abbreviated Latin names that should stay intact (such as G. lucidum), for example in this sentence:
Many studies to date have described the anticancer properties of G. lucidum,
The abbreviation in such a Latin name is a single capital letter followed by a dot and a space.
So I tried to modify the regular expression as follows:
re.split(r'[,|(?:[^A-Z].)|?|!]', strContent)
However, I got the following error:
re.error: unbalanced parenthesis
How can I modify this regular expression?
>Solution :
You should use a negative lookbehind, and put it before the character set that matches the sentence ending.
The negative lookbehind should match a word that’s just a single capital letter. This can be done by matching a word boundary before the letter with \b, so the last letter of an all-caps word such as DNA does not count as an abbreviation.
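You also don’t need | inside the character set. Inside [...] it is a literal character, so your original pattern also splits on |. It only separates alternative patterns outside a character set.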
re.split(r'(?<!\b[A-Z])[,.?!]', strContent)
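As a quick check, here is a minimal sketch of the pattern above applied to a made-up sample string. The names SENTENCE_SPLIT, text and parts, and the sample text itself, are only for illustration and are not from your code. It also shows that | inside a character set is treated as a literal character.

```python
import re

# Negative lookbehind: skip punctuation that directly follows a
# standalone capital letter (e.g. the "G" in "G. lucidum").
SENTENCE_SPLIT = re.compile(r'(?<!\b[A-Z])[,.?!]')

# Hypothetical sample text, made up for this example.
text = "Many studies have described G. lucidum, a fungus. Is it safe? Yes!"

# Split, then drop surrounding whitespace and empty trailing pieces.
parts = [p.strip() for p in SENTENCE_SPLIT.split(text) if p.strip()]
print(parts)
# ['Many studies have described G. lucidum', 'a fungus', 'Is it safe', 'Yes']

# Inside [...] the pipe is a literal character, so the original
# pattern also splits on "|"; the cleaned-up set does not.
print(re.split(r'[,|.|?|!]', 'a|b'))  # ['a', 'b']
print(re.split(r'[,.?!]', 'a|b'))     # ['a|b']
```

One caveat: a sentence that really does end in a single capital letter (for example "Take vitamin C.") will not be split at that dot either, so you may need extra rules if your text contains such cases.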