I am trying to find a regex that treats text such as ‘Oliver R. Smoot’ as part of one sentence instead of splitting it into several. Here is the code I am currently running:
import re

text = 'It is named after Oliver R. Smoot, a fraternity pledge to Lambda Chi Alpha , who in October 1958 lay on the Harvard Bridge (between Boston and Cambridge , Massachusetts ), and was used by his fraternity brothers to measure the length of the bridge.'
# Split on a space that follows sentence-ending punctuation and precedes a capital letter
sent = re.split(r'(?<=[^A-Z].[.?!]) (?=[A-Z])', text)
print(sent)
The output is a list that treats the text as two sentences:
['It is named after Oliver R.', 'Smoot, a fraternity pledge to Lambda Chi Alpha , who in October 1958 lay on the Harvard Bridge (between Boston and Cambridge , Massachusetts ), and was used by his fraternity brothers to measure the length of the bridge.']
The regex works well in nearly every other case; it fails only on cases like the one above.
>Solution :
Unfortunately, this cannot be done reliably with a regular expression alone: any sentence can start with a capitalized word such as (in your example) "Smoot", so a regex cannot tell whether "R." is an initial or the end of a sentence. Writing a regular expression that handles every case where the text following the period is a name is therefore not feasible.
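To make the ambiguity concrete, here is a minimal sketch of a regex heuristic that refuses to split after a single capital letter followed by a period. The function name split_sentences and the pattern are my own illustration, not something from your code or from any library. It handles the "Oliver R. Smoot" case but then fails on a sentence that really does end in a single capital letter:

```python
import re

# Hypothetical heuristic (an assumption for illustration only):
# split on a space that follows '.', '?' or '!' and precedes a capital
# letter, unless the punctuation belongs to a one-letter initial like "R."
_SPLIT = re.compile(r'(?<=[.?!])(?<!\b[A-Z]\.) (?=[A-Z])')

def split_sentences(text):
    """Split text into sentences using the heuristic above."""
    return _SPLIT.split(text)

# The initial "R." is no longer treated as a sentence end:
print(split_sentences('It is named after Oliver R. Smoot, a pledge. He lay on the bridge.'))
# ['It is named after Oliver R. Smoot, a pledge.', 'He lay on the bridge.']

# ...but a real sentence boundary after a single capital is now missed:
print(split_sentences('She got an A. Smoot was proud.'))
# ['She got an A. Smoot was proud.']
```

Both inputs look identical to the pattern ("capital letter, period, space, capitalized word"), so no purely local rule can get both right.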
I strongly suggest using a sentence tokenizer instead, such as nltk.sent_tokenize. Example (you may need to install ‘nltk’ first, e.g. in Colab: !pip install nltk):
import nltk
nltk.download('punkt')  # Punkt sentence-tokenizer data; newer NLTK releases may ask for 'punkt_tab' instead
text = 'Hello World. It is named after Oliver R. Smoot, a fraternity pledge to Lambda Chi Alpha , who in October 1958 lay on the Harvard Bridge (between Boston and Cambridge , Massachusetts ), and was used by his fraternity brothers to measure the length of the bridge.'
nltk.sent_tokenize(text)
outputs
['Hello World.',
'It is named after Oliver R. Smoot, a fraternity pledge to Lambda Chi Alpha , who in October 1958 lay on the Harvard Bridge (between Boston and Cambridge , Massachusetts ), and was used by his fraternity brothers to measure the length of the bridge.']
which is what you want.