Is there a regex command that can handle names with uppercase followed by . and then uppercase name again?

February 15, 2022

I am trying to find a regex that treats text such as ‘Oliver R. Smoot’ as part of one sentence instead of splitting it into several. Here is the code I am currently running:

import re

text = 'It is named after Oliver R. Smoot, a fraternity pledge to Lambda Chi Alpha , who in October 1958 lay on the Harvard Bridge (between Boston and Cambridge , Massachusetts ), and was used by his fraternity brothers to measure the length of the bridge.'
# Split at a space that follows ., ? or ! and precedes a capital letter
sent = re.split(r'(?<=[^A-Z].[.?!]) (?=[A-Z])', text)
print(sent)

The output is a list that treats the text as two sentences:

['It is named after Oliver R.', 'Smoot, a fraternity pledge to Lambda Chi Alpha , who in October 1958 lay on the Harvard Bridge (between Boston and Cambridge , Massachusetts ), and was used by his fraternity brothers to measure the length of the bridge.']

The regex works well on nearly every other case; it only fails on names with an initial, like the one above.

Solution:

Unfortunately, a regular expression cannot do this reliably: a genuine sentence can just as well start with (in your particular example) "Smoot", so the pattern has no way to tell whether the period after "R" closes an initial or ends a sentence. Writing a regular expression that handles all cases where the part following the dot is a name is therefore not feasible.
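To make the limitation concrete, here is a minimal sketch of the obvious regex workaround: refuse to split when the period closes a single capital letter. The helper name `split_sentences` and both sample strings are made up for illustration. It keeps "Oliver R. Smoot" together, but it also glues two real sentences together whenever the first one ends in a single capital letter.

```python
import re

# Heuristic (an assumption for illustration, not a general solution):
# split at a space that follows ., ? or ! and precedes a capital letter,
# unless the period closes a single-letter initial such as "R.".
SPLIT = re.compile(r'(?<!\b[A-Z]\.)(?<=[.?!]) (?=[A-Z])')


def split_sentences(text):
    """Split text into sentences using the heuristic pattern above."""
    return SPLIT.split(text)


# Hypothetical sample: the initial is no longer treated as a sentence end.
name_case = 'It is named after Oliver R. Smoot, a fraternity pledge. He lay on the bridge.'
print(split_sentences(name_case))

# Hypothetical sample: two real sentences, but the first ends in "C.",
# so the heuristic wrongly returns them as one piece.
ambiguous = 'The sample was rich in vitamin C. Smoot measured it twice.'
print(split_sentences(ambiguous))
```

The second call returns a single-element list, which is exactly the ambiguity described above: the text around the period looks the same in both cases, so no pattern can separate them without knowing more about the language.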

I strongly suggest using a sentence tokenizer instead, such as nltk.sent_tokenize. Example (you may need to install ‘nltk’ first, e.g. in Colab: !pip install nltk):

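import nltk

# One-time download of the Punkt sentence tokenizer data
# (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

text = 'Hello World. It is named after Oliver R. Smoot, a fraternity pledge to Lambda Chi Alpha , who in October 1958 lay on the Harvard Bridge (between Boston and Cambridge , Massachusetts ), and was used by his fraternity brothers to measure the length of the bridge.'
# In a notebook the result is displayed directly; in a script, wrap this in print(...)
nltk.sent_tokenize(text)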

outputs

['Hello World.',
'It is named after Oliver R. Smoot, a fraternity pledge to Lambda Chi Alpha , who in October 1958 lay on the Harvard Bridge (between Boston and Cambridge , Massachusetts ), and was used by his fraternity brothers to measure the length of the bridge.']

which is what you want.