Is there a regex that can handle names made of an uppercase initial followed by a period and then another capitalized name?

I am trying to find a regex that treats text such as ‘Oliver R. Smoot’ as part of one sentence rather than splitting it into several. Here is the code I am currently running:

import re

text = 'It is named after Oliver R. Smoot, a fraternity pledge to Lambda Chi Alpha , who in October 1958 lay on the Harvard Bridge (between Boston and Cambridge , Massachusetts ), and was used by his fraternity brothers to measure the length of the bridge.'
# Split on a space that follows sentence-ending punctuation and precedes a capital letter.
sent = re.split(r'(?<=[^A-Z].[.?!]) (?=[A-Z])', text)
print(sent)

The output is a list that treats the text as two sentences:

['It is named after Oliver R.', 'Smoot, a fraternity pledge to Lambda Chi Alpha , who in October 1958 lay on the Harvard Bridge (between Boston and Cambridge , Massachusetts ), and was used by his fraternity brothers to measure the length of the bridge.']

The regex works well in pretty much every other case; only names like the one above break it.

Solution:

Unfortunately, this is not reliably possible with regular expressions alone, because a sentence can legitimately start with any capitalized word (in your example, "Smoot"). A pattern cannot tell whether the capitalized word after the period continues a name or begins a new sentence, so writing a regex that handles every such case is not feasible.
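
To make the ambiguity concrete, here is a minimal sketch of a regex-only heuristic. It is an assumption for illustration, not a recommended fix: the pattern and the helper name naive_split are hypothetical, and the rule is simply "do not split after a single capital letter followed by a period". It keeps the name together, but it also refuses to split when a sentence really does end in a single capital letter.

```python
import re

# Hypothetical heuristic (not part of the original question or answer):
# split on a space that follows ., ? or ! and precedes a capital letter,
# unless the period belongs to a single-capital initial such as "R.".
SPLIT = re.compile(r'(?<!\b[A-Z]\.)(?<=[.?!]) (?=[A-Z])')

def naive_split(text):
    """Split text into sentences using the heuristic above."""
    return SPLIT.split(text)

name_case = 'It is named after Oliver R. Smoot, a fraternity pledge. He lay on the bridge.'
ambiguous = 'We chose plan B. Smoot agreed.'

print(naive_split(name_case))
# ['It is named after Oliver R. Smoot, a fraternity pledge.', 'He lay on the bridge.']

print(naive_split(ambiguous))
# ['We chose plan B. Smoot agreed.']  (two sentences wrongly kept as one)
```

The second result shows the trade-off: the same rule that protects "Oliver R. Smoot" merges two real sentences, and the pattern has no way of knowing which reading is intended.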

I strongly suggest using a sentence tokenizer instead, such as nltk.sent_tokenize. Example (you may need to install ‘nltk’ first, e.g. in Colab: !pip install nltk):

import nltk

# Download the Punkt sentence tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead).
nltk.download('punkt')

text = 'Hello World. It is named after Oliver R. Smoot, a fraternity pledge to Lambda Chi Alpha , who in October 1958 lay on the Harvard Bridge (between Boston and Cambridge , Massachusetts ), and was used by his fraternity brothers to measure the length of the bridge.'
nltk.sent_tokenize(text)

outputs

['Hello World.',
'It is named after Oliver R. Smoot, a fraternity pledge to Lambda Chi Alpha , who in October 1958 lay on the Harvard Bridge (between Boston and Cambridge , Massachusetts ), and was used by his fraternity brothers to measure the length of the bridge.']

which is what you want.
