Regex (Kotlin) to match end of sentence periods only and ignore periods in the middle such as abbreviations

I need a regex that finds all sentence-ending periods and ignores mid-sentence periods, such as those in abbreviations.
Note: I understand that there are many other variations and that it may not be possible to account for all of them, so the focus of the question is: can at least the sample below be solved with a regex?

Suppose I have the text below. The regex rule below matches any period followed by whitespace, but it also matches the periods in p.m. and U.S. How can I ignore a) periods in a word made up of single characters each followed by a period (such as U.S.), and b) a period preceded by only one character (such as J.)?
This is in Kotlin.

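        val text = "At 12.51 p.m. local time, J. Knapp, former U.S. Navy,  went out for a walk. Yes he did. And then a Mw6.3 earthquake happened."
        // Matches every period that is followed by at least one whitespace character
        val regexRule = "\\.\\s+"
        val splitText = text.split(regexRule.toRegex())
        // split() drops the matched period, so it is added back when joining
        val result = splitText.joinToString(separator = ".\n\n")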

Current result with just that rule:

At 12.51 p.m.

local time, J.

Knapp, former U.S.

Navy, went out for a walk.

Yes he did.

And then a Mw6.3 earthquake happened.

>Solution :

You can use

        val regexRule = "(?<!\\b\\p{L})\\.(?<!\\d.(?=\\d))(?!\\s*\$)\\s*"

See the regex demo.

Details:

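  • (?<!\b\p{L}) – a negative lookbehind: the period must not be immediately preceded by a single letter that itself follows a word boundary (this skips J., p.m. and U.S.)
  • \. – a literal dot
  • (?<!\d.(?=\d)) – a negative lookbehind checked right after the dot: the dot must not sit between two digits (this skips 12.51 and Mw6.3); the unescaped . inside it re-matches the dot that was just consumed
  • (?!\s*$) – a negative lookahead: the dot must not be followed by zero or more whitespace characters and then the end of the string, so the final period of the text is not matched
  • \s* – zero or more whitespace characters.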
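
As a rough illustration of how the pattern above could be applied to the sample text from the question, here is a minimal sketch. The helper name `splitSentences` is hypothetical and not part of the original answer. Because `split` discards the matched period together with the whitespace after it, the sketch appends a period to every piece except the last one, whose period is never matched.

```kotlin
// Pattern copied from the answer above (the dollar sign is escaped for a regular Kotlin string).
private val sentenceEnd = Regex("(?<!\\b\\p{L})\\.(?<!\\d.(?=\\d))(?!\\s*\$)\\s*")

// Hypothetical helper: splits text at sentence-ending periods and keeps the periods.
fun splitSentences(text: String): List<String> {
    val parts = text.split(sentenceEnd)
    // Every piece except the last lost its period to the split, so restore it.
    return parts.mapIndexed { i, part -> if (i < parts.lastIndex) "$part." else part }
}

fun main() {
    val text = "At 12.51 p.m. local time, J. Knapp, former U.S. Navy,  went out for a walk. " +
        "Yes he did. And then a Mw6.3 earthquake happened."
    println(splitSentences(text).joinToString(separator = "\n\n"))
}
```

For the sample text this should yield three pieces: the first sentence up to "went out for a walk.", then "Yes he did.", then "And then a Mw6.3 earthquake happened.". Note that a sentence that actually ends with an abbreviation such as p.m. or U.S. would not be split by this rule.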