Regex doesn't stop after sign

Hi, I have a regex like this:

(.*(?=\sI+)*) (.*)

But it doesn't capture the groups the way I need.

For this example data:

  1. Vladimir Goth
  2. Langraab II Landgraab
  3. Léa Magdalena III Rouault Something
  4. Anna Maria Teodora
  5. Léa Maria Teodora II

Only 1 and 2 are captured correctly.

So what I need is

  • If there is no I+, split at the first space.
  • If other words follow I+, the first group should contain everything up to and including I+. So group 1 for the 3rd example should be Léa Magdalena III.
  • If no other words follow I+, as in example 5, group 1 should capture only up to the first space.

@Edit
I+ should be replaced by Roman numerals

>Solution :

If you want to support any Roman numeral, you can use

^(\S+(?:.*\b(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\b(?= +\S))?) +(.*)

If you only need to support Roman numerals below XX (that is, I through XIX):

^(\S+(?:.*\b(?=[XVI])X?(?:IX|IV|V?I{0,3})\b(?= +\S))?) +(.*)

See the regex demo #1 and demo #2. In Java code, replace the literal spaces with \h or \s, and double every backslash inside the string literal.
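To see where the shorter pattern stops recognizing numerals, here is a small sketch. The class name ShortRomanDemo, the helper firstGroup, and both sample names are made up for illustration and do not come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ShortRomanDemo {
    // Shorter variant: recognizes I through XIX only.
    // \s stands in for the literal spaces of the original pattern.
    private static final Pattern NAME = Pattern.compile(
        "^(\\S+(?:.*\\b(?=[XVI])X?(?:IX|IV|V?I{0,3})\\b(?=\\s+\\S))?)\\s+(.*)");

    /** Returns group 1, or null if the line does not match at all. */
    static String firstGroup(String line) {
        Matcher m = NAME.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical inputs, chosen only to show the XX cut-off.
        System.out.println(firstGroup("Louis XIV Bourbon"));
        System.out.println(firstGroup("Henry XX Tudor"));
    }
}
```

With these inputs, XIV is kept in group 1 ("Louis XIV"), while XX is not recognized as a numeral, so the pattern falls back to splitting at the first space ("Henry").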

Details:

  • ^ – start of string
  • ( – Group 1 start:
    • \S+ – one or more non-whitespaces
    • (?: – a non-capturing group:
      • .* – any zero or more chars other than line break chars as many as possible
      • \b – a word boundary
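      • (?=[MDCLXVI]) – require at least one Roman digit immediately to the right (every part of the Roman number pattern is optional, so without this check it could match an empty string)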
      • M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}) – a Roman number pattern
      • \b – a word boundary
      • (?= +\S) – a positive lookahead that requires one or more spaces and then one non-whitespace right after the current position
    • )? – end of the non-capturing group, repeat one or zero times (it is optional)
  • ) – end of the first group
  • + – one or more spaces
  • (.*) – Group 2: the rest of the line.
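The role of the (?=[MDCLXVI]) lookahead is easier to see when the Roman number part is tested on its own. The following is a minimal sketch; the class name RomanCheck and its members are hypothetical names introduced here for illustration.

```java
import java.util.regex.Pattern;

final class RomanCheck {
    // The Roman number part of the answer's pattern, without the lookahead.
    private static final String ROMAN_BODY =
        "M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})";

    // Every component is optional, so this also matches the empty string.
    static final Pattern UNGUARDED = Pattern.compile(ROMAN_BODY);

    // The lookahead forces at least one Roman digit to be present.
    static final Pattern GUARDED = Pattern.compile("(?=[MDCLXVI])" + ROMAN_BODY);

    /** True only when the whole string is a non-empty Roman numeral. */
    static boolean isRoman(String s) {
        return GUARDED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(UNGUARDED.matcher("").matches()); // empty input accepted
        System.out.println(isRoman(""));                      // rejected by the lookahead
        System.out.println(isRoman("MCMXCIV"));
        System.out.println(isRoman("Maria"));
    }
}
```

Inside the full pattern, an empty numeral sitting at a word boundary would still be rejected by the later (?= +\S) check in most cases, but the lookahead makes the intent explicit and lets the engine fail early.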

In Java:

String regex = "^(\\S+(?:.*\\b(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\\b(?=\\h+\\S))?)\\h+(.*)";
// Or, for Roman numerals below XX only (use this instead of the line above):
String regex = "^(\\S+(?:.*\\b(?=[XVI])X?(?:IX|IV|V?I{0,3})\\b(?=\\s+\\S))?)\\s+(.*)";
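As a rough end-to-end check, the full pattern can be run against the five sample names from the question. This is only a sketch: the class name NameSplitter and the split helper are assumptions made for this example, and the accented name is written with a Unicode escape so that the source compiles under any file encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NameSplitter {
    // Full Roman numeral variant from the answer, with \h for the literal spaces.
    private static final Pattern NAME = Pattern.compile(
        "^(\\S+(?:.*\\b(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})"
        + "(?:IX|IV|V?I{0,3})\\b(?=\\h+\\S))?)\\h+(.*)");

    /** Returns {group 1, group 2}, or null when the line contains no space to split on. */
    static String[] split(String line) {
        Matcher m = NAME.matcher(line);
        return m.find() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        // L\u00e9a is the accented first name from the sample data.
        String[] samples = {
            "Vladimir Goth",
            "Langraab II Landgraab",
            "L\u00e9a Magdalena III Rouault Something",
            "Anna Maria Teodora",
            "L\u00e9a Maria Teodora II"
        };
        for (String s : samples) {
            String[] parts = split(s);
            System.out.println(parts == null
                ? "no match: " + s
                : "[" + parts[0] + "] [" + parts[1] + "]");
        }
    }
}
```

For the sample data, group 1 comes out as Vladimir, Langraab II, Léa Magdalena III, Anna, and Léa respectively, which matches the three rules stated in the question.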