Python regex: split a string with multiple delimiters

I know this question has been answered, but my use case is slightly different. I am trying to set up a regex pattern to split a few strings into a list.

Input Strings:

1. "ABC-QWERT01"
2. "ABC-QWERT01DV"
3. "ABCQWER01"

Criteria for the string (each numbered part is described in the list that follows):

ABC  -  QWERT  01  DV
 1   2    3    4   5


  1. The string always starts with three characters
  2. The dash is optional
  3. Then come 3-10 characters
  4. Then a two-digit number, left-padded with zeros (00-99)
  5. The suffix is two characters and is optional

Expected Output

1. ['ABC','-','QWERT','01']
2. ['ABC','-','QWERT','01','DV']
3. ['ABC','QWER','01']

I have tried the following patterns in several different ways, but I am missing something. My idea was to start at the beginning of the string, split after the first three characters or the dash, and then split on the occurrence of two digits.

Pattern 1: r"([ -?, \d{2}])+"
This works but doesn’t break up the string by the first three chars if the dash is missing

Pattern 2: r"([^[a-z]{3}, -?, \d{2}])+"
This fails to match, so nothing gets split

Pattern 3: r"([^[a-z]{3}|-?, \d{2}])+"
This also fails to match, so nothing gets split
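
A note on why Pattern 1 behaves this way: square brackets in a regex define a character class, which matches exactly one character from a set, and commas and spaces in a pattern are literal characters, not separators between alternatives. The sketch below uses only the standard re module to show this; the variable names attempt and single are just for illustration.

```python
import re

# Pattern 1 from the question; "attempt" is an illustrative name.
attempt = r"([ -?, \d{2}])+"

# The bracket expression is one character class: the range ' ' to '?'
# (which covers digits and '-'), plus ',', '{', '2', '}' as literals.
single = re.compile(r"[ -?, \d{2}]")
print(single.fullmatch("7") is not None)   # one character from the set matches
print(single.fullmatch("01") is not None)  # two characters do not

# A repeated capture group keeps only its last repetition, so re.split
# returns '1' rather than '01', and never splits after the first three letters.
print(re.split(attempt, "ABC-QWERT01"))
print(re.split(attempt, "ABCQWER01"))
```

This prints True, False, ['ABC', '-', 'QWERT', '1', ''] and ['ABCQWER', '1', '']. The dash is found only because '-' happens to fall inside the range from space to '?'.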

Any tips or suggestions?

Solution:

You can use a pattern similar to:

(?i)([A-Z]{3})(-?)([A-Z]*)([0-9]{2})([A-Z]*)

Code:

import re


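def _parts(s):
    # Five capture groups: prefix, optional dash, letters, two digits, optional suffix.
    p = r'(?i)([A-Z]{3})(-?)([A-Z]*)([0-9]{2})([A-Z]*)'
    # With several groups, findall returns a list of tuples, one tuple per match.
    return re.findall(p, s)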


print(_parts('ABC-QWERT01DV'))
print(_parts('ABCQWER01'))
print(_parts('ABC-QWERT01'))

Prints

[('ABC', '-', 'QWERT', '01', 'DV')]
[('ABC', '', 'QWER', '01', '')]
[('ABC', '-', 'QWERT', '01', '')]
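
The tuples above still contain empty strings for the parts that are absent, whereas the question asks for flat lists without them. One way to get that shape, as a sketch rather than a definitive answer, is to match the whole string with re.fullmatch and keep only the non-empty groups. The pattern is the one from the answer; the names PATTERN and split_parts are hypothetical.

```python
import re

# Same pattern as in the answer, compiled once.
PATTERN = re.compile(r'(?i)([A-Z]{3})(-?)([A-Z]*)([0-9]{2})([A-Z]*)')


def split_parts(s):
    """Return the non-empty parts of s as a list, or [] if s does not match."""
    m = PATTERN.fullmatch(s)
    if m is None:
        return []
    # Drop groups that matched the empty string (missing dash or suffix).
    return [g for g in m.groups() if g]


print(split_parts('ABC-QWERT01'))    # ['ABC', '-', 'QWERT', '01']
print(split_parts('ABC-QWERT01DV'))  # ['ABC', '-', 'QWERT', '01', 'DV']
print(split_parts('ABCQWER01'))      # ['ABC', 'QWER', '01']
```

Using fullmatch instead of findall also means a string with trailing or leading junk is rejected rather than partially matched.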

Notes:

  • (?i): case-insensitive flag.
  • ([A-Z]{3}): capture group 1 with any 3 letters.
  • (-?): capture group 2 with an optional dash.
  • ([A-Z]*): capture group 3 with 0 or more letters.
  • ([0-9]{2}): capture group 4 with 2 digits.
  • ([A-Z]*): capture group 5 with 0 or more letters.
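
The pattern above is more permissive than the criteria in the question, since [A-Z]* accepts any number of letters. If stricter validation is wanted, the counts can be written into the pattern. The following is a sketch under the assumption that "chars" in the criteria means ASCII letters; the names STRICT and parse_strict are illustrative.

```python
import re

# Assumption: "chars" in the criteria means ASCII letters (case-insensitive).
STRICT = re.compile(
    r'([A-Z]{3})'      # 1. exactly three letters
    r'(-?)'            # 2. optional dash
    r'([A-Z]{3,10})'   # 3. three to ten letters
    r'([0-9]{2})'      # 4. two digits, e.g. 01
    r'([A-Z]{2})?',    # 5. optional two-letter suffix
    re.IGNORECASE,
)


def parse_strict(s):
    """Return the non-empty parts of s, or None if s breaks the criteria."""
    m = STRICT.fullmatch(s)
    if m is None:
        return None
    # An unmatched optional group is None and an absent dash is '';
    # both are dropped here.
    return [g for g in m.groups() if g]


print(parse_strict('ABC-QWERT01DV'))  # ['ABC', '-', 'QWERT', '01', 'DV']
print(parse_strict('ABCQWER01'))      # ['ABC', 'QWER', '01']
print(parse_strict('ABC-QW01'))       # None: only two letters after the dash
```

Returning None for invalid input keeps a malformed string distinguishable from a valid one, which the looser pattern cannot do.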