regex capture groups with optional group

I’m trying to write a regex pattern that captures four groups. The first group ends when we encounter _ne, _re, or a dot.
The second group is optional: it captures the ne or re if present and is empty otherwise. The third and fourth groups are easier to capture, as they are just words preceded by a dot.
Here is a code snippet to generate some sample data:

import pandas as pd 
sample = pd.Series(["abc_ne.c.d", "kc_E5_re.c.d", "kc_E5_re13.c.d", "kc_E5.c.d"]).rename('raw')

Using the pattern (\w+)(?:_(ne|re)\d*)\.(\w*)\.(\w*) I can capture most cases:

   raw             0      1    2    3
0  abc_ne.c.d      abc    ne   c    d
1  kc_E5_re.c.d    kc_E5  re   c    d
2  kc_E5_re13.c.d  kc_E5  re   c    d
3  kc_E5.c.d       NaN    NaN  NaN  NaN

The exception is when the second group is absent, in which case the match fails.
I tried making it optional with (\w+)(?:_(ne|re)\d*)?\.(\w*)\.(\w*),
but then the first group captures everything up to the first dot.

   raw             0           1    2  3
0  abc_ne.c.d      abc_ne      NaN  c  d
1  kc_E5_re.c.d    kc_E5_re    NaN  c  d
2  kc_E5_re13.c.d  kc_E5_re13  NaN  c  d
3  kc_E5.c.d       kc_E5       NaN  c  d

This snippet can be used to extract the groups with pandas:

pattern = r'(\w+)(?:_(ne|re)\d*)?\.(\w*)\.(\w*)'
sample.to_frame().join(sample.str.extract(pattern))

The expected output is:

   raw             0      1    2  3
0  abc_ne.c.d      abc    ne   c  d
1  kc_E5_re.c.d    kc_E5  re   c  d
2  kc_E5_re13.c.d  kc_E5  re   c  d
3  kc_E5.c.d       kc_E5  NaN  c  d

Can anyone help me get the pattern right?

Thanks in advance.

>Solution :

I’d say you probably want to wrap the 2nd group in an optional non-capture group and make the quantifier of the 1st group lazy. Keeping the digits outside the 2nd group means ‘re13’ is captured as ‘re’, as in your expected output:

^(\w+?)(?:_([nr]e)\d*)?\.(\w+)\.(\w+)$

  • ^ – Start-line anchor;
  • (\w+?) – 1st capture group to catch 1+ (lazy) word-characters (thus including underscore);
  • (?:_([nr]e)\d*)? – Optional non-capture group to match an underscore, a nested 2nd capture group to match either ‘re’ or ‘ne’, and 0+ trailing digits that are matched but not captured;
  • \.(\w+)\.(\w+) – Match the 3rd and 4th capture groups in succession, each preceded by a literal dot;
  • $ – End-line anchor.
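
To see why the lazy quantifier matters, here is a minimal sketch using only the standard `re` module on the sample strings from the question. The helper `extract` and the names `greedy`, `lazy`, `greedy_rows` and `lazy_rows` are just illustrative. In this sketch the digits `\d*` sit outside the second capture group, so `re13` yields `re`.

```python
import re

samples = ["abc_ne.c.d", "kc_E5_re.c.d", "kc_E5_re13.c.d", "kc_E5.c.d"]

# Greedy first group: \w also matches "_", so \w+ swallows "_ne"/"_re",
# and the optional group is then satisfied by matching nothing.
greedy = re.compile(r'(\w+)(?:_(ne|re)\d*)?\.(\w*)\.(\w*)')

# Lazy first group: \w+? grows one character at a time, so the optional
# "_ne"/"_re" group gets a chance to match before the first dot.
lazy = re.compile(r'^(\w+?)(?:_([nr]e)\d*)?\.(\w+)\.(\w+)$')

def extract(pattern, text):
    """Return the tuple of captured groups, or None if there is no match."""
    m = pattern.search(text)
    return m.groups() if m else None

greedy_rows = [extract(greedy, s) for s in samples]
lazy_rows = [extract(lazy, s) for s in samples]

for s, row in zip(samples, lazy_rows):
    print(s, row)
```

The same pattern string should also work with `sample.str.extract(pattern)` as in the question's snippet, where an unmatched optional group shows up as NaN instead of `None`.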