regex capture groups with optional group

February 1, 2022

I’m trying to write a regex pattern that captures four groups. The first group ends when we encounter _ne, _re, or a dot.
The second group is optional: it captures re or ne if present and is empty otherwise. The third and fourth groups are easier to capture, as they are just words preceded by a dot.
Here is a code snippet that builds some sample data:

import pandas as pd 
sample = pd.Series(["abc_ne.c.d", "kc_E5_re.c.d", "kc_E5_re13.c.d", "kc_E5.c.d"]).rename('raw')

Using the following pattern, (\w+)(?:_(ne|re)\d*)\.(\w*)\.(\w*), I can capture most cases:

	raw	0	1	2	3
0	abc_ne.c.d	abc	ne	c	d
1	kc_E5_re.c.d	kc_E5	re	c	d
2	kc_E5_re13.c.d	kc_E5	re	c	d
3	kc_E5.c.d	nan	nan	nan	nan

The exception is when the second group is absent, in which case the match fails.
I tried making it optional, (\w+)(?:_(ne|re)\d*)?\.(\w*)\.(\w*),
but then the first group captures everything up to the dot:

	raw	0	1	2	3
0	abc_ne.c.d	abc_ne	nan	c	d
1	kc_E5_re.c.d	kc_E5_re	nan	c	d
2	kc_E5_re13.c.d	kc_E5_re13	nan	c	d
3	kc_E5.c.d	kc_E5	nan	c	d

This snippet could be used to capture groups with pandas if needed:

pattern = r'(\w+)(?:_(ne|re)\d*)?\.(\w*)\.(\w*)'
sample.to_frame().join(sample.str.extract(pattern))

The expected output is:

	raw	0	1	2	3
0	abc_ne.c.d	abc	ne	c	d
1	kc_E5_re.c.d	kc_E5	re	c	d
2	kc_E5_re13.c.d	kc_E5	re	c	d
3	kc_E5.c.d	kc_E5	nan	c	d

Can anyone help me get the pattern right?

Thanks in advance.

>Solution :

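I’d say you probably want to put the 2nd group inside an optional non-capturing group and make the quantifier of the 1st group lazy. Since \w also matches the underscore, a greedy \w+ swallows the _ne or _re suffix before the optional group gets a chance to match: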

^(\w+?)(?:_([nr]e)\d*)?\.(\w+)\.(\w+)$

See an online demo

^ – Start-of-line anchor;
(\w+?) – 1st capture group, lazily matching 1+ word characters (which include the underscore);
(?:_([nr]e)\d*)? – Optional non-capturing group that matches an underscore, then a nested 2nd capture group matching either ‘re’ or ‘ne’, then 0+ digits that stay outside the capture (so ‘re13’ yields ‘re’, as in the expected output);
\.(\w+)\.(\w+) – The 3rd and 4th capture groups, each matching 1+ word characters after a literal dot;
$ – End-of-line anchor.
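
To check the behaviour on the sample strings from the question, here is a minimal sketch using only Python's standard re module. The names samples, greedy, lazy and extract are made up for illustration. The pattern keeps the digits outside the 2nd group, on the assumption that ‘re13’ should yield ‘re’ as in the expected output.

```python
import re

samples = ["abc_ne.c.d", "kc_E5_re.c.d", "kc_E5_re13.c.d", "kc_E5.c.d"]

# Greedy 1st group: \w+ also matches "_", so it consumes the "_ne"/"_re"
# suffix and the optional group ends up matching nothing.
greedy = re.compile(r"^(\w+)(?:_([nr]e)\d*)?\.(\w+)\.(\w+)$")

# Lazy 1st group: \w+? grows one character at a time and stops as soon as
# the rest of the pattern (optional suffix, then the dots) can match.
lazy = re.compile(r"^(\w+?)(?:_([nr]e)\d*)?\.(\w+)\.(\w+)$")


def extract(pattern, text):
    """Return the four captured groups, or None if the pattern does not match."""
    match = pattern.match(text)
    return match.groups() if match else None


for s in samples:
    print(s, "greedy:", extract(greedy, s), "lazy:", extract(lazy, s))
```

With pandas, the same lazy pattern string can be passed to sample.str.extract(...) as in the question's snippet; groups that do not participate in the match should then show up as NaN rather than None.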