How expensive is accessing Match.group()?

Trying to optimize some code that reuses a matched group, I was wondering whether accessing Match.group() is expensive. I tried to dig in re.py's source, but the code was a bit cryptic.

A few tests seem to indicate that it is better to store the output of Match.group() in a variable, but I would like to understand what exactly happens when Match.group() is called, and whether there is another, internal way to access the content of the group directly.

Some example code to illustrate a potential use:

import re

m = re.search('X+', f'__{"X"*10000}__')

# do something
# m.group()

# do something else
# m.group()
Timings

direct access:

%%timeit
len(m.group())
# 220 ns ± 1.31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

intermediate variable:

X = m.group()

%%timeit
len(X)
# 51 ns ± 0.172 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

References:
current re.py code (python 3.10)
current sre_compile.py code (python 3.10)

removing the effect of attribute access (doesn't change much):

G = m.group

%%timeit
len(G())
# 230 ns ± 1.12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Solution:

The match object holds a reference to the original string you searched in, and indexes where each group starts and ends, including group 0, the whole matched string. Every call to group() slices the original string to create a new string to return.
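The reference and the indexes described above are exposed through the documented Match attributes string, start(), end() and span(). As a small sketch (the variable names are only illustrative), slicing the original string by hand gives the same text that group() returns:

```python
import re

text = "__" + "X" * 10 + "__"
m = re.search(r"X+", text)

# The match keeps a reference to the searched string, not a copy of the group.
assert m.string is text

# Group boundaries are stored as indexes into that string.
start, end = m.span()  # span() with no argument refers to group 0
print(start, end)      # 2 12

# Slicing by hand yields the same text as group().
assert text[start:end] == m.group()
```

This is not a faster path to the group's content: the manual slice also builds a new string, so it only illustrates what group() has to do on each call.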

Saving the return value to a variable avoids the time and memory cost of having to slice the string every time. (It also avoids repeating the method call overhead.)
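For readers without IPython, the comparison can be approximated with the standard timeit module. This is a rough sketch: the helper names repeated_calls and cached_value are made up for the example, and the absolute numbers will vary by machine and Python version.

```python
import re
import timeit

m = re.search(r"X+", "__" + "X" * 10000 + "__")

def repeated_calls():
    # Each call to m.group() slices the original string again.
    return len(m.group()) + len(m.group())

def cached_value():
    # Slice once, then reuse the stored string.
    matched = m.group()
    return len(matched) + len(matched)

n = 100_000
t_repeated = timeit.timeit(repeated_calls, number=n)
t_cached = timeit.timeit(cached_value, number=n)
print(f"repeated group(): {t_repeated / n * 1e9:.0f} ns per call")
print(f"cached variable:  {t_cached / n * 1e9:.0f} ns per call")
```

Both functions return the same value; only the number of slices differs, which is the cost being measured.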

You can see that group() isn’t just returning a cached string by the fact that the return value isn’t always the same object:

>>> import re
>>> x = re.search(r'sd', 'asdf')
>>> x.group() is x.group()
False

If you want to see the implementation of group(), it's match_group in Modules/_sre.c in the Python source code (the file is Modules/_sre/sre.c from Python 3.11 on).
