Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Why atomic grouping in Python is slower than a simple non-capturing alternation?

Python re module does not have atomic grouping, it can be however emulated, for example see this Q&A. I was under the impression that using atomic grouping would be faster than using a simple alternation regex, due to the fact that it will not try all the alternatives in the group. But this does not hold, see below:

import re
import timeit
import random

random.seed(42)

words = ["tricky", "liquid", "sleepy", "crowded", "half", "secretary", "roll", "educate", "medical", "closed",
         "unaccountable", "earthy", "permit", "pleasant", "confuse", "enter", "land", "encourage", "connection",
         "mindless", "spicy",
         "cracker", "twist"]

atomic_group = re.compile(
    r"(?=(unaccountable|l(?:and|iquid)|half|t(?:ricky|wist)|roll|p(?:ermit|leasant)|s(?:picy|ecretary|leepy)|e(?:ducate|arthy|n(?:ter|courage))|m(?:indless|edical)|c(?:r(?:owded|acker)|on(?:fuse|nection)|losed)))\1")
non_atomic_group = re.compile(
    r"(?:unaccountable|l(?:and|iquid)|half|t(?:ricky|wist)|roll|p(?:ermit|leasant)|s(?:picy|ecretary|leepy)|e(?:ducate|arthy|n(?:ter|courage))|m(?:indless|edical)|c(?:r(?:owded|acker)|on(?:fuse|nection)|losed))")

sentence = " ".join(random.choices(words, k=10000))

print(timeit.timeit("atomic_group.findall(sentence)",
                    setup="from __main__ import atomic_group, sentence",
                    number=10))

print(timeit.timeit("non_atomic_group.findall(sentence)",
                    setup="from __main__ import non_atomic_group, sentence",
                    number=10))

Output

0.036498754999999994
0.028361783

The same behaviour is observed for larger datasets as is shown in the following plot:
enter image description here

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

In the plot len(data) represents an increasing number of sentences (strings formed by 60 words). The code to reproduce it can be found here.

Is my assumption incorrect? On a more general note how can I write a regular expression (in Python) that will only try one of the branches in an alternation regex and none of the others?

>Solution :

Your assumption is not correct. The whole point of atomic patterns is to prevent backtracking into the pattern.

The atomic_group pattern is of (?=(...))\1 type in your code and the non-atomic one is of (?:...) type. So, the first one already loses to the second one due to the use of a capturing group, see capturing group VS non-capturing group.

Besides, you need to match the strings twice with the atomic_group pattern, first, with the lookahead, second, with the backreference.

So, only use this techinque when you need to control backtracking inside a longer pattern.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading