Match all text between \w: placeholders

April 20, 2022

I need to match the text between arbitrary \w: placeholders (for example n: text, foo: text, and n: text foo: more text; more examples are in the test script below).

To do this, I’m using Python’s re.finditer with a regex, but I can’t capture multiple words between placeholders. How can I adjust either the regex or the finditer call to do what I want?

import re

def test_query_parse_regex(query, expected_result):
    result = {}

    # perform the matching here, this needs to change
    r = r"([\w-]+):\s?([\w-]*)"
    matches = re.finditer(r, query)

    for match in matches:
        # eg 'n'
        operator = match.group(1).strip()
        # eg 'text'
        operator_value = match.group(2).strip()

        # build a dict for comparison (inside the loop, so every match is kept)
        result[operator] = operator_value

    if result == expected_result:
        print(f"PASS: {query}")
    else:
        print(f"FAIL: {query}")
        print(f"  Expected: {expected_result}")
        print(f"  Got     : {result}")


checks = [
    # Query, expected
    ("n: tom", {"n": "tom"}),
    ("n: tom preston", {"n": "tom preston"}),
    ("n: tom l: london", {"n": "tom", "l": "london"}),
    ("n: tom preston l: london derry", {"n": "tom preston", "l": "london derry"}),
]

for check in checks:
    test_query_parse_regex(*check)

Note: I’ve also tried a positive lookahead, but couldn’t make that work either: r"([\w-]+):\s?([\w-]*)(?=\w:)"

>Solution :

You can use either of these patterns:

r = r"([\w-]+):\s*(.*?)(?=[\w-]+:|$)"
r = r"([\w-]+):\s*(.*?)(?=[\w-]+:|\Z)"

Note that if your strings can contain line breaks, you will also need to pass re.DOTALL to re.finditer so that . matches them:

re.finditer(r, query, re.DOTALL)
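To illustrate the effect of re.DOTALL, here is a small sketch. The multi-line query string and the parse helper are made up for this illustration and are not part of the original test script.

```python
import re

pattern = r"([\w-]+):\s*(.*?)(?=[\w-]+:|\Z)"
query = "n: tom\npreston l: london"  # hypothetical multi-line input


def parse(flags=0):
    """Collect operator -> stripped value pairs for the sample query."""
    return {m.group(1): m.group(2).strip()
            for m in re.finditer(pattern, query, flags)}


# Without DOTALL, "." cannot cross the "\n", so the "n:" entry never matches.
without_dotall = parse()
# With DOTALL, the value of "n" is allowed to span the line break.
with_dotall = parse(re.DOTALL)

print(without_dotall)
print(with_dotall)
```

On this input, the run without the flag should keep only the l entry, while the run with re.DOTALL keeps both entries, with the line break preserved inside the value of n.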

Prefer the version with \Z if you use the re.M / re.MULTILINE option: with that flag, $ also matches at the end of every line, whereas \Z always matches only at the very end of the string.
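A short sketch of why \Z is the safer choice once re.MULTILINE is on. The input string and the parse helper are assumptions made for this example only.

```python
import re

query = "n: tom\npreston l: london"  # hypothetical multi-line input
flags = re.DOTALL | re.MULTILINE


def parse(pattern):
    """Collect operator -> stripped value pairs using the flags above."""
    return {m.group(1): m.group(2).strip()
            for m in re.finditer(pattern, query, flags)}


# With MULTILINE, "$" also matches right before the "\n", so the lazy
# group stops at the end of the first line and "preston" is dropped.
with_dollar = parse(r"([\w-]+):\s*(.*?)(?=[\w-]+:|$)")
# "\Z" only matches at the very end of the string, so the value runs
# on until the next "key:".
with_z = parse(r"([\w-]+):\s*(.*?)(?=[\w-]+:|\Z)")

print(with_dollar)
print(with_z)
```

With $, the value of n should come out as just tom; with \Z, it keeps both lines.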

Details:

([\w-]+) – Group 1: one or more word or hyphen chars
:\s* – a colon and zero or more whitespace chars
(.*?) – Group 2: zero or more chars other than line break chars (if re.DOTALL is not used), as few as possible; any whitespace before the next key ends up in this group, which is why the script calls .strip() on it
(?=[\w-]+:|\Z) – a positive lookahead that requires one or more word or hyphen chars followed by a colon, or the end of the string, immediately to the right of the current location.
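Putting it together, here is a minimal sketch of the pattern applied to the four checks from the question. The names PATTERN and parse_query are chosen for this example; the original script builds the dict inline instead.

```python
import re

# Key, colon, then a lazy value that stops right before the next
# "key:" or at the very end of the string.
PATTERN = re.compile(r"([\w-]+):\s*(.*?)(?=[\w-]+:|\Z)")


def parse_query(query):
    """Return a dict mapping each operator to its stripped value."""
    return {m.group(1): m.group(2).strip() for m in PATTERN.finditer(query)}


checks = [
    ("n: tom", {"n": "tom"}),
    ("n: tom preston", {"n": "tom preston"}),
    ("n: tom l: london", {"n": "tom", "l": "london"}),
    ("n: tom preston l: london derry", {"n": "tom preston", "l": "london derry"}),
]

for query, expected in checks:
    status = "PASS" if parse_query(query) == expected else "FAIL"
    print(f"{status}: {query}")
```

All four checks should print PASS. The .strip() call matters because the lazy group also captures the space that precedes the next key.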