Match all text between \w: placeholders

I need to match the text between arbitrary \w: placeholders that are not known in advance (for example n: text, foo: text, and n: text foo: more text; more examples are in the test script below).

To do this, I'm using Python's re.finditer with a regex, but I can't capture multiple words between placeholders. How can I adjust either the regex or the finditer call to do what I want?

import re

def test_query_parse_regex(query, expected_result):
    result = {}

    # perform the matching here, this needs to change
    r = r"([\w-]+):\s?([\w-]*)"
    matches = re.finditer(r, query)

    for match in matches:
        # eg 'n'
        operator = match.group(1).strip()
        # eg 'text'
        operator_value = match.group(2).strip()

        # build a dict for comparison (inside the loop, so every match is recorded)
        result[operator] = operator_value

    if result == expected_result:
        print(f"PASS: {query}")
    else:
        print(f"FAIL: {query}")
        print(f"  Expected: {expected_result}")
        print(f"  Got     : {result}")


checks = [
    # Query, expected
    ("n: tom", {"n": "tom"}),
    ("n: tom preston", {"n": "tom preston"}),
    ("n: tom l: london", {"n": "tom", "l": "london"}),
    ("n: tom preston l: london derry", {"n": "tom preston", "l": "london derry"}),
]

for check in checks:
    test_query_parse_regex(*check)

Note: I've also tried a positive lookahead, but couldn't make that work either: r"([\w-]+):\s?([\w-]*)(?=\w:)"

Solution:

You can use either of the following patterns:

r = r"([\w-]+):\s*(.*?)(?=[\w-]+:|$)"
r = r"([\w-]+):\s*(.*?)(?=[\w-]+:|\Z)"

Note that if your strings can contain line breaks, you will also need to change the re.finditer call to

re.finditer(r, query, re.DOTALL)
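
To make the effect of re.DOTALL concrete, here is a minimal sketch. The sample query (a value that wraps onto a second line) and the helper name pairs are assumptions made up for illustration; they are not from the question.

```python
import re

# The \Z variant of the pattern from this answer.
PATTERN = r"([\w-]+):\s*(.*?)(?=[\w-]+:|\Z)"

# Hypothetical input: the value of "n" contains a line break.
query = "n: tom\npreston l: london"


def pairs(flags=0):
    """Return (placeholder, value) tuples found in the sample query."""
    return [
        (m.group(1), m.group(2).strip())
        for m in re.finditer(PATTERN, query, flags)
    ]


without_dotall = pairs()
with_dotall = pairs(re.DOTALL)

print(without_dotall)  # [('l', 'london')]
print(with_dotall)     # [('n', 'tom\npreston'), ('l', 'london')]
```

Without the flag, . cannot cross the line break, so no match can start at n: and that placeholder is silently dropped. With re.DOTALL the value is captured across both lines.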

Prefer the version with \Z if you use the re.M (re.MULTILINE) option: with that flag, $ also matches before every line break, whereas \Z always matches only at the very end of the string.
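
The difference between $ and \Z under re.MULTILINE can be seen in a small comparison like the one below. This is only a sketch: the sample query and the names DOLLAR, END_Z and pairs are assumptions chosen for the illustration.

```python
import re

# The two variants from this answer, differing only in the end anchor.
DOLLAR = r"([\w-]+):\s*(.*?)(?=[\w-]+:|$)"
END_Z = r"([\w-]+):\s*(.*?)(?=[\w-]+:|\Z)"

# Hypothetical input: the value of "n" contains a line break.
query = "n: tom\npreston l: london"
flags = re.MULTILINE | re.DOTALL


def pairs(pattern):
    """Return (placeholder, value) tuples for the sample query."""
    return [
        (m.group(1), m.group(2).strip())
        for m in re.finditer(pattern, query, flags)
    ]


dollar_result = pairs(DOLLAR)
z_result = pairs(END_Z)

print(dollar_result)  # [('n', 'tom'), ('l', 'london')]
print(z_result)       # [('n', 'tom\npreston'), ('l', 'london')]
```

With re.MULTILINE, $ is satisfied at the end of the first line, so the lazy group stops at "tom" and "preston" is lost. \Z is unaffected by the flag and lets the value run up to the next placeholder.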

Details:

  • ([\w-]+) – Group 1: one or more word or hyphen chars
  • :\s* – a colon and zero or more whitespace chars
  • (.*?) – Group 2: zero or more chars other than line break chars (if re.DOTALL is not used), as few as possible
  • (?=[\w-]+:|\Z) – a positive lookahead that requires one or more word or hyphen chars followed by a colon, or the end of string, immediately to the right of the current location.
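
Putting it together, the test script from the question could be reduced to something like the sketch below. The function name parse_query is hypothetical; the pattern and the checks are taken from the question and this answer.

```python
import re

# The \Z variant from this answer, compiled once for reuse.
PATTERN = re.compile(r"([\w-]+):\s*(.*?)(?=[\w-]+:|\Z)")


def parse_query(query):
    """Map each placeholder (e.g. 'n') to the stripped text that follows it."""
    return {m.group(1): m.group(2).strip() for m in PATTERN.finditer(query)}


checks = [
    # Query, expected
    ("n: tom", {"n": "tom"}),
    ("n: tom preston", {"n": "tom preston"}),
    ("n: tom l: london", {"n": "tom", "l": "london"}),
    ("n: tom preston l: london derry", {"n": "tom preston", "l": "london derry"}),
]

for query, expected in checks:
    result = parse_query(query)
    status = "PASS" if result == expected else "FAIL"
    print(f"{status}: {query} -> {result}")
```

One caveat with this approach: any run of word chars followed by a colon inside a value is treated as a new placeholder, so an input such as "t: 10:30" yields {"t": "", "10": "30"} rather than {"t": "10:30"}.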