spaCy Matcher pattern: combining IN and REGEX

My goal is to use spaCy to match sentences that contain one of the following words:
['studium', 'abschluss', 'ausbildung']

I can solve the problem with this line:

pattern = [{"LOWER": {'IN':['studium','abschluss', 'ausbildung']}}]

My problem is that German makes heavy use of compound words such as Hochschulstudium, Masterstudium, Studiengang, etc.

How can I use a regex inside the IN list to match all words containing the word Studium?

>Solution :

You can use the REGEX operator instead of IN and fold the word list into the regex:

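import re
# Words that must match as whole tokens
l = ['abschluss', 'ausbildung']
# Whole-token match: one of the words in l, or any run of letters ending in "studium"
pattern = [{'LOWER': {'REGEX':fr'^(?:{"|".join(map(re.escape, l))}|[^\W\d_]*studium)$'}}]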

Note:

  • map(re.escape, l) – escapes the items in the l list
  • "|".join(...) – joins the words as alternatives (word1|word2|wordN)
  • ^(?:...|[^\W\d_]*studium)$ – a regex that matches
    • ^ – start of string (here, token)
    • (?:...|[^\W\d_]*studium) – a non-capturing group matching either any of the l items or zero or more letters ([^\W\d_]*) followed by studium
    • $ – end of string (here, token).
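
To see which tokens the regex accepts, here is a minimal sketch that tests the same pattern string with plain re, without spaCy. The sample tokens are made up for illustration, the list is named words instead of l, and each token is lowercased by hand to mimic what the LOWER attribute would give the regex.

```python
import re

# Same word list as in the answer (renamed from l for readability)
words = ['abschluss', 'ausbildung']

# Same regex string that is passed to the REGEX operator above
regex = fr'^(?:{"|".join(map(re.escape, words))}|[^\W\d_]*studium)$'
compiled = re.compile(regex)

# Hypothetical sample tokens, not taken from the question's data
tokens = ['Studium', 'Hochschulstudium', 'Masterstudium',
          'Studiengang', 'Abschluss', 'Ausbildungsplatz']

# Lowercase each token by hand to mimic the LOWER attribute
matched = [t for t in tokens if compiled.search(t.lower())]
print(matched)
# → ['Studium', 'Hochschulstudium', 'Masterstudium', 'Abschluss']
```

Studiengang and Ausbildungsplatz are not matched, because the anchors require the token to end in studium or to equal one of the listed words exactly. If such compounds should match too, the pattern would have to be loosened, for example by matching on a shorter stem.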
