My goal is to use spaCy to match sentences that contain one of the following words:
['studium', 'abschluss', 'ausbildung']
I can solve the problem with this line:
pattern = [{"LOWER": {'IN':['studium','abschluss', 'ausbildung']}}]
My problem is that German makes heavy use of compound words such as Hochschulstudium, Masterstudium, Studiengang, etc.
How can I use a regex inside the IN clause to match all words containing the word Studium?
>Solution :
You can use the REGEX operator:
import re
l = ['abschluss', 'ausbildung']
pattern = [{'LOWER': {'REGEX':fr'^(?:{"|".join(map(re.escape, l))}|[^\W\d_]*studium)$'}}]
Note:
- map(re.escape, l) – escapes the items in the l list
- "|".join(...) – joins the words as alternatives (word1|word2|wordN)
- ^(?:...|[^\W\d_]*studium)$ – a regex that matches
  - ^ – start of string (here, the token)
  - (?:...|[^\W\d_]*studium) – a non-capturing group matching any of the l items, or zero or more letters ([^\W\d_]*) followed by studium
  - $ – end of string (here, the token).
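To check which tokens the pattern accepts without loading a spaCy pipeline, the same regex can be tried with plain re. This is only a sketch: the sample words and the helper name matches are made up for illustration, and lowercasing the text stands in for what the LOWER attribute provides. It assumes the regex is applied to each token's text on its own.

```python
import re

# Same construction as in the answer; `words` plays the role of `l`.
words = ['abschluss', 'ausbildung']
token_regex = fr'^(?:{"|".join(map(re.escape, words))}|[^\W\d_]*studium)$'
compiled = re.compile(token_regex)

def matches(token_text):
    """Hypothetical helper: test one token, lowercased as LOWER would be."""
    return compiled.search(token_text.lower()) is not None

# Illustrative sample tokens (not from the original question's data).
samples = ['Studium', 'Hochschulstudium', 'Masterstudium',
           'Abschluss', 'Studiengang', 'Studium2']
results = {s: matches(s) for s in samples}

for word, ok in results.items():
    print(f'{word}: {ok}')
```

Note that Studiengang is not matched, since it contains "studien" rather than "studium"; covering such forms would need an additional alternative in the regex.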