spacy matcher pattern IN + REGEX Tag

My goal is to use spaCy to match sentences that contain one of the following words:
['studium', 'abschluss', 'ausbildung']

I can solve the problem with this line:

pattern = [{"LOWER": {'IN':['studium','abschluss', 'ausbildung']}}]

My problem is that German makes heavy use of compound words such as Hochschulstudium, Masterstudium, Studiengang, etc.

How can I use a regex inside the IN list to match all words containing the word Studium?

Solution:

You can use the REGEX operator:

import re
l = ['abschluss', 'ausbildung']
pattern = [{'LOWER': {'REGEX':fr'^(?:{"|".join(map(re.escape, l))}|[^\W\d_]*studium)$'}}]

Note:

  • map(re.escape, l) – escapes the items in the l list
  • "|".join(...) – joins the words as alternatives (word1|word2|wordN)
  • ^(?:...|[^\W\d_]*studium)$ – a regex that matches the whole token:
    • ^ – start of string (here, the token)
    • (?:...|[^\W\d_]*studium) – a non-capturing group matching either one of the l items or zero or more letters ([^\W\d_]*) followed by studium
    • $ – end of string (here, the token).
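
To see which tokens this pattern accepts, the regex can be tried outside spaCy with only the standard re module. This is a minimal sketch, not a spaCy run: the sample words and the helper name token_matches are made up for illustration, and lowercasing the text by hand stands in for the LOWER attribute. It assumes the REGEX operator applies a search to the token text, which is why the ^ and $ anchors matter.

```python
import re

# Exact words from the question, plus any token that ends in "studium".
l = ['abschluss', 'ausbildung']
alternatives = "|".join(map(re.escape, l))
regex = re.compile(fr'^(?:{alternatives}|[^\W\d_]*studium)$')

def token_matches(token_text):
    # Hypothetical helper: lowercasing mimics the LOWER attribute.
    return bool(regex.search(token_text.lower()))

# Made-up sample tokens for illustration.
samples = ['Studium', 'Hochschulstudium', 'Masterstudium',
           'Abschluss', 'Studiengang', 'Ausbildungen']
results = {word: token_matches(word) for word in samples}

for word, ok in results.items():
    print(f'{word}: {ok}')
```

Note that Studiengang is not matched, because it does not end in studium. If such compounds should match too, the last alternative would need a different stem, for example [^\W\d_]*studi[^\W\d_]* (an assumption about what you want to catch, not part of the answer above).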