Extracting start and end indices of a token using spaCy

I am processing a lot of sentences and want to extract the start and end character indices of a word in a given sentence.

For example, the input is: This is a sentence written in english by a native English speaker.

What I want are the spans of the word ‘English’, which in this case are (30, 37) and (50, 57).

Note: I was pointed to this answer (Get position of word in sentence with spacy), but it doesn’t solve my problem. It helps me get the start character of the token, but not the end index.

All help is appreciated.

Solution:

You can do this with re in pure Python:

import re

s = "This is a sentence written in english by a native English speaker."

# Case-insensitive search; each match exposes its start and end character offsets
[(m.start(), m.end()) for m in re.finditer("english", s, flags=re.IGNORECASE)]

# output
[(30, 37), (50, 57)]
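
Note that the pattern above matches substrings, so it would also report a hit inside a longer word such as "Englishman". If only whole words should count, something along these lines might work. This is a minimal sketch, not part of the original answer: the helper name `find_word_spans` is hypothetical, and it assumes that regex word boundaries (`\b`) are an acceptable definition of a "word" for your data.

```python
import re


def find_word_spans(sentence, word):
    """Return (start, end) character offsets of whole-word, case-insensitive matches."""
    # re.escape guards against regex metacharacters in the search word;
    # \b restricts matches to word boundaries so "Englishman" is not a hit.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", flags=re.IGNORECASE)
    return [(m.start(), m.end()) for m in pattern.finditer(sentence)]


s = "This is a sentence written in english by a native English speaker."
print(find_word_spans(s, "English"))  # [(30, 37), (50, 57)]
```

The end offset is exclusive, so `s[start:end]` gives back the matched word.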

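You can do it in spaCy as well, by iterating over the tokens:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence written in english by a native English speaker.")

# Iterate over tokens rather than doc.ents, which only covers recognized named entities
for token in doc:
    if token.text.lower() == "english":
        # token.idx is the start character offset; the end is the start plus the token length
        print(token.idx, token.idx + len(token.text))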