Extracting start and end indices of a token using spaCy

by MR

May 10, 2022

I am working with a large number of sentences and want to extract the start and end character indices of a word in a given sentence.

For example, the input is: This is a sentence written in english by a native English speaker.

What I want is the character span of every occurrence of the word ‘English’, ignoring capitalization; here that is (30, 37) and (50, 57).

Note: I was pointed to this answer (Get position of word in sentence with spacy)

But this answer doesn’t solve my problem: it gives me the start character of a token but not its end index.

Any help is appreciated.

>Solution :

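You can do this with the re module in pure Python:

import re

s = "This is a sentence written in english by a native English speaker."

# Upper-case the sentence so the search is case-insensitive
# (for this ASCII sentence, upper() leaves the character offsets unchanged)
[(m.start(), m.end()) for m in re.finditer('ENGLISH', s.upper())]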

#output
[(30, 37), (50, 57)]
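
One caveat with the snippet above: a plain substring search also matches inside longer words such as "Englishman". Here is a minimal sketch of a whole-word, case-insensitive variant. The helper name word_spans is made up for illustration, and it assumes the target is a single word made of ordinary word characters.

```python
import re

def word_spans(sentence, word):
    """Return (start, end) character offsets of each whole-word, case-insensitive match."""
    # re.escape guards against regex metacharacters in `word`;
    # \b anchors the match at word boundaries on both sides.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [m.span() for m in pattern.finditer(sentence)]

s = "This is a sentence written in english by a native English speaker."
print(word_spans(s, "English"))  # [(30, 37), (50, 57)]
```

Using re.IGNORECASE instead of upper-casing the sentence also keeps the offsets tied to the original string, which matters for non-ASCII text where upper() can change the length.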

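You can do this in spaCy as well. Iterate over the tokens rather than doc.ents, since the named-entity recognizer is not guaranteed to tag both occurrences. A token's start offset is token.idx, and its end offset is the start plus the token's length:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence written in english by a native English speaker.")
for token in doc:
    if token.text.upper() == "ENGLISH":
        # token.idx is the start character offset; add the length to get the end
        print(token.idx, token.idx + len(token.text))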