Spacy incorrectly identifying pronouns

When I try this code using Spacy, I get the desired result:

import spacy
nlp = spacy.load("en_core_web_sm")

# example 1
test = "All my stuff is at to MyBOQ"
doc = nlp(test)
for word in doc:
    if word.pos_ == 'PRON':
        print(word.text)  

The output shows All and my. However, if I add a question mark:

test = "All my stuff is at to MyBOQ?"
doc = nlp(test)
for word in doc:
    if word.pos_ == 'PRON':
        print(word.text)

Now it also identifies MyBOQ as a pronoun. It should be classified as an organization name (word.pos_ == 'ORG') instead.

How do I tell spaCy not to classify MyBOQ as a pronoun? Should I just remove all punctuation before checking for pronouns?

Solution:

When running your code on my machine (Windows 11 64-bit, Python 3.10.9, spaCy 3.4.4), spaCy produces the following results for the text with and without the question mark:

                               en_core_web_sm   en_core_web_md   en_core_web_trf
All my stuff is at to MyBOQ?   All, my          my               my
All my stuff is at to MyBOQ    All, my          my               my

In this example, the word "All" is a determiner rather than a pronoun, so only the en_core_web_md and en_core_web_trf pipelines produce technically correct results. Note that none of the three pipelines tagged "MyBOQ" as a pronoun on my machine, with or without the question mark, so I could not reproduce your result. If you're running an old version of spaCy, I'd suggest updating the package. If spaCy is already up to date, try restarting your IDE or computer to see whether the erroneous results go away. There should be no need to remove punctuation before checking for pronouns.
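If you want to run the same comparison on your own setup, a minimal sketch along these lines may help. It assumes the pipelines you pass in are already installed (for example with `python -m spacy download en_core_web_sm`); the helper names `pronouns_per_pipeline` and `load_pipelines` are made up for this illustration and are not part of spaCy.

```python
def pronouns_per_pipeline(text, pipelines):
    """Map each pipeline name to the tokens it tags as PRON in `text`.

    `pipelines` is a dict of name -> loaded spaCy pipeline. Any callable that
    returns tokens exposing `.text` and `.pos_` would work as well.
    """
    return {
        name: [tok.text for tok in nlp(text) if tok.pos_ == "PRON"]
        for name, nlp in pipelines.items()
    }


def load_pipelines(names):
    """Load the named pipelines and report the installed spaCy version."""
    # Imported here so the helper above can be used without spaCy installed.
    import spacy

    print("spaCy version:", spacy.__version__)
    return {name: spacy.load(name) for name in names}
```

Calling `load_pipelines(["en_core_web_sm", "en_core_web_md"])` once and then passing the result to `pronouns_per_pipeline` for both sentences (with and without the question mark) should show whether your installation behaves differently from mine.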

Finally, part-of-speech (PoS) tags do not include organisation names (ORG); I think you're mixing up named-entity labels with PoS tags. "MyBOQ" should be PoS-tagged as a proper noun (PROPN), which the en_core_web_md and en_core_web_trf pipelines do correctly, whereas the en_core_web_sm pipeline tags it as a plain NOUN.
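To make the distinction concrete, here is a rough sketch that reads PoS tags from the tokens (`token.pos_`) and entity labels from `doc.ents` (`ent.label_`). The function names are hypothetical, and whether "MyBOQ" is actually recognised as an ORG entity depends on the pipeline and version, so treat the output as something to inspect rather than a guaranteed result.

```python
def pronouns(doc):
    """Tokens whose coarse part-of-speech tag is PRON (set by the tagger)."""
    return [tok.text for tok in doc if tok.pos_ == "PRON"]


def organisations(doc):
    """Entity spans labelled ORG (set by the named-entity recogniser)."""
    return [ent.text for ent in doc.ents if ent.label_ == "ORG"]


def analyse(text, model="en_core_web_sm"):
    """Return (pronouns, ORG entities) for `text` using the given pipeline."""
    # Imported here so the two helpers above can be reused without spaCy.
    import spacy

    doc = spacy.load(model)(text)
    return pronouns(doc), organisations(doc)
```

For example, `print(analyse("All my stuff is at to MyBOQ?"))` prints both lists side by side, which shows that checking `word.pos_ == 'ORG'` can never match: ORG only ever appears as an entity label.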
