How to get mentions in PyTorch NER instead of tokens?

I am using PyTorch and a pre-trained model.

Here is my code:

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline


class NER(object):
    def __init__(self, model_name_or_path, tokenizer_name_or_path):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
        self.model = AutoModelForTokenClassification.from_pretrained(
            model_name_or_path)
        # Token-level NER pipeline: returns one dict per tagged token
        self.nlp = pipeline("ner", model=self.model, tokenizer=self.tokenizer)

    def get_mention_entities(self, query):
        return self.nlp(query)

When I call get_mention_entities and print its output for "اینجا دانشگاه صنعتی امیرکبیر است."

it gives:

[{'entity': 'B-FAC', 'score': 0.9454591, 'index': 2, 'word': 'دانشگاه', 'start': 6, 'end': 13}, {'entity': 'I-FAC', 'score': 0.9713519, 'index': 3, 'word': 'صنعتی', 'start': 14, 'end': 19}, {'entity': 'I-FAC', 'score': 0.9860724, 'index': 4, 'word': 'امیرکبیر', 'start': 20, 'end': 28}]

As you can see, it can recognize the university name, but there are three tokens in the list.

Is there any standard way to combine these tokens based on the "entity" attribute?

The desired output is something like:

[{'entity': 'FAC', 'word': 'دانشگاه صنعتی امیرکبیر', 'start': 6, 'end': 28}]

I could write a function myself to iterate over the tokens, compare them, and merge them based on the "entity" attribute, but I would prefer a standard way, such as a built-in PyTorch function or something similar.

My question is similar to this question.

PS: "دانشگاه صنعتی امیرکبیر" is a university name.

Solution:

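The NER pipeline comes from Hugging Face's transformers library rather than from PyTorch itself, and it has an argument grouped_entities=True which does exactly what you seek: it groups the B- and I- tagged tokens into unified entities. In the grouped output, the label is reported without its B-/I- prefix under the key 'entity_group' instead of 'entity'. In recent versions of transformers, grouped_entities is deprecated in favor of aggregation_strategy, and aggregation_strategy="simple" gives the equivalent grouping.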

Changing the pipeline construction in __init__ to

self.nlp = pipeline("ner", model=self.model, tokenizer=self.tokenizer, grouped_entities=True)

should do the trick.
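
For reference, if you ever need to do the merge by hand (for example on output you have already stored), here is a minimal sketch of the idea. It is not the library's implementation; merge_bio_tokens is a hypothetical helper, and it assumes each token dict carries the 'entity', 'index', 'start' and 'end' keys shown in the question, and that an I- tag continues a mention only when it has the same label and directly follows the previous token.

```python
def merge_bio_tokens(tokens, text):
    """Merge B-/I- tagged token dicts into entity mentions (illustrative sketch)."""
    mentions = []
    last_index = None
    for tok in tokens:
        if tok["entity"] == "O":
            # Untagged token: skip it, which also breaks contiguity below
            continue
        prefix, _, label = tok["entity"].partition("-")
        # Extend the open mention only for an adjacent I- tag with the same label
        continues = (
            bool(mentions)
            and prefix == "I"
            and label == mentions[-1]["entity"]
            and tok["index"] == last_index + 1
        )
        if continues:
            mentions[-1]["end"] = tok["end"]
        else:
            # A B- tag (or a stray I- tag) starts a new mention
            mentions.append(
                {"entity": label, "start": tok["start"], "end": tok["end"]}
            )
        last_index = tok["index"]
    # Recover the surface string from the character offsets
    for mention in mentions:
        mention["word"] = text[mention["start"]:mention["end"]]
    return mentions


text = "اینجا دانشگاه صنعتی امیرکبیر است."
tokens = [
    {"entity": "B-FAC", "index": 2, "start": 6, "end": 13},
    {"entity": "I-FAC", "index": 3, "start": 14, "end": 19},
    {"entity": "I-FAC", "index": 4, "start": 20, "end": 28},
]
mentions = merge_bio_tokens(tokens, text)
print(mentions)
```

Slicing the original text by 'start' and 'end' keeps the spacing of the input intact, which is simpler than re-joining the 'word' fields, especially when the tokenizer splits words into subword pieces.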
