Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

BART Tokenizer tokenises same word differently?

I have noticed that if I tokenize a full text with many sentences, I sometimes get a different number of tokens than if I tokenise each sentence individually and add up the tokens. I have done some debugging and have this small reproducible example to show the issue

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')

print(tokenizer.tokenize("Thames is a river"))
print(tokenizer.tokenize("We are in London. Thames is a river"))

I get the following output

['Th', 'ames', 'Ä is', 'Ä a', 'Ä river']
['We', 'Ä are', 'Ä in', 'Ä London', '.', 'Ä Thames', 'Ä is', 'Ä a', 'Ä river']

I would like to understand why the word Thames has been split into two tokens when it’s at the start of sequence, whereas it’s a single word if it’s not at the start of sequence. I have noticed this behaviour is very frequent and, assuming it’s not a bug, I would like to understand why the BART tokeniser behaves like this.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Many thanks

>Solution :

According to https://github.com/huggingface/transformers/blob/main/src/transformers/models/bart/tokenization_bart.py:

This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not. You can get around that behavior by passing add_prefix_space=True when instantiating this tokenizer or when you call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.

Trying

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn', add_prefix_space=True)

print(tokenizer.tokenize("Thames is a river"))
print(tokenizer.tokenize("We are in London. Thames is a river"))

yields the ‘correct’ result to me.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading