
How to apply tf.keras.preprocessing.text.Tokenizer on tf.data.TextLineDataset?

I am loading a TextLineDataset and I want to apply a tokenizer trained on a file:

import tensorflow as tf

data = tf.data.TextLineDataset(filename)

MAX_WORDS = 20000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts([x.numpy().decode('utf-8') for x in data])

Now I want to apply this tokenizer on data so that each word is replaced with its encoded value. I have tried data.map(lambda x: tokenizer.texts_to_sequences(x)) which gives OperatorNotAllowedInGraphError: iterating over tf.Tensor is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.

Following that suggestion, I rewrote the code as:


@tf.function
def fun(x):
    return tokenizer.texts_to_sequences(x)
data.map(lambda x: fun(x))

I get: OperatorNotAllowedInGraphError: iterating over tf.Tensor is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.

So how to do the tokenization on data?

Solution:

The problem is that tf.keras.preprocessing.text.Tokenizer is not meant to be used in graph mode. Check the docs: both fit_on_texts and texts_to_sequences expect lists of Python strings, not tensors. I would recommend using tf.keras.layers.TextVectorization, but if you really want to use the Tokenizer approach, try something like this:

import tensorflow as tf
import numpy as np

with open('data.txt', 'w') as f:
  f.write('this is a very important sentence \n')
  f.write('where is my cat actually?\n')
  f.write('fish are everywhere!\n')

dataset = tf.data.TextLineDataset(['/content/data.txt'])

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts([n.numpy().decode("utf-8") for n in dataset])

def tokenize(x):
  return tokenizer.texts_to_sequences([x.numpy().decode("utf-8")])

dataset = dataset.map(lambda x: tf.py_function(tokenize, [x], Tout=[tf.int32])[0])

for d in dataset:
  print(d)
tf.Tensor([2 1 3 4 5 6], shape=(6,), dtype=int32)
tf.Tensor([ 7  1  8  9 10], shape=(5,), dtype=int32)
tf.Tensor([11 12 13], shape=(3,), dtype=int32)
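Since texts_to_sequences yields sequences of different lengths, you usually need to pad them before batching for a model. A minimal sketch using padded_batch (the sample sequences, batch size, and padding value here are my own illustrative choices, not from the answer above):

```python
import tensorflow as tf

# Hypothetical variable-length sequences, like those the tokenizer produces
sequences = [[2, 1, 3, 4, 5, 6], [7, 1, 8, 9, 10], [11, 12, 13]]

dataset = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32))

# padded_batch pads every sequence in a batch to the longest one in that batch
# (default padding value for integer tensors is 0)
batched = dataset.padded_batch(3)

for batch in batched:
    print(batch.shape)  # each batch is a dense (batch_size, max_len) tensor
```

With a batch size of 3, all three sequences land in one batch padded to length 6.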

Using the TextVectorization layer would look something like this:

with open('data.txt', 'w') as f:
  f.write('this is a very important sentence \n')
  f.write('where is my cat actually?\n')
  f.write('fish are everywhere!\n')

dataset = tf.data.TextLineDataset(['/content/data.txt'])

vectorize_layer = tf.keras.layers.TextVectorization(output_mode='int')
vectorize_layer.adapt(dataset)

dataset = dataset.map(vectorize_layer)
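To mirror the num_words=MAX_WORDS cap from the original Tokenizer, TextVectorization takes a max_tokens argument. A sketch using in-memory sentences instead of a file (the sentences and the 20000 limit are assumptions carried over from the question):

```python
import tensorflow as tf

# Same sample sentences as above, fed directly instead of via a file
lines = ['this is a very important sentence',
         'where is my cat actually?',
         'fish are everywhere!']
dataset = tf.data.Dataset.from_tensor_slices(lines)

# max_tokens caps the vocabulary size, analogous to Tokenizer's num_words
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=20000, output_mode='int')
vectorize_layer.adapt(dataset)

# Mapping the layer runs entirely in graph mode, no py_function needed
dataset = dataset.map(vectorize_layer)

for d in dataset:
    print(d.numpy())  # one integer id per token
```

Because TextVectorization is a proper Keras layer, it can also be placed inside the model itself, which keeps preprocessing and inference consistent.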