How to export Tokenizer-s? #2015

Open
fdtomasi opened this issue Dec 9, 2024 · 4 comments

Comments


fdtomasi commented Dec 9, 2024

I am encountering issues when exporting text tokenizers to be served with TF-Serving as part of a tf.Graph.

To Reproduce

import tensorflow as tf
import keras
from keras_nlp.models import GPT2CausalLMPreprocessor

tokenizer = GPT2CausalLMPreprocessor.from_preset("gpt2_base_en")
tokenizer.build(None)

export_archive = keras.export.ExportArchive()
export_archive.track(tokenizer)
export_archive.add_endpoint(
    name="generate",
    fn=lambda x: tokenizer(x)[0],
    input_signature=[
        tf.TensorSpec(shape=[None], dtype=tf.string, name="inputs")
    ],
)
export_archive.write_out("test/export")

This should not return errors, but I get the following:

AssertionError: Tried to export a function which references an 'untracked' resource. TensorFlow objects (e.g. tf.Variable) captured by functions must be 'tracked' by assigning them to an attribute of a tracked object or assigned to an attribute of the main object directly. See the information below:
	Function name = b'__inference_signature_wrapper_<lambda>_56462'
	Captured Tensor = <ResourceHandle(name="table_49686", device="/job:localhost/replica:0/task:0/device:CPU:0", container="localhost", type="tensorflow::lookup::LookupInterface", dtype and shapes : "[  ]")>
	Trackable referencing this tensor = <tensorflow.python.ops.lookup_ops.MutableHashTable object at 0x7f87081371f0>
	Internal Tensor = Tensor("56442:0", shape=(), dtype=resource)

I am explicitly tracking the tokenizer because, according to https://keras.io/api/models/model_saving_apis/export/#track-method, this seems to be required when using lookup tables, but apparently it is not enough.
I am using keras_hub == 0.17.0, keras == 3.7.0, tensorflow == 2.18.0.
Thanks!

@mehtamansi29 (Collaborator) commented

Hi @fdtomasi -

You cannot directly export GPT2CausalLMPreprocessor tokenizers.
Tokenizers can be serialized with Python pickling like this:

import pickle
with open("tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)
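
Reloading the pickle is the mirror image. A minimal sketch (note that pickling keeps you inside Python and ties you to the same keras_hub version, so it is not a TF-Serving artifact):

import pickle

with open("tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)  # restores the same preprocessor object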

Or create a custom Keras model that wraps the tokenizer and then save it like this:

class TokenizerModel(keras.Model):
    def __init__(self, tokenizer):
        super(TokenizerModel, self).__init__()
        self.tokenizer = tokenizer
    def call(self, inputs):
        encoded = self.tokenizer(inputs)
        return encoded[0]

tokenizer_model = TokenizerModel(tokenizer)
tf.saved_model.save(tokenizer_model, "tokenizer_model")
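
To sanity-check what was saved, a minimal reload sketch; the loaded object is only directly callable if a matching concrete function was traced before saving, so inspecting loaded.signatures first is the safer route:

import tensorflow as tf

# Reload the SavedModel written above and list the serving functions it captured.
loaded = tf.saved_model.load("tokenizer_model")
print(list(loaded.signatures.keys()))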

Attached gist for your reference here.

@fdtomasi (Author) commented

Hi @mehtamansi29 thank you for looking into this.

Yes, I am aware we can save the tokenizers using tf.saved_model.save directly, and even the following

tokenizer = GPT2CausalLMPreprocessor.from_preset("gpt2_base_en")
tf.saved_model.save(tokenizer, "tokenizer_model")

works for me.

However, even after wrapping the tokenizer in a keras.Model, export and ExportArchive do not work for me:

from keras import ops


class TokenizerModel(keras.Model):
    def __init__(self, tokenizer):
        super(TokenizerModel, self).__init__()
        self.tokenizer = tokenizer

    def call(self, inputs):
        encoded = self.tokenizer(inputs)
        return encoded[0]


tokenizer_model = TokenizerModel(tokenizer)

# Build the model
tokenizer_model("text")
tokenizer_model(ops.convert_to_tensor(["test", "test1"]))
tokenizer_model.export("test/export")

which returns the same error. Since I am trying to use the tokenizer as part of a TF-Serving serving function, I think I need to use the export functionality? What would be the best way of serializing the tokenizers?

@mehtamansi29 (Collaborator) commented

Hi @fdtomasi -

To export text tokenizers to be served with TF-Serving, you can use a TextVectorization layer with the desired parameters, build a vocabulary by adapting it on some dummy data, wrap it with the TokenizerModel class from above, and then build the TokenizerModel before exporting it.

import tensorflow as tf
import keras
from keras.layers import TextVectorization

# Build a TextVectorization layer and adapt it on some (dummy) data.
text_vectorization = TextVectorization(
    max_tokens=10000, output_mode="int", output_sequence_length=100
)
dummy_input_data = tf.data.Dataset.from_tensor_slices(
    ["sample text", "another example", "more text data"]
)
text_vectorization.adapt(dummy_input_data)


class TokenizerModel(keras.Model):
    def __init__(self, text_vectorization):
        super(TokenizerModel, self).__init__()
        self.text_vectorization = text_vectorization

    def call(self, inputs):
        return self.text_vectorization(inputs)


def tokenization_func(inputs):
    return text_vectorization(inputs)


tokenizer_model = TokenizerModel(text_vectorization)
tokenizer_model.build(input_shape=(None,))

text_data = ["This is a new sentence", "Another example of text"]
vectorized_data = text_vectorization(text_data)

export_archive = keras.export.ExportArchive()
export_archive.track(tokenizer_model.text_vectorization)
export_archive.add_endpoint(
    name="tokenizer",
    fn=tokenization_func,
    input_signature=[
        tf.TensorSpec(shape=(None,), dtype=tf.string),
    ],
)
export_archive.write_out("/content")
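
A minimal sketch of consuming the exported archive (using the "/content" directory from above); endpoints registered with add_endpoint become callable attributes on the reloaded artifact:

import tensorflow as tf

# Reload the archive written by write_out() and call the "tokenizer" endpoint.
reloaded = tf.saved_model.load("/content")
print(reloaded.tokenizer(tf.constant(["This is a new sentence"])))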

Attached gist for reference, which shows how to export the tokenizer and an example of loading the exported tokenizer.

@fdtomasi (Author) commented

I do not think this suggestion works in general. When exporting a tokenizer we should ensure that it is the same tokenizer that was used during training, with the same behaviour. Even when using a TextVectorization layer with the same vocabulary, the transformations applied to the text may not match the original tokenization, leading to different token IDs and hence a different final result.

The only issue with exporting the original GPT2CausalLMPreprocessor is the untracked ResourceHandle held by the BytePairTokenizerCache class, so is there a more general way to manually track such a resource so that it gets exported correctly? I have been trying to add the resource_handle to the main object, but to no avail. Could it be that the BytePairTokenizerCache is not properly marked as exportable (and the tensors within are not tracked)?
https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/tokenizers/byte_pair_tokenizer.py#L138
Even the option of signalling Keras to disregard the original cache would be enough, since the cache does not need to be serialized in theory. However, the export is still trying to find the original ResourceHandle object.
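
For reference, this is roughly what that manual-tracking attempt looks like. It is only a sketch: the attribute path preprocessor.tokenizer.cache.id2value is an assumption about the BytePairTokenizerCache internals (from a reading of byte_pair_tokenizer.py) and may differ between keras_hub versions, and as described above it does not make the untracked-resource error go away:

import keras
from keras_nlp.models import GPT2CausalLMPreprocessor

class TrackedTokenizerModel(keras.Model):
    def __init__(self, preprocessor):
        super().__init__()
        self.preprocessor = preprocessor
        # Hypothetical attribute path: expose the BPE cache's MutableHashTable
        # as a direct attribute so TF object tracking can find it.
        self.bpe_cache_table = preprocessor.tokenizer.cache.id2value

    def call(self, inputs):
        return self.preprocessor(inputs)[0]

preprocessor = GPT2CausalLMPreprocessor.from_preset("gpt2_base_en")
preprocessor.build(None)
tracked = TrackedTokenizerModel(preprocessor)
# As reported above, this still raises the same untracked-resource AssertionError.
tracked.export("test/export")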
