How to export Tokenizer-s? #2015

Open
fdtomasi opened this issue Dec 9, 2024 · 4 comments

Comments


fdtomasi commented Dec 9, 2024

I am encountering issues when exporting text tokenizers to be served with TF-Serving as part of a tf.Graph.

To Reproduce

import tensorflow as tf
import keras
from keras_nlp.models import GPT2CausalLMPreprocessor

tokenizer = GPT2CausalLMPreprocessor.from_preset("gpt2_base_en")
tokenizer.build(None)

export_archive = keras.export.ExportArchive()
export_archive.track(tokenizer)
export_archive.add_endpoint(
    name="generate",
    fn=lambda x: tokenizer(x)[0],
    input_signature=[
        tf.TensorSpec(shape=[None], dtype=tf.string, name="inputs")
    ],
)
export_archive.write_out("test/export")

This should not return errors, but I get the following:

AssertionError: Tried to export a function which references an 'untracked' resource. TensorFlow objects (e.g. tf.Variable) captured by functions must be 'tracked' by assigning them to an attribute of a tracked object or assigned to an attribute of the main object directly. See the information below:
	Function name = b'__inference_signature_wrapper_<lambda>_56462'
	Captured Tensor = <ResourceHandle(name="table_49686", device="/job:localhost/replica:0/task:0/device:CPU:0", container="localhost", type="tensorflow::lookup::LookupInterface", dtype and shapes : "[  ]")>
	Trackable referencing this tensor = <tensorflow.python.ops.lookup_ops.MutableHashTable object at 0x7f87081371f0>
	Internal Tensor = Tensor("56442:0", shape=(), dtype=resource)

I am explicitly tracking the tokenizer because, according to https://keras.io/api/models/model_saving_apis/export/#track-method, this seems to be required when using lookup tables, but apparently it is not enough.
I am using keras_hub == 0.17.0, keras == 3.7.0, tensorflow == 2.18.0.
Thanks!

@mehtamansi29 (Collaborator) commented

Hi @fdtomasi -

You cannot directly export GPT2CausalLMPreprocessor tokenizers.
Tokenizers can be serialized with Python pickling like this:

import pickle
with open("tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)
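
Reloading the pickle is the mirror image. A minimal sketch (note that pickling keeps you inside Python and ties you to the same keras_hub version, so it is not a TF-Serving artifact):

import pickle

with open("tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)  # restores the same preprocessor object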

Or create a custom Keras model that wraps the tokenizer and then save it like this:

class TokenizerModel(keras.Model):
    def __init__(self, tokenizer):
        super(TokenizerModel, self).__init__()
        self.tokenizer = tokenizer
    def call(self, inputs):
        encoded = self.tokenizer(inputs)
        return encoded[0]

tokenizer_model = TokenizerModel(tokenizer)
tf.saved_model.save(tokenizer_model, "tokenizer_model")
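
To sanity-check what was saved, a minimal reload sketch; the loaded object is only directly callable if a matching concrete function was traced before saving, so inspecting loaded.signatures first is the safer route:

import tensorflow as tf

# Reload the SavedModel written above and list the serving functions it captured.
loaded = tf.saved_model.load("tokenizer_model")
print(list(loaded.signatures.keys()))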

Attached gist for your reference here.

@fdtomasi (Author) commented

Hi @mehtamansi29 thank you for looking into this.

Yes, I am aware we can save the tokenizers using tf.saved_model.save directly, and even the following

tokenizer = GPT2CausalLMPreprocessor.from_preset("gpt2_base_en")
tf.saved_model.save(tokenizer, "tokenizer_model")

works for me.

However, even after wrapping the tokenizer in a keras.Model, export and ExportArchive do not work for me:

from keras import ops


class TokenizerModel(keras.Model):
    def __init__(self, tokenizer):
        super(TokenizerModel, self).__init__()
        self.tokenizer = tokenizer

    def call(self, inputs):
        encoded = self.tokenizer(inputs)
        return encoded[0]


tokenizer_model = TokenizerModel(tokenizer)

# Build the model
tokenizer_model("text")
tokenizer_model(ops.convert_to_tensor(["test", "test1"]))
tokenizer_model.export("test/export")

which returns the same error. Since I am trying to use the tokenizer as part of a TF-Serving serving function, I think I need to use the export functionality? What would be the best way of serializing the tokenizers?

@mehtamansi29 (Collaborator) commented

Hi @fdtomasi -

To export text tokenizers to be served with TF-Serving, you can use a TextVectorization layer with the desired parameters, build a vocabulary by adapting it on some dummy data, wrap it with the TokenizerModel class from above, and then build the TokenizerModel before exporting it.

import tensorflow as tf
import keras
from keras.layers import TextVectorization

# Build a TextVectorization layer and adapt it on some (dummy) data.
text_vectorization = TextVectorization(
    max_tokens=10000, output_mode="int", output_sequence_length=100
)
dummy_input_data = tf.data.Dataset.from_tensor_slices(
    ["sample text", "another example", "more text data"]
)
text_vectorization.adapt(dummy_input_data)


class TokenizerModel(keras.Model):
    def __init__(self, text_vectorization):
        super(TokenizerModel, self).__init__()
        self.text_vectorization = text_vectorization

    def call(self, inputs):
        return self.text_vectorization(inputs)


def tokenization_func(inputs):
    return text_vectorization(inputs)


tokenizer_model = TokenizerModel(text_vectorization)
tokenizer_model.build(input_shape=(None,))

text_data = ["This is a new sentence", "Another example of text"]
vectorized_data = text_vectorization(text_data)

export_archive = keras.export.ExportArchive()
export_archive.track(tokenizer_model.text_vectorization)
export_archive.add_endpoint(
    name="tokenizer",
    fn=tokenization_func,
    input_signature=[
        tf.TensorSpec(shape=(None,), dtype=tf.string),
    ],
)
export_archive.write_out("/content")
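
A minimal sketch of consuming the exported archive (using the "/content" directory from above); endpoints registered with add_endpoint become callable attributes on the reloaded artifact:

import tensorflow as tf

# Reload the archive written by write_out() and call the "tokenizer" endpoint.
reloaded = tf.saved_model.load("/content")
print(reloaded.tokenizer(tf.constant(["This is a new sentence"])))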

Attached gist for reference, which shows how to export the tokenizer and an example of loading the exported tokenizer.

@fdtomasi (Author) commented

I do not think this suggestion works in general. When exporting a tokenizer we should ensure that it is the same tokenizer that was used during training, with the same behaviour. Even when using a TextVectorization layer with the same vocabulary, the transformations applied to the text may not match the original tokenization, leading to different token IDs and hence a different final result.

The only issue with exporting the original GPT2CausalLMPreprocessor is the untracked ResourceHandle held by the BytePairTokenizerCache class, so is there a more general way to manually track such a resource so that it gets exported correctly? I have been trying to add the resource_handle to the main object, but to no avail. Could it be that the BytePairTokenizerCache is not properly marked as exportable (and the tensors within are not tracked)?
https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/tokenizers/byte_pair_tokenizer.py#L138
Even the option of signalling Keras to disregard the original cache would be enough, since the cache does not need to be serialized in theory. However, the export is still trying to find the original ResourceHandle object.
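
For reference, this is roughly what that manual-tracking attempt looks like. It is only a sketch: the attribute path preprocessor.tokenizer.cache.id2value is an assumption about the BytePairTokenizerCache internals (from a reading of byte_pair_tokenizer.py) and may differ between keras_hub versions, and as described above it does not make the untracked-resource error go away:

import keras
from keras_nlp.models import GPT2CausalLMPreprocessor

class TrackedTokenizerModel(keras.Model):
    def __init__(self, preprocessor):
        super().__init__()
        self.preprocessor = preprocessor
        # Hypothetical attribute path: expose the BPE cache's MutableHashTable
        # as a direct attribute so TF object tracking can find it.
        self.bpe_cache_table = preprocessor.tokenizer.cache.id2value

    def call(self, inputs):
        return self.preprocessor(inputs)[0]

preprocessor = GPT2CausalLMPreprocessor.from_preset("gpt2_base_en")
preprocessor.build(None)
tracked = TrackedTokenizerModel(preprocessor)
# As reported above, this still raises the same untracked-resource AssertionError.
tracked.export("test/export")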
