[Feature request] Support tokenizers with binary models #197

Open
fozziethebeat opened this issue Jul 17, 2023 · 3 comments
Labels: enhancement (New feature or request)

Comments

@fozziethebeat

Binary Tokenizers
In general, the feature you want added should be supported by HuggingFace's transformers library:

  • If requesting a model, it must be listed here.
  • If requesting a pipeline, it must be listed here.
  • If requesting a task, it must be listed here.

Some HuggingFace tokenizer models use a binary tokenizer file instead of a JSON configuration, notably the new OpenLLaMA models.

The current loadTokenizer function only checks for JSON configurations:

async function loadTokenizer(pretrained_model_name_or_path, options) {
    // Both files are fetched with getModelJSON, so only JSON tokenizer
    // configurations are ever considered; a binary `tokenizer.model` is never read.
    let info = await Promise.all([
        getModelJSON(pretrained_model_name_or_path, 'tokenizer.json', true, options),
        getModelJSON(pretrained_model_name_or_path, 'tokenizer_config.json', true, options),
    ]);
    return info;
}

In order to support binary tokenizer files, this loading logic probably needs to be generalized.
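
One possible direction, purely as a sketch of the requested generalization (not existing transformers.js behaviour): fall back to the binary file when tokenizer.json cannot be fetched. Here `getModelFile` and `loadSentencePieceModel` are hypothetical helpers, and the sketch assumes that a fetch with the fatal flag set throws when the file is missing:

async function loadTokenizer(pretrained_model_name_or_path, options) {
    // Fetch the tokenizer config as before.
    let tokenizerConfig = await getModelJSON(pretrained_model_name_or_path, 'tokenizer_config.json', true, options);

    let tokenizerJSON;
    try {
        // Try the JSON tokenizer first (the current behaviour).
        tokenizerJSON = await getModelJSON(pretrained_model_name_or_path, 'tokenizer.json', true, options);
    } catch (e) {
        // Fall back to the binary SentencePiece file shipped by e.g. the OpenLLaMA repos.
        // `getModelFile` (fetches raw bytes) and `loadSentencePieceModel` (parses the
        // protobuf into the same structure as tokenizer.json) are hypothetical helpers,
        // shown only to illustrate the shape of the change.
        let buffer = await getModelFile(pretrained_model_name_or_path, 'tokenizer.model', true, options);
        tokenizerJSON = loadSentencePieceModel(buffer);
    }

    return [tokenizerJSON, tokenizerConfig];
}

The hard part is the parsing step: tokenizer.model is a serialized SentencePiece protobuf, which is why it is normally read by the sentencepiece library rather than parsed by hand.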

Reason for request
Why is it important that we add this feature? What is your intended use case? Remember, we are more likely to add support for models/pipelines/tasks that are popular (e.g., many downloads), or contain functionality that does not exist (e.g., new input type).

This is required for doing JavaScript tokenization with OpenLLaMA models.

Additional context
Add any other context or screenshots about the feature request here.

fozziethebeat added the enhancement label on Jul 17, 2023
@xenova
Collaborator

xenova commented Jul 17, 2023

I'd recommend exporting from Python with .save_pretrained(...), which should also save the tokenizer.json file (needed by transformers.js). Do you know what the structure of the .model file is? If it's not too difficult, it might be worth adding... but usually those models are read by some other library like sentencepiece.

@fozziethebeat
Author

Downloading and re-uploading is probably the easiest user solution for now.

The ideal scenario is being able to use one of these base models' tokenizers as is, but I suspect it'll take some work for that to happen.

Given the complexity, this might be out of scope, but I wanted to make the feature request just so that there's at least a discussion and others can find it if they run into the same issue. I'm not sure how the LLamaTokenizer works, but I'm just going to assume it's a SentencePiece model binary.

@xenova
Collaborator

xenova commented Jul 17, 2023

If the open llama models use the same tokenizer as the original llama models, you can reuse the tokenizer as follows:

import { AutoTokenizer } from 'https://cdn.jsdelivr.net/npm/@xenova/[email protected]';

let tokenizer = await AutoTokenizer.from_pretrained('hf-internal-testing/llama-tokenizer');
let model_inputs = await tokenizer("Hello there");
console.log(model_inputs);
// {
//   attention_mask: Tensor {
//     data: BigInt64Array(3) [1n, 1n, 1n],
//     dims: [1, 3],
//     size: 3,
//     type: "int64",
//   },
//   input_ids: Tensor {
//     data: BigInt64Array(3) [1n, 15043n, 727n],
//     dims: [1, 3],
//     size: 3,
//     type: "int64",
//   },
// }

> Given the complexity, this might be out of scope, but I wanted to make the feature request just so that there's at least a discussion and others can find it if they run into the same issue. I'm not sure how the LLamaTokenizer works, but I'm just going to assume it's a SentencePiece model binary.

Since I have already implemented the LLamaTokenizer, and the OpenLLaMA models you linked above do seem to use the same tokenizer structure ("tokenizer_class": "LlamaTokenizer"), I most likely won't support the .model files, meaning you can either:

  1. reuse an existing llama tokenizer (if it does in fact match; looking at the tokenizer_config.json files, they look quite similar)
  2. use .save_pretrained() to output the tokenizer.json file (see the sketch below for loading the result in transformers.js)
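
For option 2, once the exported folder (with its tokenizer.json and tokenizer_config.json) has been re-uploaded to the Hub, it can be loaded like any other repo. The repo name below is a hypothetical placeholder, not a real upload:

import { AutoTokenizer } from '@xenova/transformers';

// 'your-username/open-llama-tokenizer' is a hypothetical Hub repo containing the
// files written by .save_pretrained(); substitute your own upload.
let tokenizer = await AutoTokenizer.from_pretrained('your-username/open-llama-tokenizer');
let model_inputs = await tokenizer("Hello there");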
