[Feature request] Support tokenizers with binary models #197

Open
fozziethebeat opened this issue Jul 17, 2023 · 3 comments
Labels: enhancement (New feature or request)

Comments

@fozziethebeat

Binary Tokenizers
In general, the feature you want added should be supported by HuggingFace's transformers library:

  • If requesting a model, it must be listed here.
  • If requesting a pipeline, it must be listed here.
  • If requesting a task, it must be listed here.

Some HuggingFace tokenizer models use a binary tokenizer file instead of a JSON configuration, notably the new OpenLLaMA models.

The current loadTokenizer function only checks for JSON configurations:

async function loadTokenizer(pretrained_model_name_or_path, options) {
    // Both files are fetched with getModelJSON, so only JSON tokenizer
    // configurations are ever considered; a binary `tokenizer.model` is never read.
    let info = await Promise.all([
        getModelJSON(pretrained_model_name_or_path, 'tokenizer.json', true, options),
        getModelJSON(pretrained_model_name_or_path, 'tokenizer_config.json', true, options),
    ]);
    return info;
}

In order to support binary tokenizer files, this loading logic probably needs to be generalized.
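
One possible direction, purely as a sketch of the requested generalization (not existing transformers.js behaviour): fall back to the binary file when tokenizer.json cannot be fetched. Here `getModelFile` and `loadSentencePieceModel` are hypothetical helpers, and the sketch assumes that a fetch with the fatal flag set throws when the file is missing:

async function loadTokenizer(pretrained_model_name_or_path, options) {
    // Fetch the tokenizer config as before.
    let tokenizerConfig = await getModelJSON(pretrained_model_name_or_path, 'tokenizer_config.json', true, options);

    let tokenizerJSON;
    try {
        // Try the JSON tokenizer first (the current behaviour).
        tokenizerJSON = await getModelJSON(pretrained_model_name_or_path, 'tokenizer.json', true, options);
    } catch (e) {
        // Fall back to the binary SentencePiece file shipped by e.g. the OpenLLaMA repos.
        // `getModelFile` (fetches raw bytes) and `loadSentencePieceModel` (parses the
        // protobuf into the same structure as tokenizer.json) are hypothetical helpers,
        // shown only to illustrate the shape of the change.
        let buffer = await getModelFile(pretrained_model_name_or_path, 'tokenizer.model', true, options);
        tokenizerJSON = loadSentencePieceModel(buffer);
    }

    return [tokenizerJSON, tokenizerConfig];
}

The hard part is the parsing step: tokenizer.model is a serialized SentencePiece protobuf, which is why it is normally read by the sentencepiece library rather than parsed by hand.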

Reason for request
Why is it important that we add this feature? What is your intended use case? Remember, we are more likely to add support for models/pipelines/tasks that are popular (e.g., many downloads), or contain functionality that does not exist (e.g., new input type).

This is required for doing JavaScript tokenization with OpenLLaMA models.

Additional context
Add any other context or screenshots about the feature request here.

fozziethebeat added the enhancement label on Jul 17, 2023
@xenova
Collaborator

xenova commented Jul 17, 2023

I'd recommend exporting from Python with .save_pretrained(...), which should also save the tokenizer.json file (needed by transformers.js). Do you know what the structure of the .model file is? If it's not too difficult, it might be worth adding... but usually those models are read by some other library like sentencepiece.

@fozziethebeat
Author

Downloading and re-uploading is probably the easiest user solution for now.

The ideal scenario is being able to use one of these base models' tokenizers as is, but I suspect it'll take some work for that to happen.

Given the complexity, this might be out of scope, but I wanted to make the feature request just so that there's at least a discussion and others can find it if they run into the same issue. I'm not sure how the LLamaTokenizer works, but I'm just going to assume it's a SentencePiece model binary.

@xenova
Collaborator

xenova commented Jul 17, 2023

If the open llama models use the same tokenizer as the original llama models, you can reuse the tokenizer as follows:

import { AutoTokenizer } from 'https://cdn.jsdelivr.net/npm/@xenova/[email protected]';

let tokenizer = await AutoTokenizer.from_pretrained('hf-internal-testing/llama-tokenizer');
let model_inputs = await tokenizer("Hello there");
console.log(model_inputs);
// {
//   attention_mask: Tensor {
//     data: BigInt64Array(3) [1n, 1n, 1n],
//     dims: [1, 3],
//     size: 3,
//     type: "int64",
//   },
//   input_ids: Tensor {
//     data: BigInt64Array(3) [1n, 15043n, 727n],
//     dims: [1, 3],
//     size: 3,
//     type: "int64",
//   },
// }

> Given the complexity, this might be out of scope, but I wanted to make the feature request just so that there's at least a discussion and others can find it if they run into the same issue. I'm not sure how the LLamaTokenizer works, but I'm just going to assume it's a SentencePiece model binary.

Since I have already implemented the LLamaTokenizer, and the OpenLLaMA models you linked above do seem to use the same tokenizer structure ("tokenizer_class": "LlamaTokenizer"), I most likely won't support the .model files, meaning you can either:

  1. reuse an existing llama tokenizer (if it does in fact match; looking at the tokenizer_config.json files, they look quite similar)
  2. use .save_pretrained() to output the tokenizer.json file (see the sketch below for loading the result in transformers.js)
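
For option 2, once the exported folder (with its tokenizer.json and tokenizer_config.json) has been re-uploaded to the Hub, it can be loaded like any other repo. The repo name below is a hypothetical placeholder, not a real upload:

import { AutoTokenizer } from '@xenova/transformers';

// 'your-username/open-llama-tokenizer' is a hypothetical Hub repo containing the
// files written by .save_pretrained(); substitute your own upload.
let tokenizer = await AutoTokenizer.from_pretrained('your-username/open-llama-tokenizer');
let model_inputs = await tokenizer("Hello there");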
