Expose Encoder in TiktokenTokenizer #7313

Open
razshare opened this issue Nov 15, 2024 · 9 comments · May be fixed by #7314

Comments

@razshare

Hello, first of all, thank you very much for this project!

Is your feature request related to a problem? Please describe.
Yes, it is.
Some of our clients may have outdated encodings on their client application.
We still want our clients to have access to new encodings even if their client application is not up to date, hence we want to serve the encoder dictionary from a server endpoint.

The problem is that, currently, the Encoder property in TiktokenTokenizer is internal.

/// <summary>
/// Gets the dictionary mapping token bytes to Ids.
/// </summary>
internal IReadOnlyDictionary<ReadOnlyMemory<byte>, int> Encoder => _encoder;

Describe the solution you'd like
I would like to expose this Encoder property.
There seems to be the intent to expose this property at some point in the future.

// We are not exposing the Encoder, Decoder, or Vocabulary so far. For now, use reflection to test it.
private static IReadOnlyDictionary<ReadOnlyMemory<byte>, int>? GetEncoder(TiktokenTokenizer tiktoken)
    => typeof(TiktokenTokenizer).GetProperty("Encoder", BindingFlags.Instance | BindingFlags.NonPublic)?.GetValue(tiktoken) as IReadOnlyDictionary<ReadOnlyMemory<byte>, int>;

private static IReadOnlyDictionary<int, ReadOnlyMemory<byte>>? GetDecoder(TiktokenTokenizer tiktoken)
    => typeof(TiktokenTokenizer).GetProperty("Decoder", BindingFlags.Instance | BindingFlags.NonPublic)?.GetValue(tiktoken) as IReadOnlyDictionary<int, ReadOnlyMemory<byte>>;

private static IReadOnlyDictionary<string, int>? GetVocabulary(TiktokenTokenizer tiktoken)
    => typeof(TiktokenTokenizer).GetProperty("Vocabulary", BindingFlags.Instance | BindingFlags.NonPublic)?.GetValue(tiktoken) as IReadOnlyDictionary<string, int>;

Maybe now is the time to do it; what do you think?

Describe alternatives you've considered
Maybe a separate method that does exactly what the test above does via reflection.
That sounds like overkill and a lot of overhead, though.
Exposing the property is probably the best way to deal with this.

Additional context
I'm sending a PR your way with the changes; feel free to ask for or make any modifications you think are necessary.

@razshare razshare added the enhancement New feature or request label Nov 15, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged label Nov 15, 2024
@razshare razshare linked a pull request Nov 15, 2024 that will close this issue
@tarekgh
Member

tarekgh commented Nov 18, 2024

@razshare could you please elaborate more on why you need to expose it and how you are planning to use it? I read the description, but more details about your scenario would help here.

@tarekgh tarekgh added Tokenizers needs-author-action and removed untriaged New issue has not been triaged labels Nov 18, 2024
@tarekgh tarekgh added this to the ML.NET Future milestone Nov 18, 2024
@razshare
Author

razshare commented Nov 19, 2024

@tarekgh of course.
I'll come back to you with a more in-depth explanation and possibly some drawings/schemas to make it easier to understand.

@razshare
Author

razshare commented Nov 20, 2024

Hello @tarekgh, as promised, here's a more in-depth explanation.


LLM-based client applications often need to be able to count the number of tokens in a given string.
There can be multiple reasons.

  1. Sometimes it's necessary to limit the context sent to the LLM in order to reduce costs.
  2. Other times, when dealing with a server, it is necessary to avoid being rate limited, so counting the tokens before sending them can help with that.

These are two examples I'm actively dealing with at the moment; I imagine there are other reasons too, which I have yet to encounter.
The point is that counting tokens client-side is useful.
Most of the time the client application can simply use the TiktokenTokenizer class itself.

In most cases, for the client to be able to count how many tokens a given string actually contains, it would simply invoke

var tokenizer = TiktokenTokenizer.CreateForModel(modelId);

to create a tokenizer, and then count the tokens in a string with

var numberOfTokens = tokenizer.CountTokens(myInputString);

However, due to constraints outside our control, the client application sometimes cannot stay fully up to date with the latest changes in TiktokenTokenizer.

In my case, I need to offer backward compatibility on the LLM side of things.

Some clients are not able to update their client application, which means their version of TiktokenTokenizer becomes outdated pretty fast, given the pace at which new models come out.

If a client application has an out-of-date TiktokenTokenizer, it should still be able to interact with new models and count tokens locally by simply changing the model id in a configuration panel.

So with that in mind, there are some cases in which the first invocation will fail

var tokenizer = TiktokenTokenizer.CreateForModel("my-new-fancy-model-from-year-2030");

because the application, as built at the time, would not contain encodings for model my-new-fancy-model-from-year-2030.

All is not lost though, because TiktokenTokenizer.Create exists.

public static TiktokenTokenizer Create(
    Stream vocabStream,
    PreTokenizer? preTokenizer,
    Normalizer? normalizer,
    IReadOnlyDictionary<string, int>? specialTokens = null,
    int cacheSize = LruCache<int[]>.DefaultCacheSize)
        => new TiktokenTokenizer(vocabStream, preTokenizer, specialTokens, normalizer, cacheSize);

This method is agnostic to the model name/id; it just takes the raw encoder data as a stream.

This opens the door to a solution in which the server plays a role in solving this backward compatibility issue.

[Diagram: when the client doesn't recognize a model name/id, it falls back to the server, which serves that model's encodings.]

When the client is out of date and unaware of a specific model name/id, we fall back to the server, retrieve the encodings of said model, and finally create a new tokenizer directly from those encodings.
We do this with something like the following

var buffer = Encoding.UTF8.GetBytes(base64Encodings);
var stream = new MemoryStream(buffer);
var tokenizer = TiktokenTokenizer.Create(stream, preTokenizer: null, normalizer: null);

Note

The base64Encodings variable holds the raw encoder contents obtained from the server, serialized in the text format TiktokenTokenizer.Create expects (each line a base64-encoded token followed by its rank).

After that, the client can proceed to count the tokens as usual.
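
Putting it all together, the client-side flow looks roughly like this. This is only a minimal sketch: the /encodings/{modelId} route, the GetTokenizerAsync helper, and the exception type caught for an unknown model id are assumptions specific to our setup, not part of the library.

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.ML.Tokenizers;

static async Task<TiktokenTokenizer> GetTokenizerAsync(string modelId, HttpClient http)
{
    try
    {
        // Fast path: the locally bundled library already knows this model id.
        return TiktokenTokenizer.CreateForModel(modelId);
    }
    catch (NotSupportedException) // assumed failure mode for an unknown model id
    {
        // Fallback: fetch the encodings from the server and build the tokenizer from the stream.
        using var vocabStream = await http.GetStreamAsync($"/encodings/{modelId}");
        return TiktokenTokenizer.Create(vocabStream, preTokenizer: null, normalizer: null);
    }
}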

Note

And of course the client-side application may even cache these encodings locally, so that the next time it encounters a request for that new and shiny model it doesn't have to query the server and can use the cached encodings instead.

Currently, retrieving these raw encodings on the server side can only be done through reflection, as shown in the original test file.

// We are not exposing the Encoder, Decoder, or Vocabulary so far. For now, use reflection to test it.
private static IReadOnlyDictionary<ReadOnlyMemory<byte>, int>? GetEncoder(TiktokenTokenizer tiktoken)
    => typeof(TiktokenTokenizer).GetProperty("Encoder", BindingFlags.Instance | BindingFlags.NonPublic)?.GetValue(tiktoken) as IReadOnlyDictionary<ReadOnlyMemory<byte>, int>;

private static IReadOnlyDictionary<int, ReadOnlyMemory<byte>>? GetDecoder(TiktokenTokenizer tiktoken)
    => typeof(TiktokenTokenizer).GetProperty("Decoder", BindingFlags.Instance | BindingFlags.NonPublic)?.GetValue(tiktoken) as IReadOnlyDictionary<int, ReadOnlyMemory<byte>>;

private static IReadOnlyDictionary<string, int>? GetVocabulary(TiktokenTokenizer tiktoken)
    => typeof(TiktokenTokenizer).GetProperty("Vocabulary", BindingFlags.Instance | BindingFlags.NonPublic)?.GetValue(tiktoken) as IReadOnlyDictionary<string, int>;

This is probably fine for you, the authors.

But we're not the authors, and accessing internal properties this way comes with no guarantee that they won't change one day without a major version bump.
Basically, it's risky for us, as users of the library, to do this kind of thing.

On top of that, it's reflection, so it has a performance impact as well.

Note

Although it's probably very small.

Hence the solution: make the encoder public.
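
Concretely, the proposal is just to widen the visibility of the existing property (the one quoted at the top of this issue), roughly:

/// <summary>
/// Gets the dictionary mapping token bytes to Ids.
/// </summary>
public IReadOnlyDictionary<ReadOnlyMemory<byte>, int> Encoder => _encoder;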

Let me know if this clarifies the reason for this change and what problem it's actually trying to solve, and whether you think there's a better, more ergonomic, more future-proof approach.

@tarekgh
Member

tarekgh commented Nov 21, 2024

@razshare thanks a lot for the details. It is super helpful. One follow-up question: is the server always in control of the source of the tokenizer data? I mean, can the server always create the tokenizer using the stream (instead of calling CreateForModel)? If it can, it will be simpler for the server to just stream the content to the client without any processing (that is, it will avoid getting the encoder, encoding it as UTF-8 base64, and sending it to the client).

By the way, I am not objecting to your proposal; I am just brainstorming how to support the scenario in an efficient way. If we need to go with your proposal, we may think about exposing a tokenizer Create method that can take the Encoder data too. Also, calling the Create method with only the stream will not be enough, as you need to pass the pre-tokenizer and special tokens too.
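
For the Create method that takes the Encoder data, something along these lines could work, mirroring the existing stream-based overload (just an illustrative shape, not a committed design):

public static TiktokenTokenizer Create(
    IReadOnlyDictionary<ReadOnlyMemory<byte>, int> encoder,
    PreTokenizer? preTokenizer,
    Normalizer? normalizer,
    IReadOnlyDictionary<string, int>? specialTokens = null,
    int cacheSize = LruCache<int[]>.DefaultCacheSize);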

@razshare
Author

Hello again @tarekgh !

is the server always in control of the source of the tokenizer data? I mean, can the server always create the tokenizer using the stream (instead of calling CreateForModel)

In my case, the server itself never creates the actual tokenizer instance; there's no invocation of TiktokenTokenizer.CreateForModel() or TiktokenTokenizer.Create() on the server.
Only the client calls TiktokenTokenizer.CreateForModel() with a model id.
If it fails then it tries to retrieve the encodings of said model from the server and then tries to call TiktokenTokenizer.Create() instead.

If you can do that, it will be simpler for the server to just stream the content to the client without any processing

If by content you mean the encoder dictionary, then yes, that's exactly it.
And yes, the processing part (encoding to base64) can also be skipped on the server.

you need to pass the pre-tokenizer and special tokens too.

Yeah, I left those out for the time being to focus on the encoder specifically, also because we haven't wrestled with that part so far; we've just been omitting those parameters for the sake of simplicity, to get the architecture working and solve the backward compatibility issue.

The pre-tokenizer and special tokens will come later; for the moment I'm aiming to let the client successfully create a tokenizer from a remote encoder dictionary.

we may think about exposing tokenizer Create method that allows taking the Encoder data too

Without converting it to a stream?
It's not strictly necessary, but that sounds like a good quality-of-life improvement to me. The .NET touch!

@tarekgh
Member

tarekgh commented Nov 25, 2024

If it fails then it tries to retrieve the encodings of said model from the server and then tries to call TiktokenTokenizer.Create() instead.

How do you do that today? I mean, how do you retrieve the encodings from the server?

@razshare
Author

The server exposes a simple HTTP endpoint which, through reflection as mentioned above, gets the encoder dictionary.
The client simply sends an HTTP request to get that data whenever said model is not available locally on the client.
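
For reference, the endpoint looks roughly like the following. This is a minimal ASP.NET Core sketch; the /encodings/{modelId} route and the serialization details are specific to our setup. It reflects the internal Encoder out of a tokenizer created with CreateForModel and writes it out in the text format TiktokenTokenizer.Create can read back: one base64-encoded token and its rank per line.

using System;
using System.Collections.Generic;
using System.Reflection;
using System.Text;
using Microsoft.ML.Tokenizers;

// 'app' is the WebApplication from the usual minimal-API bootstrap.
app.MapGet("/encodings/{modelId}", (string modelId) =>
{
    var tokenizer = TiktokenTokenizer.CreateForModel(modelId);

    // Reflection workaround from the test file quoted above: pull out the internal Encoder.
    var encoderProperty = typeof(TiktokenTokenizer)
        .GetProperty("Encoder", BindingFlags.Instance | BindingFlags.NonPublic);
    var encoder = encoderProperty?.GetValue(tokenizer) as IReadOnlyDictionary<ReadOnlyMemory<byte>, int>
        ?? throw new InvalidOperationException("Encoder property not found.");

    // Serialize as "<base64 token> <rank>" per line.
    var builder = new StringBuilder();
    foreach (var (tokenBytes, rank) in encoder)
    {
        builder.Append(Convert.ToBase64String(tokenBytes.Span)).Append(' ').Append(rank).Append('\n');
    }

    return Results.Text(builder.ToString(), "text/plain");
});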

@tarekgh
Member

tarekgh commented Nov 26, 2024

Thanks @razshare!

You said earlier "In my case, the server itself never creates the actual tokenizer instance; there's no invocation of TiktokenTokenizer.CreateForModel() or TiktokenTokenizer.Create() on the server." That is not the case then, as the server needs to create the tokenizer to get the encoder. Sorry if I am missing something obvious.

@razshare
Author

Oh, you're right, I misspoke!
What I meant to say is that the server doesn't count the tokens itself.

You're completely right, it does create an instance!

Sorry if I am missing something obvious.

You're not, I'm the one mumbling!

Let me know if there's anything else I can try to clarify.
