Expose Encoder in TiktokenTokenizer #7313
Comments
@razshare could you please elaborate more on why you need to expose it and how you plan to use it? I read the description, but more details about your scenario would help here.
@tarekgh of course.
Hello @tarekgh, as promised here's a more in-depth explanation. LLM-based client applications often need to count the number of tokens in a given string.
These are two examples I'm actively dealing with at the moment; I imagine there are other reasons too, which I have yet to encounter.

In most cases, for the client to be able to count how many tokens a given string actually contains, it would simply create a tokenizer

```csharp
var tokenizer = TiktokenTokenizer.CreateForModel(modelId);
```

and then add the logic for counting the tokens in a string

```csharp
var numberOfTokens = tokenizer.CountTokens(myInputString);
```

However, sometimes, due to constraints out of our control, the client application cannot stay up to date with the latest changes 100% of the time.

In my case, I need to offer backward compatibility on the LLM side of things. Some clients are not able to update their client application, which means their version of the encodings is frozen as well.

So with that in mind, if a client application has an out-of-date set of encodings, there are some cases in which the first invocation will fail

```csharp
var tokenizer = TiktokenTokenizer.CreateForModel("my-new-fancy-model-from-year-2030");
```

because the application, as built at the time, would not contain encodings for the model `my-new-fancy-model-from-year-2030`.

All is not lost though, because there is a Create overload:

machinelearning/src/Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs

Lines 1278 to 1284 in 5090327
This overload is agnostic to the model name/id; it just takes in the raw encoder dictionary. That opens the gates to a solution in which the server plays a role in solving this backward compatibility issue: when the client is out of date and unaware of a specific model name/id, we fall back to the server, retrieve the encodings of said model, and finally create a new tokenizer directly from those encodings.

```csharp
var buffer = UTF8Encoding.UTF8.GetBytes(base64Encodings);
var stream = new MemoryStream(buffer);
var tokenizer = TiktokenTokenizer.Create(stream);
```

Note: the `base64Encodings` string here is the raw encoder content as retrieved from the server.

After that, the client can proceed to count tokens as usual.

Note: and of course the client-side application may even cache these encodings locally, so that the next time it encounters a request for said new and shiny model it doesn't have to query the server, and can use the cached encodings instead.

Currently, retrieving these raw encodings on the server side can only be done through reflection, as shown in the original test file.

machinelearning/test/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs

Lines 732 to 740 in 5090327
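Putting the pieces together, this is roughly the fallback flow on the client side. A minimal sketch only: the `/encodings/{modelId}` endpoint, the `TokenizerProvider` helper, and the exception type being caught are my own assumptions, not anything provided by the library, and the extra `Create` parameters discussed later in this thread are omitted.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Microsoft.ML.Tokenizers;

static class TokenizerProvider
{
    // Hypothetical helper: use the locally known encodings when available,
    // otherwise fall back to the server for the raw encoder data.
    public static async Task<TiktokenTokenizer> GetTokenizerAsync(string modelId, HttpClient http)
    {
        try
        {
            // Succeeds when the client build already ships encodings for this model.
            return TiktokenTokenizer.CreateForModel(modelId);
        }
        catch (NotSupportedException) // assumed failure mode for unknown model names
        {
            // Assumed endpoint returning the encoder data as text, i.e. the same
            // "base64(token) rank" lines the stream-based Create overload reads.
            string base64Encodings = await http.GetStringAsync($"/encodings/{modelId}");

            var buffer = Encoding.UTF8.GetBytes(base64Encodings);
            using var stream = new MemoryStream(buffer);

            // Mirrors the snippet from this thread; additional parameters
            // (pre-tokenizer, special tokens) may be required in practice.
            return TiktokenTokenizer.Create(stream);
        }
    }
}
```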
This is probably fine for you, the authors. But we're not the authors, and accessing internal properties this way doesn't guarantee that they won't change one day without being flagged by a major release number. On top of that it's reflection, so it has a performance impact as well.

Note: although that impact is probably very small.

Hence the solution: make the encoder public. Let me know if this clarifies the reason for this change and the problem it's actually trying to solve, and whether you think there's a better, more ergonomic, more future-proof (and so on) approach.
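For reference, this is roughly what that reflection looks like today on our side. A minimal sketch: the internal property name and its type are assumptions based on the referenced source lines, and the helper name is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using Microsoft.ML.Tokenizers;

static class TokenizerReflection
{
    // Pulls the internal encoder dictionary out via reflection, similar to what
    // the referenced test does. Property name and return type are assumed here.
    public static IReadOnlyDictionary<ReadOnlyMemory<byte>, int>? GetEncoderViaReflection(TiktokenTokenizer tokenizer)
    {
        PropertyInfo? property = typeof(TiktokenTokenizer)
            .GetProperty("Encoder", BindingFlags.Instance | BindingFlags.NonPublic);

        return property?.GetValue(tokenizer) as IReadOnlyDictionary<ReadOnlyMemory<byte>, int>;
    }
}
```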
@razshare thanks a lot for the details, it is super helpful. One follow-up question: is the server always in control of the source of the tokenizer data? I mean, can the server always create the tokenizer using the stream (instead of calling CreateForModel)? If you can do that, it would be simpler for the server to just stream the content to the client without any processing (i.e. it would avoid getting the encoder, encoding it as UTF-8 base64, and sending it to the client). By the way, I am not objecting to your proposal, I am just brainstorming how to support the scenario in an efficient way. If we need to go with your proposal, we may think about exposing a tokenizer Create method that allows taking the Encoder data too. Also, calling the Create method passing only the stream will not be enough, as you need to pass the pre-tokenizer and special tokens too.
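A minimal sketch of that simpler alternative, assuming the server can serve the raw tiktoken vocabulary file unchanged: the endpoint URL and host are made up, and (as noted above) the pre-tokenizer and special tokens are left out of the `Create` call, mirroring the earlier snippet from this thread.

```csharp
using System;
using System.IO;
using System.Net.Http;
using Microsoft.ML.Tokenizers;

// Client side: consume the vocabulary stream served by the backend as-is.
// "/vocab/{modelId}" is an assumed endpoint that returns the raw tiktoken file.
using HttpClient http = new() { BaseAddress = new Uri("https://llm.example.com") };
await using Stream vocabStream = await http.GetStreamAsync("/vocab/my-new-fancy-model-from-year-2030");

// A complete call would also pass the matching pre-tokenizer and special
// tokens for the model; they are omitted here for brevity.
var tokenizer = TiktokenTokenizer.Create(vocabStream);
```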
Hello again @tarekgh!
In my case, the server itself never creates the actual tokenizer instance; there's no invocation for
If by
Yeah, I left those out for the time being in order to focus on the encoder specifically, and also because we haven't wrestled with that part so far; we've just been omitting those parameters for the sake of simplicity, in order to get the architecture working and solve the backward compatibility issue. Pre-tokenizer and special tokens will come after; for the moment I'm aiming to allow the client to successfully create a tokenizer from a remote encoder dictionary.
Without converting it to a stream?
How do you do that today? I mean, how do you retrieve the encodings from the server?
The server exposes a simple HTTP endpoint which, through reflection as mentioned above, gets the encoder dictionary.
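Roughly like this: a hypothetical ASP.NET Core minimal API, where `GetEncoderViaReflection` is the same kind of reflection helper sketched earlier in this thread, not an existing library API, and the route is made up.

```csharp
// ASP.NET Core minimal API (Web SDK implicit usings assumed).
using System.Text;
using Microsoft.ML.Tokenizers;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Assumed endpoint: rebuild the "base64(token) rank" lines from the reflected
// encoder dictionary and return them as plain text.
app.MapGet("/encodings/{modelId}", (string modelId) =>
{
    var tokenizer = TiktokenTokenizer.CreateForModel(modelId);
    var encoder = TokenizerReflection.GetEncoderViaReflection(tokenizer)
        ?? throw new InvalidOperationException("Could not read the internal encoder.");

    var sb = new StringBuilder();
    foreach (var (token, rank) in encoder)
    {
        sb.Append(Convert.ToBase64String(token.ToArray())).Append(' ').Append(rank).AppendLine();
    }

    return Results.Text(sb.ToString(), "text/plain");
});

app.Run();
```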
Thanks @razshare! You said earlier
Oh, you're right, I misspoke! You're completely right, it does create an instance!
You're not, I'm the one mumbling! Let me know if there's anything else I can try to clarify.
Hello, first of all thank you very much for this project!
Is your feature request related to a problem? Please describe.
Yes, it is.
Some of our clients may have outdated encodings on their client application.
We still want our clients to have access to new encodings even if their client application is not up to date, hence we want to serve the encoder dictionary from a server endpoint.
The problem is that, currently, the Encoder property in TiktokenTokenizer is internal.
machinelearning/src/Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs
Lines 998 to 1001 in 5090327
Describe the solution you'd like
I would like to expose this Encoder property.
There seems to be the intent to expose this property at some point in the future.
machinelearning/test/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs
Lines 732 to 740 in 5090327
Maybe this is the time to do it; what do you think?
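For illustration, here is how the server-side code could look once the property is public. A minimal sketch: the property's exact type is my assumption based on the internal member referenced above, and the model name is just an example.

```csharp
using System;
using Microsoft.ML.Tokenizers;

var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

// With the property public, the server no longer needs any reflection.
// The property's exact shape/type is an assumption on my part.
var encoder = tokenizer.Encoder;
Console.WriteLine($"Encoder contains {encoder.Count} token/rank pairs.");
```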
Describe alternatives you've considered
Maybe a separate method that does exactly what that test from above does using reflection.
Sounds like overkill and a lot of overhead though.
Exposing the property is probably the best way to deal with this.
Additional context
I'm sending a PR your way with the changes, feel free to ask for/make any modifications you think are necessary.