
Initial release #22

Merged: 11 commits merged into main on Oct 7, 2024
Conversation

@hendrikvanantwerpen (Contributor) commented Oct 3, 2024

The bpe crate name was released, so I released an initial version of the crate to claim the name.

Things changed:

  • Added some fields to the crate manifest.
  • Changed the serialization of token dictionaries. The problem was that crates.io has a size limit of 10MB, while our serialized BPE instances were around 15MB and 30MB. For now I've opted to serialize the token lists plus the hash factor, and build the BPE instance in the lazy function. The performance impact is only relevant when the values are initialized, which happens only once per run. But I'm happy to iterate on this if necessary.
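The lazy-build approach described above can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: the type and function names (`Bpe`, `from_tokens`, `num_tokens`) are hypothetical stand-ins, and the token list is a toy payload.

```rust
use std::sync::LazyLock;

// Hypothetical stand-in for the crate's BPE type. The real crate would
// rebuild its Aho-Corasick automaton here; we only model the idea of
// "small serialized data in, expensive structure out".
struct Bpe {
    tokens: Vec<Vec<u8>>,
}

impl Bpe {
    // Stand-in for the expensive construction step that previously
    // motivated embedding a fully serialized representation.
    fn from_tokens(tokens: Vec<Vec<u8>>) -> Self {
        Bpe { tokens }
    }

    fn num_tokens(&self) -> usize {
        self.tokens.len()
    }
}

// Only the (small) token list is embedded; the expensive build runs at
// most once, on first access of the static.
static BPE: LazyLock<Bpe> = LazyLock::new(|| {
    let tokens = vec![b"a".to_vec(), b"b".to_vec(), b"ab".to_vec()];
    Bpe::from_tokens(tokens)
});

fn main() {
    // First access triggers the one-time build; later accesses are cheap.
    assert_eq!(BPE.num_tokens(), 3);
}
```

`std::sync::LazyLock` (stable since Rust 1.80) gives the "once per run" initialization cost the comment refers to; earlier toolchains would use `once_cell::sync::Lazy` instead.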

I decided to go ahead with the release to make sure we got the name. I imagine we'll do another release soon with any additional changes or polishing we think are necessary.

@aneubeck (Collaborator) left a comment


The reason I added the serialized representation in the first place was that building the Aho-Corasick data structures from scratch takes several seconds!

I.e. this is a cost we don't want to pay when running blackbird tests (or when other people depend on this).
One option could be to improve the aho-corasick serialization, since we have our own fork anyway. But I didn't check how much data it actually has. It might be tricky to get it small, since we have 100k resp. 200k tokens in it, and you probably need to store a couple of numbers for every character in them. So making it small might be challenging...

@aneubeck (Collaborator) commented Oct 4, 2024

I looked at the serialization code of the daachorse crate and it doesn't look like we can save a ton there.
The only thing that might work is to run gzip over the serialized representation. Maybe that's good enough?
In principle, it is possible to store assets outside of the binary: rust-lang/cargo#11683
But that doesn't get around the 10MB limit either.

not great...

@hendrikvanantwerpen (Contributor, Author) replied:

> I.e. this is a cost we don't want to pay when running blackbird tests (or when other people depend on this).

This may not be as bad as it could be. I added a print statement to the initialization and it looks like it runs only once for all tests. That was in this crate though; in Blackbird it might happen once per crate (for every crate that uses the tokenizers directly or indirectly).

I'll try compression and see if that gives us anything.

@hendrikvanantwerpen (Contributor, Author) commented Oct 4, 2024

Just compressing the serialized BPE reduces the size, but not enough to get us under the 10MB limit:

|        | gz (best) | zlib (best) | raw |
|--------|-----------|-------------|-----|
| cl100k | 8.4M      | 8.1M        | 15M |
| o200k  | 17M       | 17M         | 31M |
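A quick way to see why even the best compression above doesn't help for o200k is to check each payload against the crates.io limit mentioned earlier in the thread. This is a trivial sketch; `fits_in_crate` is a hypothetical helper, and 10 MiB is assumed as the exact interpretation of the "10MB" limit.

```rust
// crates.io package size limit discussed in this thread, taken as 10 MiB.
const CRATES_IO_LIMIT: usize = 10 * 1024 * 1024;

// Hypothetical helper: would a serialized (possibly compressed) payload
// fit inside a published crate, ignoring the rest of the package contents?
fn fits_in_crate(payload_len: usize) -> bool {
    payload_len <= CRATES_IO_LIMIT
}

fn main() {
    // The measurements above: gzipped cl100k (~8.4M) squeaks under the
    // limit, but gzipped o200k (~17M) is still well over it.
    assert!(fits_in_crate(8_400_000));
    assert!(!fits_in_crate(17_000_000));
}
```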

I wondered if we could use u16 numbers, but for the o200k automaton, the number of states requires at least 19 bits.
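The u16 idea can be checked with a small bit-width calculation. This sketch is illustrative only; the exact state count of the o200k automaton is not given in the thread, so the 19-bit figure is taken at face value (implying more than 2^18 = 262,144 states).

```rust
// Minimum number of bits needed to index `n` distinct states (indices
// 0..n). Assumes n > 0.
fn bits_needed(n: u32) -> u32 {
    32 - (n - 1).leading_zeros()
}

fn main() {
    // u16 indices can address at most 65,536 states...
    assert_eq!(bits_needed(65_536), 16);
    // ...and one state more already needs 17 bits.
    assert_eq!(bits_needed(65_537), 17);
    // A state count requiring 19 bits, as reported for the o200k
    // automaton, means more than 262,144 states.
    assert_eq!(bits_needed(262_145), 19);
}
```

So u16 state indices are ruled out for o200k by a wide margin; even u16 per-token indices would be too small for a 200k vocabulary.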

@hendrikvanantwerpen (Contributor, Author) commented:

@aneubeck The serialization changes in #24 were based on this PR, so most changes here are now the same. What remains is mostly additional manifest changes.

@@ -0,0 +1,42 @@
# OpenAI Byte Pair Encoders

Fast tokenizers for OpenAI token sets based on the [bpe](https://crates.io/crates/bpe) crate.
@aneubeck (Collaborator) left a review comment:

There should be a warning that this crate is NOT replicating the regex "word splitting" used by OpenAI.
Therefore, results will differ!

@hendrikvanantwerpen (Contributor, Author) replied:

I added the warning and also a test that shows an example of the issue.

@hendrikvanantwerpen hendrikvanantwerpen merged commit 8f53c50 into main Oct 7, 2024
3 checks passed
@hendrikvanantwerpen hendrikvanantwerpen deleted the initial-release branch October 7, 2024 11:02