
Initial release #22

Merged: 11 commits merged into main on Oct 7, 2024
Conversation

@hendrikvanantwerpen (Contributor) commented Oct 3, 2024

The bpe crate name was released, so I released an initial version of the crate to claim the name.

Things changed:

  • Added some fields to the crate manifest.
  • Changed the serialization of token dictionaries. The problem was that crates.io has a size limit of 10MB, while our serialized BPE instances were around 15MB and 30MB. For now I've opted to serialize the token lists plus the hash factor, and build the BPE instance in the lazy function. The performance impact is only relevant when the values are initialized, which happens only once per run. But I'm happy to iterate on this if necessary.
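The lazy-build approach described above can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: the type and function names (`Bpe`, `from_tokens`, `num_tokens`) are hypothetical stand-ins, and the token list is a toy payload.

```rust
use std::sync::LazyLock;

// Hypothetical stand-in for the crate's BPE type. The real crate would
// rebuild its Aho-Corasick automaton here; we only model the idea of
// "small serialized data in, expensive structure out".
struct Bpe {
    tokens: Vec<Vec<u8>>,
}

impl Bpe {
    // Stand-in for the expensive construction step that previously
    // motivated embedding a fully serialized representation.
    fn from_tokens(tokens: Vec<Vec<u8>>) -> Self {
        Bpe { tokens }
    }

    fn num_tokens(&self) -> usize {
        self.tokens.len()
    }
}

// Only the (small) token list is embedded; the expensive build runs at
// most once, on first access of the static.
static BPE: LazyLock<Bpe> = LazyLock::new(|| {
    let tokens = vec![b"a".to_vec(), b"b".to_vec(), b"ab".to_vec()];
    Bpe::from_tokens(tokens)
});

fn main() {
    // First access triggers the one-time build; later accesses are cheap.
    assert_eq!(BPE.num_tokens(), 3);
}
```

`std::sync::LazyLock` (stable since Rust 1.80) gives the "once per run" initialization cost the comment refers to; earlier toolchains would use `once_cell::sync::Lazy` instead.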

I decided to go ahead with the release to make sure we got the name. I imagine we'll do another release soon with any additional changes or polishing we think are necessary.

@aneubeck (Collaborator) left a comment


The reason I added the serialized representation in the first place was that building the Aho-Corasick data structures from scratch takes several seconds!

I.e. this is a cost we don't want to pay when running blackbird tests (or when other people depend on this).
One option could be to improve the aho-corasick serialization, since we have our own fork anyway. But I didn't check how much data it actually has. It might be tricky to get it small, since we have 100k resp. 200k tokens in it, and you probably need to store a couple of numbers for every character in them. So making it small might be challenging...

@aneubeck (Collaborator) commented Oct 4, 2024

I looked at the serialization code of the daachorse crate and it doesn't look like we can save a ton there.
The only thing that might work is to run gzip over the serialized representation. Maybe that's good enough?
In principle, it is possible to store assets outside of the binary: rust-lang/cargo#11683
But that doesn't get around the 10MB limit either.

not great...

@hendrikvanantwerpen (Contributor, Author) replied:

> I.e. this is a cost we don't want to pay when running blackbird tests (or when other people depend on this).

This may not be as bad as it could be. I added a print statement to the initialization and it looks like it runs only once for all tests. That was in this crate though; in Blackbird it might happen once per crate (for every crate that uses the tokenizers directly or indirectly).

I'll try compression and see if that gives us anything.

@hendrikvanantwerpen (Contributor, Author) commented Oct 4, 2024

Just compressing the serialized BPE reduces the size, but not enough to get us under the 10MB limit:

|        | gz (best) | zlib (best) | raw |
|--------|-----------|-------------|-----|
| cl100k | 8.4M      | 8.1M        | 15M |
| o200k  | 17M       | 17M         | 31M |
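A quick way to see why even the best compression above doesn't help for o200k is to check each payload against the crates.io limit mentioned earlier in the thread. This is a trivial sketch; `fits_in_crate` is a hypothetical helper, and 10 MiB is assumed as the exact interpretation of the "10MB" limit.

```rust
// crates.io package size limit discussed in this thread, taken as 10 MiB.
const CRATES_IO_LIMIT: usize = 10 * 1024 * 1024;

// Hypothetical helper: would a serialized (possibly compressed) payload
// fit inside a published crate, ignoring the rest of the package contents?
fn fits_in_crate(payload_len: usize) -> bool {
    payload_len <= CRATES_IO_LIMIT
}

fn main() {
    // The measurements above: gzipped cl100k (~8.4M) squeaks under the
    // limit, but gzipped o200k (~17M) is still well over it.
    assert!(fits_in_crate(8_400_000));
    assert!(!fits_in_crate(17_000_000));
}
```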

I wondered if we could use u16 numbers, but for the o200k automaton, the number of states requires at least 19 bits.
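The u16 idea can be checked with a small bit-width calculation. This sketch is illustrative only; the exact state count of the o200k automaton is not given in the thread, so the 19-bit figure is taken at face value (implying more than 2^18 = 262,144 states).

```rust
// Minimum number of bits needed to index `n` distinct states (indices
// 0..n). Assumes n > 0.
fn bits_needed(n: u32) -> u32 {
    32 - (n - 1).leading_zeros()
}

fn main() {
    // u16 indices can address at most 65,536 states...
    assert_eq!(bits_needed(65_536), 16);
    // ...and one state more already needs 17 bits.
    assert_eq!(bits_needed(65_537), 17);
    // A state count requiring 19 bits, as reported for the o200k
    // automaton, means more than 262,144 states.
    assert_eq!(bits_needed(262_145), 19);
}
```

So u16 state indices are ruled out for o200k by a wide margin; even u16 per-token indices would be too small for a 200k vocabulary.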

@hendrikvanantwerpen (Contributor, Author) commented:

@aneubeck The serialization changes in #24 were based on this PR, so most changes here are now the same. What remains is mostly additional manifest changes.

@@ -0,0 +1,42 @@
# OpenAI Byte Pair Encoders

Fast tokenizers for OpenAI token sets based on the [bpe](https://crates.io/crates/bpe) crate.
@aneubeck (Collaborator) left a review comment:

There should be a warning that this crate is NOT replicating the regex "word splitting" used by OpenAI.
Therefore, results will differ!

@hendrikvanantwerpen (Contributor, Author) replied:

I added the warning and also a test that shows an example of the issue.

@hendrikvanantwerpen hendrikvanantwerpen merged commit 8f53c50 into main Oct 7, 2024
3 checks passed
@hendrikvanantwerpen hendrikvanantwerpen deleted the initial-release branch October 7, 2024 11:02