

Update text about input splitting
hendrikvanantwerpen committed Oct 10, 2024
1 parent 8f16b7f commit b51951b
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions crates/bpe/README.md
@@ -30,11 +30,12 @@ The comparison with the Rust tiktoken implementation is more subtle, because pre

## Prior Art

There are mostly three strategies for BPE encoding.
There are mostly two strategies for BPE encoding.

1) Trivial solution. Search brute force for the most frequent pair in the encoded text according to the dictionary and replace those occurrences (see the sketch below). This has `O(n^2)` complexity and is therefore not very appealing in production.
2) Heap based. Set up a heap with the frequencies. This replaces the linear search for the best pair with a logarithmic heap operation. If done properly, the overall complexity reduces to `O(n log n)`.
3) Split the input into sections of a maximum size first and then process each section individually. In theory this shrinks the complexity to `O(n)` if the section size is small enough. But it will in general produce different results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is in general impossible. (Note that tiktoken as well as other tokenizers often split the input as part of pre-tokenization to improve model performance.)

Note that many tokenizers split the input into sections and then process each section individually. In theory this shrinks the complexity to `O(n)` if the section size is small enough. But it will in general produce different results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is in general impossible. Input splitting is therefore not a viable strategy for improving encoding performance.
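To make the trivial strategy concrete, here is a minimal Rust sketch. The `Merges` table and all names are illustrative assumptions, not the API of this crate; the heap-based variant replaces the inner linear scan with a priority queue over pair ranks.

```rust
use std::collections::HashMap;

/// Illustrative merge table: maps a token pair to (rank, merged token id).
/// A lower rank means the pair was learned earlier and is merged first.
type Merges = HashMap<(u32, u32), (u32, u32)>;

/// Brute-force BPE encoding (strategy 1): scan all adjacent pairs, merge the
/// best-ranked one, and repeat until nothing can be merged. Each pass is O(n)
/// and up to n passes may be needed, hence O(n^2) overall.
fn encode_bruteforce(mut tokens: Vec<u32>, merges: &Merges) -> Vec<u32> {
    loop {
        // Find the position of the best-ranked mergeable pair, if any.
        let mut best: Option<(usize, u32, u32)> = None; // (index, rank, merged token)
        for i in 0..tokens.len().saturating_sub(1) {
            if let Some(&(rank, merged)) = merges.get(&(tokens[i], tokens[i + 1])) {
                if best.map_or(true, |(_, r, _)| rank < r) {
                    best = Some((i, rank, merged));
                }
            }
        }
        match best {
            Some((i, _, merged)) => {
                tokens[i] = merged; // replace the pair by the merged token
                tokens.remove(i + 1);
            }
            None => return tokens,
        }
    }
}
```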

We have implemented a fast heap-based solution as a baseline. It uses a bitfield to mark token boundaries. This is more memory efficient than linked lists or other approaches and should also be faster.
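As a rough illustration of the bitfield idea (names and layout are assumptions for this sketch, not the crate's actual data structures): bit `i` is set while a token starts at byte offset `i`, so merging a token with its successor only clears one bit instead of rewiring list pointers.

```rust
/// Hypothetical boundary set: bit i is set iff a token starts at byte offset i.
struct Boundaries {
    bits: Vec<u64>,
}

impl Boundaries {
    /// Initially every byte is its own token, so every bit is set.
    fn new(len: usize) -> Self {
        Self { bits: vec![!0u64; (len + 63) / 64] }
    }

    /// Merging a token with its successor clears the successor's start bit.
    fn clear(&mut self, i: usize) {
        self.bits[i / 64] &= !(1u64 << (i % 64));
    }

    /// Is there a token boundary at byte offset i?
    fn is_boundary(&self, i: usize) -> bool {
        (self.bits[i / 64] >> (i % 64)) & 1 == 1
    }

    /// Start offset of the token following the one that starts before `len`.
    fn next_boundary(&self, i: usize, len: usize) -> Option<usize> {
        (i + 1..len).find(|&j| self.is_boundary(j))
    }
}
```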

