Add CONTRIBUTING.md for bpe explaining project structure and benchmark instructions
hendrikvanantwerpen committed Oct 14, 2024
1 parent 02118ef commit ed45357
Showing 2 changed files with 39 additions and 23 deletions.
39 changes: 39 additions & 0 deletions crates/bpe/CONTRIBUTING.md
@@ -0,0 +1,39 @@
# Contributing

Here are specific details that are useful when you want to contribute to the BPE crates.
Make sure to read the repository's [contribution guidelines][contributing] as well.

## Project structure

This project has a slightly unusual structure to resolve some dependency issues.

- This directory contains `bpe`, the BPE code itself.
- A sibling directory contains `bpe-openai`, which exposes tokenizers for OpenAI token sets, and depends on `bpe`.
- Tests are located in the `tests` subdirectory, and benchmarks in the `benchmarks` subdirectory. Both of these are separate crates so they can depend on `bpe-openai` without causing a cyclic dependency.

Only the `bpe` and `bpe-openai` crates are meant to be published. The other ones are for development use only.

## Running benchmarks

Change the working directory to the `benchmarks` directory:

```sh
cd benchmarks
```

Run the benchmark as follows (requires [cargo-criterion](https://crates.io/crates/cargo-criterion) to be installed):

```sh
cargo criterion
```

(Using `cargo bench` ignores the settings in `criterion.toml`!)
Open the full report, which should be located at `target/criterion/reports/index.html`.
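
As a minimal sketch, the report location can be checked from the shell before opening it. This assumes the path above is relative to the current working directory; `report_path` and `check_report` are names made up for this illustration and are not part of the repository.

```sh
# Path taken from the text above; assumed to be relative to the current directory.
report_path="target/criterion/reports/index.html"

# check_report is a hypothetical helper: it prints the path if the report
# exists, and otherwise prints a hint and returns a non-zero status.
check_report() {
    if [ -f "$1" ]; then
        echo "report: $1"
    else
        echo "no report at $1 (run cargo criterion first)"
        return 1
    fi
}

check_report "$report_path" || true
```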

Update the figures in this repo as follows (requires `rsvg-convert` from `librsvg` to be installed):

```sh
script/copy-results
```
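
The steps above rely on two external tools. A small sketch like the following can confirm that both are on `PATH` before starting. `check_tool` is a hypothetical helper name used only here, and the install hints are suggestions rather than project requirements.

```sh
# check_tool is a hypothetical helper: it reports whether a command is on PATH
# and returns a non-zero status when it is missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# cargo-criterion provides the `cargo criterion` subcommand.
check_tool cargo-criterion || echo "hint: cargo install cargo-criterion"
# rsvg-convert ships with librsvg; the package name varies by platform.
check_tool rsvg-convert || echo "hint: install librsvg with your system package manager"
```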

[contributing]: ../../CONTRIBUTING.md
23 changes: 0 additions & 23 deletions crates/bpe/README.md
@@ -296,26 +296,3 @@ The performance of tiktoken shows a quadratic growth with the input size.
The Huggingface encoder scales better, but becomes slower and slower compared to our implementation as input size increases.

![worst-case encoding runtime comparison](./images/performance-worstcase.svg)

### Running the benchmarks

Benchmarks are located in a separate crate in the `benchmarks` directory.

```sh
cd benchmarks
```

Run the benchmark as follows (required [cargo-criterion](https://crates.io/crates/cargo-criterion) installed):

```sh
cargo criterion
```

(Using `cargo bench` ignores the settings in `criterion.toml`!)
Open the full report which should be located in `target/criterion/reports/index.html`.

Update the figures in this repo as follows (requires `rsvg-convert` from `librsvg` installed):

```sh
script/copy-results
```
