Add ignore_merges option to BPE tokenizers #716

Merged: 1 commit, Apr 18, 2024

@xenova (Collaborator) commented Apr 17, 2024

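Adds an `ignore_merges` option to BPE tokenizers. When it is enabled, the tokenizer checks the vocab before applying any merges and returns the token as-is if it is already present.
Equivalent to huggingface/tokenizers#1493.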
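
For illustration, here is a minimal sketch of the idea, not the code from this PR. The function `bpeEncode`, its `ignoreMerges` parameter, and the toy `vocab` and `merges` data are all hypothetical, and the merge loop is a deliberately simplified BPE.

```javascript
// Hypothetical, simplified BPE encoder (NOT the transformers.js implementation).
// `vocab` is a Set of known tokens; `merges` is an ordered list of [left, right] pairs.
function bpeEncode(token, { vocab, merges, ignoreMerges = false }) {
  // The check described above: if the whole token is already in the vocab,
  // return it directly and skip the merge step entirely.
  if (ignoreMerges && vocab.has(token)) {
    return [token];
  }

  // Lower index in `merges` means higher priority.
  const ranks = new Map(merges.map(([a, b], i) => [`${a} ${b}`, i]));

  // Start from individual characters and repeatedly apply the best-ranked merge.
  const word = Array.from(token);
  while (word.length > 1) {
    let best = -1;
    let bestRank = Infinity;
    for (let i = 0; i < word.length - 1; ++i) {
      const rank = ranks.get(`${word[i]} ${word[i + 1]}`);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        best = i;
      }
    }
    if (best === -1) break; // no applicable merge left
    word.splice(best, 2, word[best] + word[best + 1]);
  }
  return word;
}

// Toy data (assumed for this example): "hello" is in the vocab,
// but the merges alone cannot rebuild it.
const vocab = new Set(["h", "e", "l", "o", "he", "ll", "hello"]);
const merges = [["h", "e"], ["l", "l"]];

console.log(bpeEncode("hello", { vocab, merges }));
// [ 'he', 'll', 'o' ]
console.log(bpeEncode("hello", { vocab, merges, ignoreMerges: true }));
// [ 'hello' ]
```

The option only changes the result when a vocab entry cannot be reached through the merge table; tokens that are not in the vocab still go through the normal merge loop.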

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@xenova xenova merged commit 6d5901e into main Apr 18, 2024
4 checks passed
@xenova xenova deleted the tokenizers-ignore-merges branch April 18, 2024 11:59
Th3G33k added a commit to Th3G33k/transformers.js that referenced this pull request May 11, 2024
* Add `ignore_merges` option to BPE tokenizers (huggingface#716)

* [version] Update to 2.17.1

* Use ungated version of mistral tokenizer (huggingface#718)

* Add mobilevitv2 (huggingface#721)

* Add support for MobileViTV2

* Update supported_models.py

* Add support for `do_flip_channel_order`

* Add unit test for `do_flip_channel_order=true`

* docs: update vanilla-js.md (huggingface#738)

minor fix

* Support reading data from blob URI (huggingface#645)

* Make blob as valid URL

* Create function to detect the blob URI

* Change to `isValidUrl`

* Remove comment

Co-authored-by: Joshua Lochner <[email protected]>

* Merge `isValidHttpUrl` into `isValidUrl`

* Correct implement

* Update docs

* Add test

* Remove export for `isValidUrl`

* Test read blob via `getFile`

* Use `res.text()` instead `res.body`

---------

Co-authored-by: Joshua Lochner <[email protected]>

* Add aggregation_strategy + start end tokens

* Beautify code

* tokenizers return_offsets_mapping

* QuestionAnswering start end char

---------

Co-authored-by: Joshua Lochner <[email protected]>
Co-authored-by: Ikko Eltociear Ashimine <[email protected]>
Co-authored-by: Hans <[email protected]>