Commit

fix: Handle Salamandra and OpenCoder tokenizers
saattrupdan authored and torymur committed Nov 19, 2024
1 parent c5db1dd commit 9ddf5e7
Showing 1 changed file with 5 additions and 3 deletions.
8 changes: 5 additions & 3 deletions python/outlines_core/fsm/regex.py
@@ -342,9 +342,11 @@ def make_deterministic_fsm(fsm: FSM) -> Tuple[BetterFSM, Dict[int, int]]:

re_llama_byte_token = re.compile(r"^<0x[0-9A-F]{2}>$")

-# The "▁*" prefix is required to handle Gemma and GPT-SW3 tokenizers, and the "\.*"
-# suffix is required to handle the NorwAI tokenizer.
-re_replacement_seq = re.compile(r"^▁*�+\.*$")
+# The "▁*" prefix is required to handle Gemma and GPT-SW3 tokenizers.
+# The "\.*" suffix is required to handle the NorwAI tokenizer.
+# The "\.*" prefix is required to handle the Salamandra tokenizer.
+# The "s*$" suffix is required to handle the OpenCoder tokenizer.
+re_replacement_seq = re.compile(r"^▁*\.*�+\.*s*$")


# Copied from transformers.models.gpt2.tokenization_gpt2.bytes_to_unicode
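
The effect of the pattern change can be checked with a small sketch. Both patterns are taken from the diff, written here with `\u2581` and `\ufffd` escapes instead of the literal "▁" and replacement characters. The sample token strings and the `matches` helper are hypothetical, chosen only to exercise each optional part of the pattern; they are not taken from the Salamandra or OpenCoder vocabularies.

```python
import re

# Patterns from the diff, with the literal characters written as escapes:
# \u2581 is "▁" and \ufffd is the Unicode replacement character.
OLD_PATTERN = re.compile(r"^\u2581*\ufffd+\.*$")
NEW_PATTERN = re.compile(r"^\u2581*\.*\ufffd+\.*s*$")

REPL = "\ufffd"

# Hypothetical token strings (assumptions for illustration only).
SAMPLES = [
    REPL,             # bare replacement character
    "\u2581" + REPL,  # "▁" prefix
    REPL + ".",       # "." suffix
    "." + REPL,       # "." prefix, newly allowed
    REPL + "s",       # "s" suffix, newly allowed
    "abc",            # ordinary token, should never match
]


def matches(pattern: re.Pattern, token: str) -> bool:
    """Return True if the whole token is a replacement sequence under `pattern`."""
    return pattern.match(token) is not None


if __name__ == "__main__":
    for token in SAMPLES:
        print(repr(token), matches(OLD_PATTERN, token), matches(NEW_PATTERN, token))
```

Under these assumptions, only the "." prefix and "s" suffix samples change from non-matching to matching; the other samples behave the same under both patterns.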

