Commit
SmolLM2: fix different bos_token depending on base or instruct
ysjprojects committed Dec 3, 2024
1 parent 6dd3353 commit a46e3f8
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion litgpt/tokenizer.py
@@ -94,7 +94,9 @@ def check_if_bos_token_used(self, checkpoint_dir: Path) -> bool:
             config = json.load(fp)
         # for LlaMA-3 tokenizer there is no `add_bos_token` at all and `tokenizer_class` is only
         # `PreTrainedTokenizerFast`
-        if checkpoint_dir.stem.startswith(("Meta-Llama-3", "Llama-3", "SmolLM2")):
+        if checkpoint_dir.stem.startswith(("Meta-Llama-3", "Llama-3")):
             return True
+        if checkpoint_dir.stem.startswith("SmolLM2") and checkpoint_dir.stem.endswith("-Instruct"):
+            return True
         if "add_bos_token" in config:
             return config["add_bos_token"]
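
The name-based checks in this hunk can be exercised in isolation. The following is a minimal sketch, not the litgpt implementation: `uses_bos_token` is a hypothetical standalone function, the config is passed in as a dict instead of being read from `tokenizer_config.json`, and the final `return False` is an assumption because the diff is cut off after the `add_bos_token` lookup.

```python
from pathlib import Path


def uses_bos_token(checkpoint_dir: Path, config: dict) -> bool:
    """Hypothetical mirror of the checks shown in the patched hunk."""
    name = checkpoint_dir.stem
    # Llama 3 checkpoints: always treated as using a BOS token.
    if name.startswith(("Meta-Llama-3", "Llama-3")):
        return True
    # SmolLM2: only the "-Instruct" variants are forced to True by name.
    if name.startswith("SmolLM2") and name.endswith("-Instruct"):
        return True
    # Otherwise defer to the tokenizer config when it has the key.
    if "add_bos_token" in config:
        return config["add_bos_token"]
    # Assumption: the rest of the real method is not visible in the diff.
    return False


instruct = Path("checkpoints/HuggingFaceTB/SmolLM2-135M-Instruct")
base = Path("checkpoints/HuggingFaceTB/SmolLM2-135M")
print(uses_bos_token(instruct, {}))                      # forced by name
print(uses_bos_token(base, {"add_bos_token": False}))    # falls through to config
```

With the change, a base SmolLM2 checkpoint no longer short-circuits to `True` and instead falls through to the config lookup. One thing worth checking when reusing this pattern: `Path.stem` drops everything after the last dot in the final path component, so a directory named `SmolLM2-1.7B-Instruct` has the stem `SmolLM2-1` and would not match the `-Instruct` suffix check in this sketch.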