Check and confirm if the GPT4 tokeniser is same as gpt2? From what I recall, this is wrong. The tokeniser depends on the LLM. #10

NirantK · 2023-08-02T05:12:45Z

Check and confirm whether the GPT-4 tokeniser is the same as gpt2. From what I recall, it is not: the tokeniser depends on the LLM.

Originally posted by @NirantK in #8 (comment)

adivik2000 · 2023-08-02T05:19:36Z

This is from https://github.com/openai/tiktoken/blob/main/tiktoken/model.py. What would you suggest?

MODEL_PREFIX_TO_ENCODING: dict[str, str] = {
    # chat
    "gpt-4-": "cl100k_base",  # e.g., gpt-4-0314, etc., plus gpt-4-32k
    "gpt-3.5-turbo-": "cl100k_base",  # e.g, gpt-3.5-turbo-0301, -0401, etc.
    "gpt-35-turbo": "cl100k_base",  # Azure deployment name
}

MODEL_TO_ENCODING: dict[str, str] = {
    # chat
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "gpt-35-turbo": "cl100k_base",  # Azure deployment name
    # text
    "text-davinci-003": "p50k_base",
    "text-davinci-002": "p50k_base",
    "text-davinci-001": "r50k_base",
    "text-curie-001": "r50k_base",
    "text-babbage-001": "r50k_base",
    "text-ada-001": "r50k_base",
    "davinci": "r50k_base",
    "curie": "r50k_base",
    "babbage": "r50k_base",
    "ada": "r50k_base",
    # code
    "code-davinci-002": "p50k_base",
    "code-davinci-001": "p50k_base",
    "code-cushman-002": "p50k_base",
    "code-cushman-001": "p50k_base",
    "davinci-codex": "p50k_base",
    "cushman-codex": "p50k_base",
    # edit
    "text-davinci-edit-001": "p50k_edit",
    "code-davinci-edit-001": "p50k_edit",
    # embeddings
    "text-embedding-ada-002": "cl100k_base",
    # old embeddings
    "text-similarity-davinci-001": "r50k_base",
    "text-similarity-curie-001": "r50k_base",
    "text-similarity-babbage-001": "r50k_base",
    "text-similarity-ada-001": "r50k_base",
    "text-search-davinci-doc-001": "r50k_base",
    "text-search-curie-doc-001": "r50k_base",
    "text-search-babbage-doc-001": "r50k_base",
    "text-search-ada-doc-001": "r50k_base",
    "code-search-babbage-code-001": "r50k_base",
    "code-search-ada-code-001": "r50k_base",
    # open source
    "gpt2": "gpt2",
}
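
For readers following along, here is a minimal sketch of how two tables like these can be consulted: exact model name first, then prefix match. The table contents are a trimmed copy of the listing above; the names `EXACT`, `PREFIXES`, and `resolve_encoding_name` are hypothetical and are not tiktoken's API.

```python
from typing import Optional

# Trimmed copies of the two tables quoted above (illustrative subset only).
PREFIXES = {
    "gpt-4-": "cl100k_base",
    "gpt-3.5-turbo-": "cl100k_base",
}
EXACT = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-davinci-003": "p50k_base",
    "davinci": "r50k_base",
    "gpt2": "gpt2",
}


def resolve_encoding_name(model: str) -> Optional[str]:
    """Return the encoding name for a model, or None if it is unknown."""
    # An exact model name takes priority.
    if model in EXACT:
        return EXACT[model]
    # Otherwise try versioned names such as "gpt-4-0314".
    for prefix, encoding_name in PREFIXES.items():
        if model.startswith(prefix):
            return encoding_name
    return None
```

Under this sketch, `gpt-4` and `gpt-4-0314` both resolve to `cl100k_base`, while `gpt2` resolves to the separate `gpt2` encoding, which is the difference the issue is about.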

NirantK · 2023-08-04T11:20:23Z

Yes, cl100k_base is what we should use here, not the gpt2 tokeniser; the two are indeed different. Better still, we can parameterise this: take the model name as input and pass it directly to the tiktoken library. That way, we support all the tokenisers that OpenAI has.
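
A rough sketch of that parameterisation, assuming the `tiktoken` package is installed. `count_tokens`, `_tiktoken_lookup`, and the `lookup` parameter are hypothetical names for illustration, not code from this repository; the fallback to `cl100k_base` for unrecognised model names is likewise an assumption.

```python
from typing import Callable, List, Protocol


class Encoder(Protocol):
    """Anything with an encode() method that returns token ids."""

    def encode(self, text: str) -> List[int]: ...


def _tiktoken_lookup(model: str) -> Encoder:
    # Imported lazily so this module loads even without tiktoken installed.
    import tiktoken

    try:
        # tiktoken maps the model name to its encoding.
        return tiktoken.encoding_for_model(model)
    except KeyError:
        # Assumption: fall back to cl100k_base for unrecognised model names.
        return tiktoken.get_encoding("cl100k_base")


def count_tokens(
    text: str,
    model: str = "gpt-4",
    lookup: Callable[[str], Encoder] = _tiktoken_lookup,
) -> int:
    """Count tokens in `text` using the tokeniser that belongs to `model`."""
    return len(lookup(model).encode(text))
```

Callers would then pass the model name through, for example `count_tokens(message, model="gpt-3.5-turbo")`, instead of hard-coding the gpt2 tokeniser. The `lookup` argument exists only so the function can be exercised with a stand-in encoder in tests.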

adivik2000 mentioned this issue Aug 8, 2023

gpt model addn to conversation and seperate trimmed history + formatting #13

Merged

NirantK closed this as completed Aug 17, 2023
