OpenAI has a pre-defined ctx_len for each of their models (for example, gpt-3.5-turbo has a ctx_len of 4096) and all the models are already loaded, which means you only need to call their chat/completions endpoint.
-> No setting for ctx_len because each model already has a fixed ctx_len
-> With Nitro you can set your own ctx_len because you load your own model
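Below is a minimal sketch of that difference, assuming Nitro's default port (3928) and its llamacpp loadmodel route; the API key and model path are placeholders, so verify the exact route and fields against the current Nitro docs:

```python
import requests

# OpenAI-hosted model: ctx_len is fixed server-side, so you only call chat/completions.
openai_resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_OPENAI_API_KEY"},  # placeholder key
    json={
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)

# Nitro-hosted model: you load your own GGUF file first and pick ctx_len at load time.
nitro_resp = requests.post(
    "http://localhost:3928/inferences/llamacpp/loadmodel",  # assumed default Nitro port/route
    json={
        "llama_model_path": "/path/to/your-model.gguf",  # placeholder path
        "ctx_len": 2048,  # you decide the context window here
        "ngl": 100,       # GPU layers to offload, hardware dependent
    },
)
print(openai_resp.status_code, nitro_resp.status_code)
```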
Q: What is the difference between ctx_len and max_token?
max_token: the maximum number of tokens the model is allowed to generate during inference
Example: if max_token = 10 and the chat output runs past 10 tokens, the output is truncated (this is what caused the truncation issue last time)
ctx_len: the upper limit on the number of tokens the backend can process during inference; the relationship between ctx_len and max_token is that the chat tokens plus max_token must fit within ctx_len (chat_token + max_token <= ctx_len)
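A hedged sketch of how the two limits interact, assuming an OpenAI-compatible chat/completions route on a local Nitro server (the port, route, and token counts below are assumptions, not confirmed values):

```python
import requests

CTX_LEN = 2048        # set when the model was loaded
MAX_TOKEN = 10        # generation cap for this request
prompt_tokens = 1500  # hypothetical token count of the chat history

# Budget that must hold for a request with no truncation or context shift:
assert prompt_tokens + MAX_TOKEN <= CTX_LEN

resp = requests.post(
    "http://localhost:3928/v1/chat/completions",  # assumed OpenAI-compatible route
    json={
        "messages": [{"role": "user", "content": "Explain ctx_len in one paragraph."}],
        "max_tokens": MAX_TOKEN,  # generation stops after 10 tokens, even mid-sentence
    },
)
choice = resp.json()["choices"][0]
# finish_reason == "length" signals that the max_tokens cap cut the answer short.
print(choice["finish_reason"], choice["message"]["content"])
```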
Q: What happens if I input a chat that is longer than ctx_len (max_token + chat_token > ctx_len)?
In this scenario a context shift happens: inference drops the extra context that does not fit into ctx_len and keeps generating normally, but the model may lose its memory of anything that falls outside ctx_len.
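A simplified sketch of the idea (not the backend's actual implementation, which handles this inside llama.cpp at the KV-cache level): the oldest part of the chat is dropped so that what remains, plus the generation budget, fits in ctx_len.

```python
def fit_into_ctx(prompt_tokens: list[str], ctx_len: int, max_token: int) -> list[str]:
    """Drop the oldest tokens so the prompt plus generation budget fits in ctx_len."""
    budget = ctx_len - max_token
    if len(prompt_tokens) <= budget:
        return prompt_tokens
    # Everything before this index is effectively forgotten by the model.
    return prompt_tokens[len(prompt_tokens) - budget:]

tokens = [f"t{i}" for i in range(5000)]  # hypothetical 5000-token chat history
kept = fit_into_ctx(tokens, ctx_len=4096, max_token=512)
print(len(kept))  # 3584 tokens survive; the oldest 1416 are lost
```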
Q: What value should I set as the default max_token parameter?
Problem
We do not have a clear example on the Jan docs page right now.
Success Criteria
A simple example of a single model-loading case