1. I have 2 LoRAs, and during one round of chat I have to switch between them.
2. I need to use the chat interface. Since Qwen does not come with a chat_template, I need a way to implement "make_context" myself (a sketch follows this list).
3. Because of requirement 1, starting python3 -m maga_transformer.start_server and sending HTTP POSTs with OpenAI-style requests is not an option for me. (Or, if a different adapter can be switched on an already-running server, please tell me.)
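Since Qwen-7B-Chat's own make_context essentially assembles a ChatML-style prompt out of <|im_start|>role ... <|im_end|> blocks, a small helper along those lines may be enough when no chat_template is available. The sketch below is only an assumption of what such a helper could look like: the function name, signature, and default system prompt are illustrative, and the real make_context in Qwen's generation_utils.py additionally handles token-budget truncation, which is omitted here.

```python
from typing import List, Optional, Tuple

IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def make_context(query: str,
                 history: Optional[List[Tuple[str, str]]] = None,
                 system: str = "You are a helpful assistant.") -> str:
    """Assemble a ChatML-style prompt of the shape Qwen-7B-Chat expects."""
    history = history or []
    parts = [f"{IM_START}system\n{system}{IM_END}"]
    for user_turn, assistant_turn in history:
        parts.append(f"{IM_START}user\n{user_turn}{IM_END}")
        parts.append(f"{IM_START}assistant\n{assistant_turn}{IM_END}")
    parts.append(f"{IM_START}user\n{query}{IM_END}")
    parts.append(f"{IM_START}assistant\n")  # the model continues from here
    return "\n".join(parts)

# Example: second round of a chat, carrying the first round as history.
prompt = make_context("And in German?",
                      history=[("Say hello in French.", "Bonjour !")])
print(prompt)
```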
Hi there,
Usually a CUDA OOM is expected behaviour rather than a bug, and it seems that in your setup it can indeed happen.
Maybe you can try int8 quantization, which saves a lot of CUDA memory.
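For what it's worth, rtp-llm is configured largely through environment variables, so int8 weight-only quantization would be requested before the server process starts. The snippet below is a rough sketch under the assumption that INT8_MODE=1 is the switch for this; please verify the variable name against the rtp-llm docs for your version.

```python
# Rough sketch, not rtp-llm's documented API: set the (assumed) int8 switch
# in the environment and launch the server module with it.
import os
import subprocess
import sys

env = dict(os.environ, INT8_MODE="1")  # INT8_MODE=1 is an assumption; check your version's docs
subprocess.run(
    [sys.executable, "-m", "maga_transformer.start_server"],
    env=env,
    check=True,
)
```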
I'm not sure why rtp-llm loads this model successfully but then fails when provided with a chat template.
RTX 4090 24G, Qwen-7B-Chat.
This loads OK:
But the following causes an OutOfMemoryError:
I've tried with and without export ENABLE_FMHA=OFF.
I'm referring to this link: SystemPrompt-Tutorial.
For the record, my requirements are the ones already listed at the top of this thread.