
Qwen Chat CUDA OutOfMemory #63

Open
xorange opened this issue May 31, 2024 · 2 comments

xorange commented May 31, 2024

RTX 4090 (24 GB), Qwen-7B-Chat

This loads OK:

model_config = ModelConfig(lora_infos={
    "lora_1": conf['lora_1'],
    "lora_2": conf['lora_2'],
})
model = ModelFactory.from_huggingface(conf['base_model_dir'], model_config=model_config)
pipeline = Pipeline(model, model.tokenizer)

But the following causes an OutOfMemoryError:

# rtp_sys.conf
#
# [
#     {"task_id": 1, "prompt": " <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>"}
# ]

import os
os.environ['MULTI_TASK_PROMPT'] = './rtp_sys.conf'
model_config = ModelConfig(lora_infos={
    "lora_1": conf['lora_1'],
    "lora_2": conf['lora_2'],
})
model = ModelFactory.from_huggingface(conf['base_model_dir'], model_config=model_config)
pipeline = Pipeline(model, model.tokenizer)


File "/data1/miniconda/xxx/rtp-llm/lib/python3.10/site-packages/maga_transformer/utils/model_weights_loader.py", line 304, in _load_layer_weight
    tensor = self._split_and_sanitize_tensor(tensor, weight).to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory.

I've tried both with and without export ENABLE_FMHA=OFF.
I'm referring to this link: SystemPrompt-Tutorial

For the record, my requirements here are:

  1. I have 2 LoRAs, and during one round of chat I have to switch between them.
  2. I need to use the chat interface. Since Qwen does not come with a chat_template, I need a way to implement "make_context" (see the sketch below).

Because of requirement 1, python3 -m maga_transformer.start_server + an HTTP POST with an OpenAI-style request is not an option. (Or, if it is possible to switch adapters on an already-running server, please tell me.)
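
For reference, this is roughly the kind of "make_context" I mean: building Qwen's ChatML prompt by hand. The build_chat_prompt helper below is just my own sketch, not something rtp-llm or Qwen provides:

from typing import List, Optional, Tuple

def build_chat_prompt(query: str,
                      history: Optional[List[Tuple[str, str]]] = None,
                      system: str = "You are a helpful assistant.") -> str:
    # Assemble a Qwen-style ChatML prompt: system block, prior turns, then
    # the new user query, ending with an open assistant block for the model
    # to complete.
    parts = [f"<|im_start|>system\n{system}<|im_end|>"]
    for user_turn, assistant_turn in (history or []):
        parts.append(f"<|im_start|>user\n{user_turn}<|im_end|>")
        parts.append(f"<|im_start|>assistant\n{assistant_turn}<|im_end|>")
    parts.append(f"<|im_start|>user\n{query}<|im_end|>")
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

# e.g. prompt = build_chat_prompt("Hello, who are you?"), then feed it to pipeline(...)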

netaddi (Collaborator) commented May 31, 2024

Hi there,
CUDA OOM is usually expected behaviour when the model plus its runtime buffers exceed the available GPU memory, and that looks possible with your setup.
Maybe you can try int8 quantization, which saves a lot of CUDA memory.
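
For example, something along these lines; the exact quantization switch depends on the rtp-llm version, so treat int8_mode=1 below as an assumption and check the quantization docs for the option your build actually accepts:

# Sketch only: load the same model with int8 weight quantization enabled.
# "int8_mode=1" is an assumed parameter name, not a verified ModelConfig field;
# consult the rtp-llm quantization documentation for the real option.
model_config = ModelConfig(
    lora_infos={
        "lora_1": conf['lora_1'],
        "lora_2": conf['lora_2'],
    },
    int8_mode=1,
)
model = ModelFactory.from_huggingface(conf['base_model_dir'], model_config=model_config)
pipeline = Pipeline(model, model.tokenizer)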

xorange (Author) commented May 31, 2024

> CUDA OOM is usually expected behaviour when the model plus its runtime buffers exceed the available GPU memory, and that looks possible with your setup. Maybe you can try int8 quantization, which saves a lot of CUDA memory.

I'm not sure why rtp-llm loads this model successfully but then fails when provided with a chat template.

I hadn't even started to chat.
