
Qwen Chat CUDA OutOfMemory #63

Open
xorange opened this issue May 31, 2024 · 2 comments

xorange commented May 31, 2024

RTX 4090 (24 GB), Qwen-7B-Chat

This loads OK:

model_config = ModelConfig(lora_infos={
    "lora_1": conf['lora_1'],
    "lora_2": conf['lora_2'],
})
model = ModelFactory.from_huggingface(conf['base_model_dir'], model_config=model_config)
pipeline = Pipeline(model, model.tokenizer)

But the following causes an OutOfMemoryError:

# rtp_sys.conf
#
# [
#     {"task_id": 1, "prompt": " <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>"}
# ]

import os
os.environ['MULTI_TASK_PROMPT'] = './rtp_sys.conf'
model_config = ModelConfig(lora_infos={
    "lora_1": conf['lora_1'],
    "lora_2": conf['lora_2'],
})
model = ModelFactory.from_huggingface(conf['base_model_dir'], model_config=model_config)
pipeline = Pipeline(model, model.tokenizer)


File "/data1/miniconda/xxx/rtp-llm/lib/python3.10/site-packages/maga_transformer/utils/model_weights_loader.py", line 304, in _load_layer_weight
    tensor = self._split_and_sanitize_tensor(tensor, weight).to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory.

I've tried both with and without export ENABLE_FMHA=OFF.
I'm referring to this link: SystemPrompt-Tutorial

For the record, my requirements here are:

  1. I have 2 LoRAs, and during one round of chat I have to switch between them.
  2. I need to use the chat interface. Since Qwen does not come with a chat_template, I need a way to implement "make_context" (see the sketch below).

Because of requirement 1, python3 -m maga_transformer.start_server + an HTTP POST with an OpenAI-style request is not an option. (Or, if it is possible to switch adapters on an already-running server, please tell me.)
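
For reference, this is roughly the kind of "make_context" I mean: building Qwen's ChatML prompt by hand. The build_chat_prompt helper below is just my own sketch, not something rtp-llm or Qwen provides:

from typing import List, Optional, Tuple

def build_chat_prompt(query: str,
                      history: Optional[List[Tuple[str, str]]] = None,
                      system: str = "You are a helpful assistant.") -> str:
    # Assemble a Qwen-style ChatML prompt: system block, prior turns, then
    # the new user query, ending with an open assistant block for the model
    # to complete.
    parts = [f"<|im_start|>system\n{system}<|im_end|>"]
    for user_turn, assistant_turn in (history or []):
        parts.append(f"<|im_start|>user\n{user_turn}<|im_end|>")
        parts.append(f"<|im_start|>assistant\n{assistant_turn}<|im_end|>")
    parts.append(f"<|im_start|>user\n{query}<|im_end|>")
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

# e.g. prompt = build_chat_prompt("Hello, who are you?"), then feed it to pipeline(...)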

netaddi (Collaborator) commented May 31, 2024

Hi there,
CUDA OOM is usually expected behaviour when the model plus its runtime buffers exceed the available GPU memory, and that looks possible with your setup.
Maybe you can try int8 quantization, which saves a lot of CUDA memory.
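
For example, something along these lines; the exact quantization switch depends on the rtp-llm version, so treat int8_mode=1 below as an assumption and check the quantization docs for the option your build actually accepts:

# Sketch only: load the same model with int8 weight quantization enabled.
# "int8_mode=1" is an assumed parameter name, not a verified ModelConfig field;
# consult the rtp-llm quantization documentation for the real option.
model_config = ModelConfig(
    lora_infos={
        "lora_1": conf['lora_1'],
        "lora_2": conf['lora_2'],
    },
    int8_mode=1,
)
model = ModelFactory.from_huggingface(conf['base_model_dir'], model_config=model_config)
pipeline = Pipeline(model, model.tokenizer)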

xorange (Author) commented May 31, 2024

> CUDA OOM is usually expected behaviour when the model plus its runtime buffers exceed the available GPU memory, and that looks possible with your setup. Maybe you can try int8 quantization, which saves a lot of CUDA memory.

I'm not sure why rtp-llm loads this model successfully but then fails when provided with a chat template.

I hadn't even started to chat.
