
AWQ with Marlin kernel erroring out while loading the model in DJL 0.29 with vllm #2486

Open
guptaanshul201989 opened this issue Oct 24, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@guptaanshul201989

Description

I am trying to host a quantized Mistral Instruct v0.2 model. I am using AWQ+Marlin for quantization.

After quantization, I can run the model successfully using transformers+autoawq. However, when I try to host the model via DJL 0.29 + vllm, I encounter an error.
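
For reference, the transformers + autoawq check that succeeds looks roughly like this (a minimal sketch; the checkpoint path and prompt are placeholders rather than the exact commands used):

    # Sketch: sanity-check the quantized checkpoint with autoawq directly
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    quant_path = "<output_path>"  # directory produced by save_quantized()

    model = AutoAWQForCausalLM.from_quantized(quant_path)
    tokenizer = AutoTokenizer.from_pretrained(quant_path)

    inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))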

Expected Behavior

The model should load and serve requests without errors.

Error Message

WARN  PyProcess W-19109-model-stderr: Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
INFO  PyProcess W-19109-model-stdout: Failed invoke service.invoke_handler()
WARN  PyProcess W-19109-model-stderr: 
WARN  PyProcess W-19109-model-stderr: Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
INFO  PyProcess W-19109-model-stdout: Traceback (most recent call last):
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python_engine.py", line 161, in run_server
INFO  PyProcess W-19109-model-stdout:     outputs = self.service.invoke_handler(function_name, inputs)
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/service_loader.py", line 30, in invoke_handler
INFO  PyProcess W-19109-model-stdout:     return getattr(self.module, function_name)(inputs)
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 538, in handle
INFO  PyProcess W-19109-model-stdout:     _service.initialize(inputs.get_properties())
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 135, in initialize
INFO  PyProcess W-19109-model-stdout:     self.rolling_batch = _rolling_batch_cls(
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/rolling_batch/vllm_rolling_batch.py", line 48, in __init__
INFO  PyProcess W-19109-model-stdout:     self.engine = LLMEngine.from_engine_args(args)
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 441, in from_engine_args
INFO  PyProcess W-19109-model-stdout:     engine = cls(
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251, in __init__
INFO  PyProcess W-19109-model-stdout:     self.model_executor = executor_class(
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
INFO  PyProcess W-19109-model-stdout:     self._init_executor()
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
INFO  PyProcess W-19109-model-stdout:     self.driver_worker.load_model()
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
INFO  PyProcess W-19109-model-stdout:     self.model_runner.load_model()
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 682, in load_model
INFO  PyProcess W-19109-model-stdout:     self.model = get_model(model_config=self.model_config,
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
INFO  PyProcess W-19109-model-stdout:     return loader.load_model(model_config=model_config,
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 283, in load_model
INFO  PyProcess W-19109-model-stdout:     model.load_weights(
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 511, in load_weights
INFO  PyProcess W-19109-model-stdout:     weight_loader(param, loaded_weight)
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 758, in weight_loader
INFO  PyProcess W-19109-model-stdout:     loaded_weight = loaded_weight.narrow(input_dim, start_idx,
INFO  PyProcess W-19109-model-stdout: RuntimeError: start (0) + length (14336) exceeds dimension size (896).
INFO  PyProcess Stop process: -1:19109, failure=false
INFO  PyProcess W-19109-model-stdout: Python engine process died
INFO  PyProcess W-19109-model-stdout: Traceback (most recent call last):
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python_engine.py", line 207, in main
INFO  PyProcess W-19109-model-stdout:     engine.run_server()
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python_engine.py", line 125, in run_server
INFO  PyProcess W-19109-model-stdout:     inputs.read(cl_socket)
INFO  PyProcess Stop process: -1:19109, failure=true
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/inputs.py", line 221, in read
INFO  PyProcess Failure count: 0
INFO  PyProcess W-19109-model-stdout:     prop_size = retrieve_short(conn)
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/inputs.py", line 60, in retrieve_short
INFO  PyProcess W-19109-model-stdout:     data = retrieve_buffer(conn, 2)
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/inputs.py", line 36, in retrieve_buffer
INFO  PyProcess W-19109-model-stdout:     raise ValueError("Connection disconnected")
INFO  PyProcess W-19109-model-stdout: ValueError: Connection disconnected
INFO  PyProcess ReaderThread(0) stopped - W-19109-model-stdout

How to Reproduce?

Steps to reproduce


  1. Quantize Mistral Instruct v0.2:

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_id = <local_path_to_mistral_instruct>

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    quant_config = {"zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin"}

    quantized_model = AutoAWQForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True, use_cache=False)

    quantized_model.quantize(tokenizer, quant_config=quant_config)

    quantized_model.save_quantized(<output_path>)
    tokenizer.save_pretrained(<output_path>)

  2. Host the model with DJL 0.29 + vllm using the following serving configuration:
option.model_id=<s3 path to quantized model>
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.max_model_len=4096
option.enable_prefix_caching=true
option.max_rolling_batch_size=4
option.dtype=fp16
load_on_device=*
gpu.minWorkers=3
gpu.maxWorkers=3
option.gpu_memory_utilization=0.3

What have you tried to solve it?

1. I tried setting different quant_method values, thinking it might be a configuration mismatch, but that was not the case. In fact, without specifying option.quantize at all, vllm correctly detected the quantization method and version, yet it still resulted in the error above (an illustrative variant of the configuration I tried is sketched below).
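
For illustration, one such variant looked roughly like this (a sketch; option.quantize is the key that maps to vLLM's quantization argument, and the value shown is only an example):

    option.model_id=<s3 path to quantized model>
    option.rolling_batch=vllm
    option.quantize=awq
    option.tensor_parallel_degree=1
    option.dtype=fp16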

guptaanshul201989 added the bug label on Oct 24, 2024
@siddvenk
Contributor

I am able to reproduce this issue with DJL 0.29.0 (vllm 0.5.3.post1) and DJL 0.30.0 (vllm 0.6.2). I can also reproduce it with vllm directly, as you pointed out.

This is definitely a vllm issue, and until they fix it, it will be present in DJL. While it involves a different model, the same error is reported in vllm-project/vllm#3392. That issue is marked closed, but people are still reporting the problem (on vllm 0.6.3). I'll see if I can get traction from the vllm team.
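
For anyone trying to isolate this outside of DJL, the direct-vLLM reproduction is roughly the following (a sketch; the model path is a placeholder and vLLM is left to auto-detect the AWQ+Marlin quantization, as noted in the original report):

    # Sketch: reproduce the failure with vLLM directly, bypassing DJL
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="<local_path_to_quantized_model>",  # AWQ + Marlin checkpoint
        dtype="float16",
        max_model_len=4096,
        gpu_memory_utilization=0.3,
    )  # fails while loading weights with the same narrow() RuntimeError

    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(outputs[0].outputs[0].text)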

@siddvenk
Contributor

siddvenk commented Oct 31, 2024

It does seem like vLLM supports converting a regular AWQ model to Marlin format at load time, but it does not support being given a checkpoint that is already quantized into Marlin format. See vllm-project/vllm#7517. Unfortunately this really is a vllm issue, so until it is fixed there is not much we can do on the DJL side.

Are you able to quantize with plain AWQ (without Marlin) and then use vllm, which will apply Marlin at runtime? A sketch of that quantization step is below.
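
The step would look roughly like the original quantization code but with AutoAWQ's default GEMM config instead of version="Marlin" (a sketch with placeholder paths; vLLM can then switch to its Marlin kernels at load time where the hardware supports them):

    # Sketch: quantize with plain AWQ (GEMM) and let vLLM apply Marlin at runtime
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_id = "<local_path_to_mistral_instruct>"
    output_path = "<output_path>"

    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoAWQForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True, use_cache=False)

    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(output_path)
    tokenizer.save_pretrained(output_path)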
