Description
I am trying to host a quantized Mistral Instruct v0.2 model, quantized with AWQ + Marlin.
After quantization, I can run the model successfully using transformers + autoawq. However, when I try to host the same checkpoint via DJL 0.29 + vllm, it fails while loading the weights.
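For context, the checkpoint comes from the usual autoawq flow; a minimal sketch is below, where the output path and the exact quantization settings are illustrative assumptions rather than the exact script:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_model = "mistralai/Mistral-7B-Instruct-v0.2"
quant_path = "./mistral-7b-instruct-v0.2-awq-marlin"  # illustrative output dir

model = AutoAWQForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# "version": "Marlin" serializes the weights in the Marlin kernel layout;
# AutoAWQ's Marlin path uses symmetric quantization (no zero point).
model.quantize(
    tokenizer,
    quant_config={"zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin"},
)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# Loading the checkpoint back through autoawq works fine...
model = AutoAWQForCausalLM.from_quantized(quant_path)
# ...but serving the same checkpoint through DJL + vllm fails as shown below.
```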
Expected Behavior
The model should load and serve requests under DJL + vllm without errors.
Error Message
WARN PyProcess W-19109-model-stderr: Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
INFO PyProcess W-19109-model-stdout: Failed invoke service.invoke_handler()
WARN PyProcess W-19109-model-stderr:
WARN PyProcess W-19109-model-stderr: Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
INFO PyProcess W-19109-model-stdout: Traceback (most recent call last):
INFO PyProcess W-19109-model-stdout: File "/tmp/.djl.ai/python/0.29.0/djl_python_engine.py", line 161, in run_server
INFO PyProcess W-19109-model-stdout: outputs = self.service.invoke_handler(function_name, inputs)
INFO PyProcess W-19109-model-stdout: File "/tmp/.djl.ai/python/0.29.0/djl_python/service_loader.py", line 30, in invoke_handler
INFO PyProcess W-19109-model-stdout: return getattr(self.module, function_name)(inputs)
INFO PyProcess W-19109-model-stdout: File "/tmp/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 538, in handle
INFO PyProcess W-19109-model-stdout: _service.initialize(inputs.get_properties())
INFO PyProcess W-19109-model-stdout: File "/tmp/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 135, in initialize
INFO PyProcess W-19109-model-stdout: self.rolling_batch = _rolling_batch_cls(
INFO PyProcess W-19109-model-stdout: File "/tmp/.djl.ai/python/0.29.0/djl_python/rolling_batch/vllm_rolling_batch.py", line 48, in __init__
INFO PyProcess W-19109-model-stdout: self.engine = LLMEngine.from_engine_args(args)
INFO PyProcess W-19109-model-stdout: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 441, in from_engine_args
INFO PyProcess W-19109-model-stdout: engine = cls(
INFO PyProcess W-19109-model-stdout: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251, in __init__
INFO PyProcess W-19109-model-stdout: self.model_executor = executor_class(
INFO PyProcess W-19109-model-stdout: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
INFO PyProcess W-19109-model-stdout: self._init_executor()
INFO PyProcess W-19109-model-stdout: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
INFO PyProcess W-19109-model-stdout: self.driver_worker.load_model()
INFO PyProcess W-19109-model-stdout: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
INFO PyProcess W-19109-model-stdout: self.model_runner.load_model()
INFO PyProcess W-19109-model-stdout: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 682, in load_model
INFO PyProcess W-19109-model-stdout: self.model = get_model(model_config=self.model_config,
INFO PyProcess W-19109-model-stdout: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
INFO PyProcess W-19109-model-stdout: return loader.load_model(model_config=model_config,
INFO PyProcess W-19109-model-stdout: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 283, in load_model
INFO PyProcess W-19109-model-stdout: model.load_weights(
INFO PyProcess W-19109-model-stdout: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 511, in load_weights
INFO PyProcess W-19109-model-stdout: weight_loader(param, loaded_weight)
INFO PyProcess W-19109-model-stdout: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 758, in weight_loader
INFO PyProcess W-19109-model-stdout: loaded_weight = loaded_weight.narrow(input_dim, start_idx,
INFO PyProcess W-19109-model-stdout: RuntimeError: start (0) + length (14336) exceeds dimension size (896).
INFO PyProcess Stop process: -1:19109, failure=false
INFO PyProcess W-19109-model-stdout: Python engine process died
INFO PyProcess W-19109-model-stdout: Traceback (most recent call last):
INFO PyProcess W-19109-model-stdout: File "/tmp/.djl.ai/python/0.29.0/djl_python_engine.py", line 207, in main
INFO PyProcess W-19109-model-stdout: engine.run_server()
INFO PyProcess W-19109-model-stdout: File "/tmp/.djl.ai/python/0.29.0/djl_python_engine.py", line 125, in run_server
INFO PyProcess W-19109-model-stdout: inputs.read(cl_socket)
INFO PyProcess Stop process: -1:19109, failure=true
INFO PyProcess W-19109-model-stdout: File "/tmp/.djl.ai/python/0.29.0/djl_python/inputs.py", line 221, in read
INFO PyProcess Failure count: 0
INFO PyProcess W-19109-model-stdout: prop_size = retrieve_short(conn)
INFO PyProcess W-19109-model-stdout: File "/tmp/.djl.ai/python/0.29.0/djl_python/inputs.py", line 60, in retrieve_short
INFO PyProcess W-19109-model-stdout: data = retrieve_buffer(conn, 2)
INFO PyProcess W-19109-model-stdout: File "/tmp/.djl.ai/python/0.29.0/djl_python/inputs.py", line 36, in retrieve_buffer
INFO PyProcess W-19109-model-stdout: raise ValueError("Connection disconnected")
INFO PyProcess W-19109-model-stdout: ValueError: Connection disconnected
INFO PyProcess ReaderThread(0) stopped - W-19109-model-stdout
How to Reproduce?
Steps to reproduce
(Paste the commands you ran that produced the error.)
What have you tried to solve it?
1. I tried providing different quant_method values, thinking it might be a configuration mismatch, but that wasn't the case. In fact, without specifying any options.quantize, vllm correctly detected the quantization method and version, yet it still failed with the error above.
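A serving config along the lines of the sketch below should be enough to hit this path; the model location and batch settings here are placeholder assumptions, not the real file:

```
engine=Python
option.model_id=/opt/ml/model/mistral-7b-instruct-v0.2-awq-marlin
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.max_rolling_batch_size=32
# Setting this explicitly, or omitting it so vllm auto-detects the method, made no difference.
option.quantize=awq
```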
I am able to reproduce this issue with DJL 0.29.0 (vllm 0.5.3.post1) and DJL 0.30.0 (vllm 0.6.2). I am also able to reproduce this issue with vllm directly, as you pointed out.
This is definitely a vllm issue, and until they fix it, it will be present in DJL. While not the same model, I did see a similar report in vllm-project/vllm#3392. It's marked closed, but folks are still reporting it (on vllm 0.6.3). I'll see if I can get traction from the vllm side.
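For reference, it reproduces with vllm's offline API alone; a minimal sketch, with an illustrative checkpoint path:

```python
from vllm import LLM

# Point vllm directly at the AWQ+Marlin checkpoint produced by autoawq.
# The quantization method/version is auto-detected from the checkpoint config,
# but weight loading fails with:
#   RuntimeError: start (0) + length (14336) exceeds dimension size (896).
llm = LLM(model="./mistral-7b-instruct-v0.2-awq-marlin")
print(llm.generate("Hello")[0].outputs[0].text)
```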
It does seem like vLLM supports converting a regular AWQ model to Marlin format at load time, but it doesn't support a checkpoint that was already serialized in Marlin format being supplied directly. See vllm-project/vllm#7517. Unfortunately this really seems like a vllm issue, so until it's fixed there is not much we can do.
Are you able to quantize with plain AWQ (without Marlin) and then use vllm, which will apply Marlin at runtime?
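Roughly, that workaround flow would look like the sketch below; the paths and quantization settings are illustrative and unverified:

```python
from vllm import LLM

# Quantize with autoawq's default GEMM layout instead of Marlin, e.g.
#   quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
# and save to a plain AWQ checkpoint directory.

# vllm detects the AWQ checkpoint and, on GPUs that support it, upgrades the
# kernels to awq_marlin at load time; pass quantization="awq" to opt out of that.
llm = LLM(model="./mistral-7b-instruct-v0.2-awq")
print(llm.generate("Hello")[0].outputs[0].text)
```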