
vLLM error: RuntimeError: Unsupported head size: 160 #61

Open
lichengyang666 opened this issue Aug 6, 2024 · 35 comments

Comments

@lichengyang666

Does vLLM only support the 12B model, or are other parameter sizes supported as well? Are both v1 and v2 of the 12B model supported, or only v2? Right now v1 fails with RuntimeError: Unsupported head size: 160. Is this a version issue or something else?
Looking forward to your reply~

@lichengyang666
Author

Also, your code example doesn't seem to use template_telechat.jinja. Could you add that parameter?

@wkkkkkkkm

@lichengyang666
Here is the example I use; give it a try. Just add a --chat-template argument:
python -m vllm.entrypoints.openai.api_server \
    --model TeleChat-12B \
    --chat-template TeleChat-12B/template_telechat.jinja \
    --trust-remote-code
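
Once the server above is running, a quick sanity check (and a way to confirm the chat template is being applied) is to call the OpenAI-compatible endpoint directly. A minimal sketch, assuming the default localhost:8000 address and that the model is served under the name TeleChat-12B; adjust both for your deployment:

# Minimal sketch: query the OpenAI-compatible server started above.
# "TeleChat-12B" and localhost:8000 are assumptions; change them to match your setup.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "TeleChat-12B",
        "messages": [{"role": "user", "content": "你好,请介绍一下你自己"}],
        "max_tokens": 128,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])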

@ge-xing

ge-xing commented Aug 7, 2024


Have you tried v2? Did it have the same problem? In principle both are supported; v1 and v2 share the same architecture. If you hit this error, check whether your GPU supports flash-attn; with flash-attn installed, this error should not occur.
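
A quick way to check both points (GPU support and whether a flash-attention package is actually importable) is a small script like the sketch below; vllm_flash_attn and flash_attn are the module names usually shipped by the vllm-flash-attn and flash-attn packages, adjust if yours differ:

# Sketch: check GPU compute capability and flash-attention availability.
# FlashAttention 2 generally needs compute capability >= 8.0 (Ampere or newer);
# on older GPUs vLLM falls back to the XFormers backend.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"compute capability: {major}.{minor}")

for name in ("vllm_flash_attn", "flash_attn"):
    try:
        __import__(name)
        print(f"{name}: importable")
    except ImportError as err:
        print(f"{name}: not importable ({err})")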

@lichengyang666
Author


vllm==0.5.1+cu118
torch==2.3.0+cu118
transformers==4.43.4
xformers==0.0.26.post1+cu118
vllm-flash-attn==2.5.9
flash-attn==2.5.9.post1
Strangely, the 1B model doesn't hit this error for me, but 12B v1 does.

@lichengyang666
Author

Regarding @wkkkkkkkm's command above: is that v1 or v2? And can this argument also be added to the provided code example? When I add it, I get an error...

@wkkkkkkkm

V1, launched from the command line.

@ge-xing

ge-xing commented Aug 7, 2024

Maybe try uninstalling xformers. The paged attention op is, I believe, calling into xformers, which indeed does not support a head size of 160 right now; this will be fixed later. With flash-attn this problem does not occur.
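
For what it's worth, recent vLLM releases also expose a VLLM_ATTENTION_BACKEND environment variable that pins the backend explicitly, which avoids uninstalling xformers. A hedged sketch (set the variable before vLLM is imported, or export it in the shell before launching the api_server; the model path is a placeholder):

# Sketch: pin the attention backend so the xformers paged-attention path
# (which rejects head size 160) is never selected. Must be set before vLLM is imported.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM

llm = LLM(model="TeleChat-12B", trust_remote_code=True)  # placeholder model path
print(llm.generate(["你好"])[0].outputs[0].text)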

@lichengyang666
Author

After uninstalling xformers, vLLM reports an error:
[screenshot]

@ge-xing

ge-xing commented Aug 7, 2024

[screenshot] It looks like something is causing a different backend to be selected; normally it should be flash-attn.

@lichengyang666
Author

I think I found the cause:
[screenshot]
When I run the 1B model it reports that vllm_flash_attn was not found, yet it still produces results using XFormers; when I run 12B, XFormers throws the error.
But I have already installed the dependencies below. Could one of these versions be the problem?
vllm==0.5.1+cu118
torch==2.3.0+cu118
transformers==4.43.4
xformers==0.0.26.post1+cu118
vllm-flash-attn==2.5.9
flash-attn==2.5.9.post1

@ge-xing

ge-xing commented Aug 7, 2024

You have vllm-flash-attn installed, so it's odd that it still reports not found. You did follow the steps in the README, right? We've tested 12B many times and this problem shouldn't occur, and your GPU does support flash-attn. Here is my environment for reference; it doesn't include the flash-attn package, only vllm-flash-attn.
[screenshot]

@lichengyang666
Author

After switching to the dependency versions you gave, vllm-flash-attn is still reported as not found and vLLM falls back to xformers, so 12B still fails on the head size of 160; the 1B model works either way, whether through the code example or the server launch.

I then tried upgrading vLLM and the related dependencies to the versions below. With these, the official code examples for both 1B and 12B run fine, but launching python -m vllm.entrypoints.openai.api_server loads the weights into GPU memory and then hangs with no response for a long time, and no API endpoint is ever exposed. Have you run into this?
vllm==0.5.4+cu118
vllm-flash-attn==2.6.1+cu118
torch==2.4.0+cu118
transformers==4.44.0
xformers==0.0.27.post2

@ge-xing

ge-xing commented Aug 8, 2024

I'll need to test this; I'll update you once I have results.

@lichengyang666
Author

OK, looking forward to your reply.

@huyuan-cn

https://uamucg0t6qg.feishu.cn/docx/KrXtdZ8KFomTztx2twUcVxzRnZd

Follow this note and disable flash-attn.

@huyuan-cn

12B v1 and v2 are both supported, and 1B as well; all of them have been deployed and verified.

@lichengyang666
Author

I followed those steps once, but for some reason flash-attn still couldn't be disabled.

@huyuan-cn

Go through it carefully. Did you perhaps miss replacing modeling_telechat.py in the first part?

@whale567

Did you manage to solve it? I followed the same steps and still can't disable it.

@huyuan-cn

Did you miss a step? Replace modeling_telechat.py, modify the config, and use vLLM version 0.5.1.

@whale567

I went through the steps again and still got the error.
[screenshots]

@huyuan-cn

vi /root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/attention/ops/paged_attn.py

Modify it to add support for head size 160:
def get_supported_head_sizes() -> List[int]:
    return [64, 80, 96, 112, 128, 160, 256]
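
The site-packages path above is just one example; a quick way to find the exact file to edit in your own environment (a generic sketch):

# Print the location of the installed paged_attn.py so you know which file to edit.
import vllm.attention.ops.paged_attn as paged_attn
print(paged_attn.__file__)

Note that this only relaxes the Python-side check; as discussed further down in this thread, the compiled CUDA kernel has its own head-size switch, so the RuntimeError can still appear afterwards.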

@huyuan-cn

Did you perhaps miss adding support for head size 160? Find vLLM's installation path, locate that file, and add 160.

@whale567

I already added it. Without that change you get: ValueError: Head size 160 is not supported by PagedAttention. Supported head sizes are: [64, 80, 96, 112, 120, 128, 192, 256].

@jinzhangLi

jinzhangLi commented Sep 10, 2024

I've hacked everything and it still doesn't work; it should be a problem at vLLM's lower level. I made my changes against vLLM 0.4.0 and in the end came back to the same error.

INFO 09-10 15:36:31 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 09-10 15:36:31 selector.py:25] Using XFormers backend.
INFO 09-10 15:36:53 model_runner.py:104] Loading model weights took 23.0742 GB
INFO 09-10 15:36:55 gpu_executor.py:94] # GPU blocks: 546, # CPU blocks: 344
INFO 09-10 15:36:57 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-10 15:36:57 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out ofmemory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 157, in
engine = AsyncLLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 348, in from_engine_args
engine = cls(
^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 311, in init
self.engine = self._init_engine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 422, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 110, in init
self.model_executor = executor_class(model_config, cache_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 40, in init
self._init_cache()
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 107, in _init_cache
self.driver_worker.warm_up_model()
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/worker/worker.py", line 167, in warm_up_model
self.model_runner.capture_model(self.gpu_cache)
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 854, in capture_model
graph_runner.capture(
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 906, in capture
self.model(
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/model_executor/models/telechat_12B.py", line 577, in forward
model_output = self.transformer(input_ids, positions, kv_caches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/model_executor/models/telechat_12B.py", line 502, in forward
hidden_states = layer(
^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/model_executor/models/telechat_12B.py", line 448, in forward
attn_outputs = self.self_attention(
^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/model_executor/models/telechat_12B.py", line 388, in forward
attn_output = self.attn(query_layer, key_layer, value_layer, kv_cache, attn_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/attention/layer.py", line 46, in forward
return self.impl.forward(query, key, value, kv_cache, attn_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/attention/backends/xformers.py", line 277, in forward
output = PagedAttention.forward_decode(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/attention/ops/paged_attn.py", line 116, in forward_decode
ops.paged_attention_v1(
RuntimeError: Unsupported head size: 160
Exception raised from paged_attention_v1_launcher at /home/runner/work/vllm/vllm/csrc/attention/attention_kernels.cu:675 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5714fc7617 in /home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f5714f8298d in /home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: void paged_attention_v1_launcher<unsigned short, unsigned short, 16, false, 128>(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int, float, at::Tensor&, at::Tensor&, int, c10::optionalat::Tensor const&) + 0xc45 (0x7f56c4d7ee95 in /home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/_C.cpython-311-x86_64-linux-gnu.so)
frame #3: paged_attention_v1(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int, float, at::Tensor&, at::Tensor&, int, int, c10::optionalat::Tensor const&, std::string const&) + 0x562 (0x7f56c4d79042 in /home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/_C.cpython-311-x86_64-linux-gnu.so)
frame #4: + 0x977ec (0x7f56c4dc37ec in /home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/_C.cpython-311-x86_64-linux-gnu.so)
frame #5: + 0x91e0c (0x7f56c4dbde0c in /home/user/anaconda3/envs/Ljz-dev/lib/python3.11/site-packages/vllm/_C.cpython-311-x86_64-linux-gnu.so)
frame #6: python() [0x525f07]

frame #9: python() [0x55477f]

@lichengyang666
Author

You could try upgrading CUDA to 12.1 or above and then installing the versions mentioned above. The errors I reported earlier were probably because I was on CUDA 11.8.

@jinzhangLi

That didn't help. I'm on CUDA 12.2 and get the same error; it feels like a paged attention problem.

@lichengyang666
Author

vLLM 0.4.0 definitely won't work; you need at least 0.5.1.

@jinzhangLi

Of course it's hacked. But tell me why it isn't supported: is it that vLLM's compiled shared library (.so) inherently doesn't support it, or is it unsupported at the Python level? I can actually load the model into GPU memory successfully; it's the warm-up computation that crashes.

@jinzhangLi

0.5.1 has been working for quite a while. The person above can directly modify the validation function under vllm/attention/ops/paged_attn and add 160; nothing else in v0.5.1 should really be a problem. As for chat_template, you can load it from the jinja file's path, otherwise you get a computation error because the template is None. I was fixated on 0.4.0 because I don't think this is a version issue; I modified the source directly.
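
On the chat_template point, one way to apply the jinja template explicitly with the offline API is the sketch below; it assumes the template ships next to the weights as template_telechat.jinja and that the TeleChat tokenizer accepts apply_chat_template when the template is passed in explicitly:

# Sketch: load the TeleChat chat template from the jinja file and apply it
# before generation, so the prompt is never built from a None template.
# Paths and the model name are placeholders.
from pathlib import Path
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "TeleChat-12B"
template = Path(model_path, "template_telechat.jinja").read_text(encoding="utf-8")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "你好"}],
    chat_template=template,
    tokenize=False,
    add_generation_prompt=True,
)

llm = LLM(model=model_path, trust_remote_code=True)
print(llm.generate([prompt], SamplingParams(max_tokens=64))[0].outputs[0].text)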

@luyunfan

It's not just that: the CUDA kernels also have this check, so the CUDA code has to be changed as well and vLLM recompiled. The version doesn't matter; the latest vLLM works too.

@jinzhangLi

I see, so this is a limitation baked into the compiled shared library. Still, it's odd that the check is done at the compute layer.

@jinzhangLi

I found that this limitation comes from the vLLM source: around line 738 of csrc/attention/attention_kernels.cu there is a switch(head_size) with no case for 160, so a head size of 160 triggers this error. Could changing that code provide support across versions?

@luyunfan

Yes. Two places need the change: add a case for 160 in both the v1 and the v2 paged attention kernels, then recompile vLLM.

@hv0905

hv0905 commented Nov 11, 2024

https://uamucg0t6qg.feishu.cn/docx/KrXtdZ8KFomTztx2twUcVxzRnZd

Follow this note and disable flash-attn.

After disabling flash-attn, are you running with xformers?
When I run with xformers I get this error:

  File "/home/xk/project/vllm_test/telechat-support/vllm_inf/telechat_12B.py", line 269, in forward
    attn_outputs = self.self_attention(
  File "/home/xk/anaconda3/envs/vllm-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xk/anaconda3/envs/vllm-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xk/project/vllm_test/telechat-support/vllm_inf/telechat_12B.py", line 205, in forward
    attn_output = self.attn(query_layer, key_layer, value_layer, kv_cache, attn_metadata)
  File "/home/xk/anaconda3/envs/vllm-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xk/anaconda3/envs/vllm-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xk/anaconda3/envs/vllm-env/lib/python3.10/site-packages/vllm/attention/layer.py", line 100, in forward
    return self.impl.forward(query,
  File "/home/xk/anaconda3/envs/vllm-env/lib/python3.10/site-packages/vllm/attention/backends/xformers.py", line 648, in forward
    output[num_prefill_tokens:] = PagedAttention.forward_decode(
  File "/home/xk/anaconda3/envs/vllm-env/lib/python3.10/site-packages/vllm/attention/ops/paged_attn.py", line 131, in forward_decode
    ops.paged_attention_v1(
  File "/home/xk/anaconda3/envs/vllm-env/lib/python3.10/site-packages/vllm/_custom_ops.py", line 45, in wrapper
    return fn(*args, **kwargs)
  File "/home/xk/anaconda3/envs/vllm-env/lib/python3.10/site-packages/vllm/_custom_ops.py", line 115, in paged_attention_v1
    torch.ops._C.paged_attention_v1(
  File "/home/xk/anaconda3/envs/vllm-env/lib/python3.10/site-packages/torch/_ops.py", line 1061, in __call__
    return self_._op(*args, **(kwargs or {}))
RuntimeError: Unsupported head size: 160


8 participants