
[Feature Request] Is there any plan to provide a Python wrapper of the CUDA kernels? #1

Open
PannenetsF opened this issue Sep 5, 2023 · 1 comment

Comments

@PannenetsF

Hi, these kernels are awesome for supporting prefill and generation in the same round, and they can be expected to yield better performance.

However, since most inference/serving frameworks are Python-based, the C++-only architecture keeps the project from being applied more widely. Is there any plan to wrap the kernels with pybind11 so they can be used from PyTorch? A rough sketch of what I have in mind is below.
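For reference, a minimal sketch of the kind of binding I mean, JIT-built as a PyTorch C++/CUDA extension. The file names and the `rmsnorm(x, weight, eps)` signature are hypothetical, not something this repo currently provides:

```python
# Minimal sketch: build a pybind11 binding for a standalone CUDA kernel
# with torch.utils.cpp_extension.load (JIT compilation).
import torch
from torch.utils.cpp_extension import load

# Hypothetical sources: "rmsnorm_binding.cpp" would hold the
# PYBIND11_MODULE glue, "rmsnorm.cu" the kernel itself.
ppl_kernels = load(
    name="ppl_kernels",
    sources=["rmsnorm_binding.cpp", "rmsnorm.cu"],
    extra_cuda_cflags=["-O3"],
    verbose=True,
)

x = torch.randn(1, 128, 4096, device="cuda", dtype=torch.float16)
w = torch.ones(4096, device="cuda", dtype=torch.float16)
y = ppl_kernels.rmsnorm(x, w, 1e-6)  # hypothetical signature
```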

@ZhangZhiPku

ZhangZhiPku commented Sep 22, 2023

I may have some free time to work on this during the National Day holiday (early October)...

Note, however, that simply dropping these kernels in will not by itself improve system performance. Some kernels, such as rmsnorm, are easy to adapt in a plug-and-play fashion and do bring some performance gain. But the decoding attention operator is entangled with dynamic batching optimization and GPU memory management: its input has the shape [1, seqlen, hidden_dim], and it must be given the available KV cache space. Before you can call it, you may have to manually rework the entire network structure and the memory management mechanism (see the sketch below). The int8-related kernels involve graph fusion and input layout transformations, and the weights need an fp16 -> int8 preprocessing pass; they likewise require reworking the entire network structure before they actually deliver a speedup.
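To make that calling contract concrete, here is a pure-PyTorch reference (not the PPL kernel; the head count, head dim, and cache size are made-up values) that mimics the described interface: a [1, seqlen, hidden_dim] input per step plus a caller-managed KV cache:

```python
# Pure-PyTorch reference for the decoding attention contract described
# above: the caller owns a preallocated KV cache, and each step feeds
# tensors of shape [1, seqlen, hidden_dim].
import torch

num_heads, head_dim = 32, 128          # made-up model dimensions
hidden_dim = num_heads * head_dim
max_seq = 2048

# Caller-managed cache: the real operator is handed this space
# instead of allocating memory per call.
k_cache = torch.empty(1, max_seq, num_heads, head_dim,
                      device="cuda", dtype=torch.float16)
v_cache = torch.empty_like(k_cache)

def decode_step(q, k, v, pos):
    """q, k, v: [1, seqlen, hidden_dim]; pos: tokens already cached.

    Assumes decode steps (seqlen == 1), so no causal mask is needed
    within the step itself.
    """
    seqlen = q.shape[1]
    k_cache[:, pos:pos + seqlen] = k.view(1, seqlen, num_heads, head_dim)
    v_cache[:, pos:pos + seqlen] = v.view(1, seqlen, num_heads, head_dim)
    total = pos + seqlen
    # Attend over everything cached so far: [1, heads, seqlen, head_dim].
    qh = q.view(1, seqlen, num_heads, head_dim).transpose(1, 2)
    kh = k_cache[:, :total].transpose(1, 2)
    vh = v_cache[:, :total].transpose(1, 2)
    scores = qh @ kh.transpose(-1, -2) / head_dim ** 0.5
    out = torch.softmax(scores.float(), dim=-1).to(qh.dtype) @ vh
    return out.transpose(1, 2).reshape(1, seqlen, hidden_dim)
```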

When driving CUDA functions from Python, you may also be limited by Python's own performance and unable to fully exploit the high-performance operators. This is especially pronounced at batch = 1: running a 7B model with the same kernels, system latency is roughly 17-18 ms when Python drives the CUDA calls, whereas PPL keeps it under 12 ms.
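If you want to observe that host-side overhead yourself, a minimal sketch of timing repeated decode-sized launches from Python with CUDA events (the matmul is only a stand-in for a real decode-step kernel):

```python
# Rough per-step latency measurement at batch=1; CUDA events bracket the
# loop so Python launch overhead is included in the measured time.
import torch

steps = 100
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

x = torch.randn(1, 1, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# Warm up so compilation/allocator effects don't skew the numbers.
for _ in range(10):
    x @ w
torch.cuda.synchronize()

start.record()
for _ in range(steps):
    x @ w  # stand-in for one decode-step kernel launch
end.record()
torch.cuda.synchronize()
print(f"avg per step: {start.elapsed_time(end) / steps:.3f} ms")
```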
