[e2e test] port flash attention from sglang #3012

Open
Dewei-Wang-sh opened this issue Dec 16, 2024 · 2 comments
Assignees
Dewei-Wang-sh
Labels
enhancement (New feature or request), tests: e2e

Comments

@Dewei-Wang-sh
Contributor

No description provided.

@Dewei-Wang-sh Dewei-Wang-sh self-assigned this Dec 16, 2024
@Dewei-Wang-sh Dewei-Wang-sh linked a pull request Dec 16, 2024 that will close this issue
@vlad-penkin vlad-penkin added this to the 4.2 [Performance] E2E milestone Dec 16, 2024
@vlad-penkin
Contributor

@Dewei-Wang-sh as discussed offline, please provide more details for this issue.

@vlad-penkin vlad-penkin added the enhancement New feature or request label Dec 18, 2024
@Dewei-Wang-sh
Contributor Author

Current status:
After rewriting the kernel to use block pointers and enabling a subset of the sglang flash attention functionality, end-to-end Llama3-8B can run (a sketch of what the block-pointer rewrite looks like is included below).

Possible support plans:

  1. Get good performance without manual code rewriting (having the compiler perform the rewriting is one option).
  2. Support the full set of the flash attention functionality for this case.

Once we have decided what to do next, we can revisit this issue.
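For reference, here is a minimal sketch of the kind of block-pointer rewrite involved. The helper names, argument names, and tile shapes below are hypothetical and only illustrate the pattern; they are not taken from the sglang kernels or from the ported code. The idea is to replace element-wise pointer arithmetic plus a hand-written mask with `tl.make_block_ptr` and a boundary-checked `tl.load`, so the whole tile access is visible to the compiler. Both functions are device-side helpers that would be called from inside an attention kernel:

```python
import triton
import triton.language as tl


@triton.jit
def load_k_tile(K, stride_kn, stride_kd, start_n, seq_len,
                BLOCK_N: tl.constexpr, HEAD_DIM: tl.constexpr):
    # Original style: element-wise pointer arithmetic with an explicit mask.
    offs_n = start_n + tl.arange(0, BLOCK_N)
    offs_d = tl.arange(0, HEAD_DIM)
    k_ptrs = K + offs_n[:, None] * stride_kn + offs_d[None, :] * stride_kd
    return tl.load(k_ptrs, mask=offs_n[:, None] < seq_len, other=0.0)


@triton.jit
def load_k_tile_block_ptr(K, stride_kn, stride_kd, start_n, seq_len,
                          BLOCK_N: tl.constexpr, HEAD_DIM: tl.constexpr):
    # Block-pointer style: describe the full (seq_len, HEAD_DIM) tensor once,
    # then load one (BLOCK_N, HEAD_DIM) tile with a boundary check on the
    # sequence dimension instead of a hand-written mask.
    k_block_ptr = tl.make_block_ptr(
        base=K,
        shape=(seq_len, HEAD_DIM),
        strides=(stride_kn, stride_kd),
        offsets=(start_n, 0),
        block_shape=(BLOCK_N, HEAD_DIM),
        order=(1, 0),
    )
    return tl.load(k_block_ptr, boundary_check=(0,), padding_option="zero")
```

If plan 1 is chosen, the compiler would effectively be expected to derive the second form from the first automatically, instead of us rewriting the sglang kernels by hand.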
