[e2e test] port flash attention from sglang #3012

Open
Dewei-Wang-sh opened this issue Dec 16, 2024 · 2 comments
Assignees
Dewei-Wang-sh
Labels
enhancement (New feature or request), tests: e2e

Comments

@Dewei-Wang-sh
Contributor

No description provided.

@Dewei-Wang-sh Dewei-Wang-sh self-assigned this Dec 16, 2024
@Dewei-Wang-sh Dewei-Wang-sh linked a pull request Dec 16, 2024 that will close this issue
@vlad-penkin vlad-penkin added this to the 4.2 [Performance] E2E milestone Dec 16, 2024
@vlad-penkin
Contributor

@Dewei-Wang-sh as discussed offline, please provide more details for this issue.

@vlad-penkin vlad-penkin added the enhancement New feature or request label Dec 18, 2024
@Dewei-Wang-sh
Contributor Author

Current status:
After rewriting the kernel to use block pointers and enabling a subset of the sglang flash attention functionality, end-to-end Llama3-8B can run (a sketch of what the block-pointer rewrite looks like is included below).

Possible support plans:

  1. Get good performance without manual code rewriting (having the compiler perform the rewriting is one option).
  2. Support the full set of the flash attention functionality for this case.

Once we have decided what to do next, we can revisit this issue.
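For reference, here is a minimal sketch of the kind of block-pointer rewrite involved. The helper names, argument names, and tile shapes below are hypothetical and only illustrate the pattern; they are not taken from the sglang kernels or from the ported code. The idea is to replace element-wise pointer arithmetic plus a hand-written mask with `tl.make_block_ptr` and a boundary-checked `tl.load`, so the whole tile access is visible to the compiler. Both functions are device-side helpers that would be called from inside an attention kernel:

```python
import triton
import triton.language as tl


@triton.jit
def load_k_tile(K, stride_kn, stride_kd, start_n, seq_len,
                BLOCK_N: tl.constexpr, HEAD_DIM: tl.constexpr):
    # Original style: element-wise pointer arithmetic with an explicit mask.
    offs_n = start_n + tl.arange(0, BLOCK_N)
    offs_d = tl.arange(0, HEAD_DIM)
    k_ptrs = K + offs_n[:, None] * stride_kn + offs_d[None, :] * stride_kd
    return tl.load(k_ptrs, mask=offs_n[:, None] < seq_len, other=0.0)


@triton.jit
def load_k_tile_block_ptr(K, stride_kn, stride_kd, start_n, seq_len,
                          BLOCK_N: tl.constexpr, HEAD_DIM: tl.constexpr):
    # Block-pointer style: describe the full (seq_len, HEAD_DIM) tensor once,
    # then load one (BLOCK_N, HEAD_DIM) tile with a boundary check on the
    # sequence dimension instead of a hand-written mask.
    k_block_ptr = tl.make_block_ptr(
        base=K,
        shape=(seq_len, HEAD_DIM),
        strides=(stride_kn, stride_kd),
        offsets=(start_n, 0),
        block_shape=(BLOCK_N, HEAD_DIM),
        order=(1, 0),
    )
    return tl.load(k_block_ptr, boundary_check=(0,), padding_option="zero")
```

If plan 1 is chosen, the compiler would effectively be expected to derive the second form from the first automatically, instead of us rewriting the sglang kernels by hand.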
