add triton mask kernel implementations #100

Merged
merged 4 commits into from Nov 27, 2024
Conversation

@yzh119 (Member) commented Nov 26, 2024

This PR adds a Triton implementation of the mask kernels, because Triton kernels are easier and friendlier to maintain.

This is just a proof of concept; I haven't tuned performance yet and will leave that for future work.
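For context, here is a minimal sketch of what an in-place token-bitmask kernel in Triton can look like. The function and parameter names (`apply_token_bitmask_inplace_kernel`, `bitmask_size`, `BLOCK_SIZE`) and the bit-packing convention (one int32 word packs 32 tokens, bit = 1 means the token is allowed) are assumptions for illustration, not the exact code in this PR.

```python
# Hypothetical sketch of an in-place token-bitmask kernel in Triton.
# Names and the bitmask layout are assumed; see the actual kernel in this PR.
import torch
import triton
import triton.language as tl


@triton.jit
def apply_token_bitmask_inplace_kernel(
    logits_ptr,    # float logits, shape (batch, vocab_size), modified in place
    bitmask_ptr,   # int32 bitmask, shape (batch, ceil(vocab_size / 32))
    vocab_size,
    bitmask_size,
    BLOCK_SIZE: tl.constexpr,
):
    batch_id = tl.program_id(0)
    block_id = tl.program_id(1)
    offsets = block_id * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    vocab_mask = offsets < vocab_size

    # Each int32 word packs 32 tokens; extract the bit for each token position.
    packed = tl.load(
        bitmask_ptr + batch_id * bitmask_size + offsets // 32,
        mask=vocab_mask,
        other=0,
    )
    bit = (packed >> (offsets % 32)) & 1

    # Load the logits, overwrite disallowed tokens with -inf, and store back.
    logits = tl.load(logits_ptr + batch_id * vocab_size + offsets, mask=vocab_mask)
    logits = tl.where(bit == 1, logits, -float("inf"))
    tl.store(logits_ptr + batch_id * vocab_size + offsets, logits, mask=vocab_mask)


def apply_token_bitmask_inplace(logits: torch.Tensor, bitmask: torch.Tensor, block_size: int = 1024):
    batch, vocab_size = logits.shape
    grid = (batch, triton.cdiv(vocab_size, block_size))
    apply_token_bitmask_inplace_kernel[grid](
        logits, bitmask, vocab_size, bitmask.shape[1], BLOCK_SIZE=block_size
    )
```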

cc @Ubospica @MasterJH5574

@Ubospica (Collaborator) commented Nov 27, 2024

Performance results (RTX 4090, AMD 7950X), from `test_apply_token_bitmask_inplace_large`:

| Test parameters | Before (us) | After (us) |
| --- | --- | --- |
| True-1-128000-1024-1 | 5.86 | 7.29 |
| True-1-128000-120000-1 | 6.13 | 7.15 |
| True-64-128000-1024-1 | 21.36 | 79.61 |
| True-64-128000-120000-1 | 62.03 | 79.67 |
| True-64-128000-1024-4 | 18.70 | 36.30 |
| True-64-128000-120000-4 | 31.38 | 36.52 |

@Ubospica (Collaborator)
We can merge it for now and improve the performance later. The Triton kernel significantly reduces the effort needed to stay compatible with various CUDA versions.

@Ubospica merged commit 6e73e1d into mlc-ai:main on Nov 27, 2024
1 check passed
@Ubospica (Collaborator)
Thanks @yzh119 !

@yzh119 (Member, Author) commented Nov 27, 2024

As we discussed earlier, one optimization is to remove the `tl.load(logits_ptr ...)` and instead only store `-inf` at the masked positions:

        tl.store(logits_ptr + batch_id * vocab_size + offsets, -float("inf"), vocab_mask & bitmask)
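A hedged sketch of what that store-only variant could look like as a full kernel; the names mirror the snippet above and the bit convention (bit = 0 means the token is masked out) is an assumption, not the exact change:

```python
# Hypothetical store-only variant: skip loading the logits and write -inf
# directly to the positions whose bitmask bit marks them as disallowed.
import triton
import triton.language as tl


@triton.jit
def apply_token_bitmask_inplace_store_only_kernel(
    logits_ptr, bitmask_ptr, vocab_size, bitmask_size, BLOCK_SIZE: tl.constexpr
):
    batch_id = tl.program_id(0)
    block_id = tl.program_id(1)
    offsets = block_id * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    vocab_mask = offsets < vocab_size

    # Each int32 word packs 32 tokens; bit == 0 is assumed to mean "masked out".
    packed = tl.load(
        bitmask_ptr + batch_id * bitmask_size + offsets // 32,
        mask=vocab_mask,
        other=0,
    )
    bitmask = ((packed >> (offsets % 32)) & 1) == 0

    # No tl.load of the logits: store -inf only where the token is masked out.
    tl.store(
        logits_ptr + batch_id * vocab_size + offsets,
        -float("inf"),
        mask=vocab_mask & bitmask,
    )
```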
