add triton mask kernel implementations #100

Merged
merged 4 commits into from Nov 27, 2024
Conversation

@yzh119 (Member) commented Nov 26, 2024

This PR adds a Triton implementation of the mask kernels, because Triton kernels are easier and friendlier to maintain.

This is just a proof of concept; I haven't tuned performance yet and will leave that for future work.
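For context, here is a minimal sketch of what an in-place token-bitmask kernel in Triton can look like. The function and parameter names (`apply_token_bitmask_inplace_kernel`, `bitmask_size`, `BLOCK_SIZE`) and the bit-packing convention (one int32 word packs 32 tokens, bit = 1 means the token is allowed) are assumptions for illustration, not the exact code in this PR.

```python
# Hypothetical sketch of an in-place token-bitmask kernel in Triton.
# Names and the bitmask layout are assumed; see the actual kernel in this PR.
import torch
import triton
import triton.language as tl


@triton.jit
def apply_token_bitmask_inplace_kernel(
    logits_ptr,    # float logits, shape (batch, vocab_size), modified in place
    bitmask_ptr,   # int32 bitmask, shape (batch, ceil(vocab_size / 32))
    vocab_size,
    bitmask_size,
    BLOCK_SIZE: tl.constexpr,
):
    batch_id = tl.program_id(0)
    block_id = tl.program_id(1)
    offsets = block_id * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    vocab_mask = offsets < vocab_size

    # Each int32 word packs 32 tokens; extract the bit for each token position.
    packed = tl.load(
        bitmask_ptr + batch_id * bitmask_size + offsets // 32,
        mask=vocab_mask,
        other=0,
    )
    bit = (packed >> (offsets % 32)) & 1

    # Load the logits, overwrite disallowed tokens with -inf, and store back.
    logits = tl.load(logits_ptr + batch_id * vocab_size + offsets, mask=vocab_mask)
    logits = tl.where(bit == 1, logits, -float("inf"))
    tl.store(logits_ptr + batch_id * vocab_size + offsets, logits, mask=vocab_mask)


def apply_token_bitmask_inplace(logits: torch.Tensor, bitmask: torch.Tensor, block_size: int = 1024):
    batch, vocab_size = logits.shape
    grid = (batch, triton.cdiv(vocab_size, block_size))
    apply_token_bitmask_inplace_kernel[grid](
        logits, bitmask, vocab_size, bitmask.shape[1], BLOCK_SIZE=block_size
    )
```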

cc @Ubospica @MasterJH5574

@Ubospica (Collaborator) commented Nov 27, 2024

Performance results (RTX 4090, AMD 7950X), from `test_apply_token_bitmask_inplace_large`:

| Test parameters | Before (us) | After (us) |
| --- | --- | --- |
| True-1-128000-1024-1 | 5.86 | 7.29 |
| True-1-128000-120000-1 | 6.13 | 7.15 |
| True-64-128000-1024-1 | 21.36 | 79.61 |
| True-64-128000-120000-1 | 62.03 | 79.67 |
| True-64-128000-1024-4 | 18.70 | 36.30 |
| True-64-128000-120000-4 | 31.38 | 36.52 |

@Ubospica (Collaborator)
We can merge it for now and improve the performance later. The Triton kernel significantly reduces the effort needed to stay compatible with various CUDA versions.

@Ubospica merged commit 6e73e1d into mlc-ai:main on Nov 27, 2024
1 check passed
@Ubospica (Collaborator)
Thanks @yzh119 !

@yzh119 (Member, Author) commented Nov 27, 2024

As we discussed earlier, one optimization is to remove the `tl.load(logits_ptr ...)` and instead only store `-inf` at the masked positions:

        tl.store(logits_ptr + batch_id * vocab_size + offsets, -float("inf"), vocab_mask & bitmask)
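A hedged sketch of what that store-only variant could look like as a full kernel; the names mirror the snippet above and the bit convention (bit = 0 means the token is masked out) is an assumption, not the exact change:

```python
# Hypothetical store-only variant: skip loading the logits and write -inf
# directly to the positions whose bitmask bit marks them as disallowed.
import triton
import triton.language as tl


@triton.jit
def apply_token_bitmask_inplace_store_only_kernel(
    logits_ptr, bitmask_ptr, vocab_size, bitmask_size, BLOCK_SIZE: tl.constexpr
):
    batch_id = tl.program_id(0)
    block_id = tl.program_id(1)
    offsets = block_id * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    vocab_mask = offsets < vocab_size

    # Each int32 word packs 32 tokens; bit == 0 is assumed to mean "masked out".
    packed = tl.load(
        bitmask_ptr + batch_id * bitmask_size + offsets // 32,
        mask=vocab_mask,
        other=0,
    )
    bitmask = ((packed >> (offsets % 32)) & 1) == 0

    # No tl.load of the logits: store -inf only where the token is masked out.
    tl.store(
        logits_ptr + batch_id * vocab_size + offsets,
        -float("inf"),
        mask=vocab_mask & bitmask,
    )
```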
