Merge OpenAI Triton commit f8b5301
#3069
Merged
Conversation
This should enable Python 3.13 and 3.13t wheels and disable building Python 3.8 wheels. Related to: pytorch/pytorch#143654
Running the example given in the [autotune docstring](https://triton-lang.org/main/python-api/generated/triton.autotune.html) gives the following error:

```python
import triton
import torch

@triton.autotune(
    configs=[
        triton.Config(kwargs={'BLOCK_SIZE': 128}, num_warps=4),
        triton.Config(kwargs={'BLOCK_SIZE': 1024}, num_warps=8),
    ],
    key=['x_size']  # the two above configs will be evaluated anytime
                    # the value of x_size changes
)
@triton.jit
def kernel(x_ptr, x_size, **META):
    BLOCK_SIZE = META['BLOCK_SIZE']

if __name__ == '__main__':
    x = torch.ones(8, device="cuda")
    kernel[lambda _: (1,)](x, x.numel())
```

```
Traceback (most recent call last):
  File "...", line 18, in <module>
    kernel[lambda _: (1,)](x, x.size)
  File ".../triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File ".../triton/runtime/autotuner.py", line 156, in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
  File ".../triton/runtime/autotuner.py", line 156, in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
  File ".../triton/runtime/autotuner.py", line 133, in _bench
    return do_bench(kernel_call, warmup=self.num_warmups, rep=self.num_reps, quantiles=(0.5, 0.2, 0.8))
  File ".../triton/testing.py", line 106, in do_bench
    fn()
  File ".../triton/runtime/autotuner.py", line 114, in kernel_call
    self.fn.run(
  File ".../triton/runtime/jit.py", line 618, in run
    bound_args, sig_and_spec, constexpr_vals, non_constexpr_vals, excess_kwargs = self.binder(*args, **kwargs)
TypeError: dynamic_func() missing 1 required positional argument: 'META'
```

It seems the argument binder cannot handle the `**META` catch-all, so the keyword arguments must be specified explicitly. Also, `BLOCK_SIZE` should probably be marked as `tl.constexpr`:

```python
@triton.autotune(
    configs=[
        triton.Config(kwargs={"BLOCK_SIZE": 128}, num_warps=4),
        triton.Config(kwargs={"BLOCK_SIZE": 1024}, num_warps=8),
    ],
    key=["x_size"],  # the two above configs will be evaluated anytime
                     # the value of x_size changes
)
@triton.jit
def kernel(x_ptr, x_size, BLOCK_SIZE: tl.constexpr):
    ...
```
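For completeness, a minimal runnable sketch of the corrected example. The kernel body (a masked add-one) and the `cdiv`-based grid are illustrative assumptions, not part of the original docstring:

```python
import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config(kwargs={"BLOCK_SIZE": 128}, num_warps=4),
        triton.Config(kwargs={"BLOCK_SIZE": 1024}, num_warps=8),
    ],
    key=["x_size"],  # re-benchmark the configs whenever x_size changes
)
@triton.jit
def add_one_kernel(x_ptr, x_size, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < x_size  # guard against out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(x_ptr + offsets, x + 1, mask=mask)


if __name__ == "__main__":
    x = torch.ones(8, device="cuda")
    # enough programs to cover the tensor for whichever BLOCK_SIZE wins
    grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
    add_one_kernel[grid](x, x.numel())
    print(x)  # all twos
```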
Similarly, for the [heuristics](https://triton-lang.org/main/python-api/generated/triton.heuristics.html) example: first, the same `**META` issue applies; second, `args` is no longer a list of positional argument values but a dictionary from argument name to value; and third, `2 ** int(math.ceil(math.log2(args[1])))` is awkward and `triton.next_power_of_2(args['x_size'])` should be preferred.

```python
import torch
import triton

@triton.heuristics(values={'BLOCK_SIZE': lambda args: 2 ** int(math.ceil(math.log2(args[1])))})
@triton.jit
def kernel(x_ptr, x_size, **META):
    BLOCK_SIZE = META['BLOCK_SIZE']  # smallest power-of-two >= x_size

if __name__ == "__main__":
    x = torch.ones(8, device="cuda")
    kernel[lambda _: (1,)](x, x.numel())
```

```
Traceback (most recent call last):
  File "...", line 15, in <module>
    kernel[lambda _: (1,)](x, x.numel())
  File ".../triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File ".../triton/runtime/autotuner.py", line 337, in run
    kwargs[v] = heur({**dict(zip(self.arg_names, args)), **kwargs})
  File "...", line 7, in <lambda>
    @triton.heuristics(values={'BLOCK_SIZE': lambda args: 2 ** int(math.ceil(math.log2(args[1])))})
KeyError: 1
```

Applying the suggested changes results in

```python
# smallest power-of-two >= x_size
@triton.heuristics(values={'BLOCK_SIZE': lambda args: triton.next_power_of_2(args['x_size'])})
@triton.jit
def kernel(x_ptr, x_size, BLOCK_SIZE: tl.constexpr):
    ...
```
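Likewise, a minimal runnable sketch of the corrected heuristics example; the kernel body is an illustrative assumption:

```python
import torch
import triton
import triton.language as tl


# BLOCK_SIZE is derived from the runtime arguments by name:
# the smallest power of two >= x_size.
@triton.heuristics(values={"BLOCK_SIZE": lambda args: triton.next_power_of_2(args["x_size"])})
@triton.jit
def add_one_kernel(x_ptr, x_size, BLOCK_SIZE: tl.constexpr):
    offsets = tl.arange(0, BLOCK_SIZE)
    mask = offsets < x_size  # BLOCK_SIZE may exceed x_size
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(x_ptr + offsets, x + 1, mask=mask)


if __name__ == "__main__":
    x = torch.ones(8, device="cuda")
    # a single program suffices, since BLOCK_SIZE >= x_size by construction
    add_one_kernel[lambda _: (1,)](x, x.numel())
    print(x)  # all twos
```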
Fixes #5484. Since Python objects can arbitrarily override `__contains__`, using `inspect.ismodule` seems to be the most general solution, beyond numpy arrays. Overriding a module's `__contains__` would be very strange.

```python
>>> import triton.language as tl
>>> import inspect
>>> inspect.ismodule(tl)
True
>>> inspect.ismodule(tl.core)
True
```
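To illustrate why a membership test is fragile while a type check is not, a small sketch; the `Hijacked` class is hypothetical:

```python
import inspect


class Hijacked:
    """Hypothetical object whose membership test lies."""

    def __contains__(self, item):
        return True  # `x in obj` now succeeds for anything


obj = Hijacked()
print("anything" in obj)          # True, for any item at all
print(inspect.ismodule(obj))      # False: the type check cannot be fooled
print(inspect.ismodule(inspect))  # True only for real modules
```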
I accidentally deleted a newline in triton-lang/triton#5487, which caused the [website](https://triton-lang.org/main/python-api/generated/triton.heuristics.html) to lose the code block... oops. Generated the docs locally and can confirm the syntax is now correct.

![image](https://github.com/user-attachments/assets/0856b672-6951-4064-8c47-a0839f7b4a98)

# New contributor declaration
- [ ] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because this is a documentation change.
- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)
Also creates the `getThreadsPerWarpForOperand` interface. This PR doesn't completely remove the decompose method, because `getThreadsPerWarpForOperand` hasn't been implemented yet for some AMD-specific encodings.
Signed-off-by: Whitney Tsang <[email protected]>
whitneywhtsang changed the title Merge OpenAI Triton commit 755d416 to Merge OpenAI Triton commit f8b5301 on Dec 26, 2024
anmyachev approved these changes on Dec 26, 2024
This PR changes the Triton base from 755d416 to f8b5301 (Dec 25).
Pass rate: 99.03%
Please do not squash and merge this PR.