diff --git a/.github/README.md b/.github/README.md
new file mode 100644
index 0000000000..0fc64b2764
--- /dev/null
+++ b/.github/README.md
@@ -0,0 +1,324 @@
+[![Build and test](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/build-test.yml/badge.svg?branch=main)](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/build-test.yml)
+[![Triton wheels](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/nightly-wheels.yml/badge.svg?branch=main)](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/nightly-wheels.yml)
+
+# Intel® XPU Backend for Triton\*
+
+This is the development repository of Intel® XPU Backend for Triton\*, a new [Triton](https://github.com/triton-lang/triton) backend for Intel GPUs.
+Intel® XPU Backend for Triton\* is an out-of-tree backend module for [Triton](https://github.com/triton-lang/triton) that provides best-in-class performance and productivity on Intel GPUs, both for [PyTorch](https://github.com/pytorch/pytorch) and for standalone usage.
+
+# Compatibility
+
+* Operating systems:
+  * [Ubuntu 22.04](http://releases.ubuntu.com/22.04)
+* GPU Cards:
+  * [Intel® Data Center GPU Max Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html)
+  * [Intel® Data Center Flex Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/flex-series.html)
+  * [Intel Arc A770](https://www.intel.com/content/www/us/en/products/sku/229151/intel-arc-a770-graphics-16gb/specifications.html)
+* GPU Drivers:
+  * Latest [Long Term Support (LTS) Release](https://dgpu-docs.intel.com/driver/installation.html)
+  * Latest [Rolling Release](https://dgpu-docs.intel.com/driver/installation-rolling.html)
+* Toolchain:
+  * Latest [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html)
+
+Note that Intel® XPU Backend for Triton\* is not compatible with Intel® Extension for PyTorch\* and Intel® oneAPI Base Toolkit\*.
+
+# Quick Installation
+
+## Prerequisites
+
+1. Latest [Rolling Release](https://dgpu-docs.intel.com/driver/installation-rolling.html) or [Long Term Support Release](https://dgpu-docs.intel.com/driver/installation.html) of GPU driver
+2. Latest release of [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html)
+3. Latest release of [Profiling Tools Interfaces for Intel GPU (PTI for GPU)](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html)
+
+## Install PyTorch and Triton from nightly wheels
+
+Currently, Intel® XPU Backend for Triton\* requires a special version of PyTorch; both can be installed from nightly wheels.
+Navigate to the [nightly wheels workflow](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/nightly-wheels.yml),
+select the most recent successful run at the top of the page, and download the artifact for the corresponding Python version.
+Extract the archive and, in the extracted directory, execute:
+
+```shell
+pip install torch-*.whl triton-*.whl
+```
+
+Before using Intel® XPU Backend for Triton\* you need to initialize the toolchain.
+The default location is `/opt/intel/oneapi` (if installed as the `root` user) or `~/intel/oneapi` (if installed as a regular user).
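+
+Once the toolchain has been initialized with the `setvars.sh` command shown below, a quick sanity check can confirm that the wheels were installed correctly. This is a minimal sketch, not part of the official instructions; it assumes the special PyTorch build exposes a `torch.xpu` device API analogous to `torch.cuda`:
+
+```python
+# Hypothetical post-install check: confirm the wheels import and an XPU device is visible.
+import torch
+import triton
+
+print(torch.__version__, triton.__version__)
+print(torch.xpu.is_available())        # expected to print True on a working setup
+print(torch.xpu.get_device_name(0))    # name of the first Intel GPU, if available
+```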
+ +```shell +# replace /opt/intel/oneapi with the actual location of PyTorch Prerequisites for Intel GPUs +source /opt/intel/oneapi/setvars.sh +``` + +# Install from source + +## Prerequisites + +1. Latest [Rolling Release](https://dgpu-docs.intel.com/driver/installation-rolling.html) or [Long Term Support Release](https://dgpu-docs.intel.com/driver/installation.html) of GPU driver +2. Latest release of [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html) +3. Latest release of [Profiling Tools Interfaces for Intel GPU (PTI for GPU)](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html) + +## Compile PyTorch and Triton from source + +Currently, Intel® XPU Backend for Triton\* requires a special version of PyTorch and both need to be compiled at the same time. + +Before compiling PyTorch and Intel® XPU Backend for Triton\* you need to initialize the toolchain. +The default location is `/opt/intel/oneapi` (if installed as a `root` user) or `~/intel/oneapi` (if installed as a regular user). + +```shell +# replace /opt/intel/oneapi with the actual location of PyTorch Prerequisites for Intel GPUs +source /opt/intel/oneapi/setvars.sh +``` + +Clone this repository: + +```shell +git clone https://github.com/intel/intel-xpu-backend-for-triton.git +cd intel-xpu-backend-for-triton +``` + +To avoid potential conflicts with installed packages it is recommended to create and activate a new Python virtual environment: + +```shell +python -m venv .venv --prompt triton +source .venv/bin/activate +``` + +Compile and install PyTorch: + +```shell +scripts/install-pytorch.sh --source +``` + +Compile and install Intel® XPU Backend for Triton\*: + +```shell +scripts/compile-triton.sh +``` + +# Building with a custom LLVM + +Triton uses LLVM to generate code for GPUs and CPUs. Normally, the Triton build +downloads a prebuilt LLVM, but you can also build LLVM from source and use that. + +LLVM does not have a stable API, so the Triton build will not work at an +arbitrary LLVM version. + +1. Find the version of LLVM that Triton builds against. +Check `cmake/llvm-hash.txt` to see the current version. + +2. Checkout LLVM at this revision to the directory `llvm`, +which must be in the same directory as `intel-xpu-backend-for-triton`: + +3. In the directory `intel-xpu-backend-for-triton`, build Triton with custom LLVM: + + ```shell + ./scripts/compile-triton.sh --llvm --triton + ``` + +# Tips for building + +- Set `TRITON_BUILD_WITH_CLANG_LLD=true` as an environment variable to use clang + and lld. lld in particular results in faster builds. + +- Set `TRITON_BUILD_WITH_CCACHE=true` to build with ccache. + +- Set `TRITON_HOME=/some/path` to change the location of the `.triton` + directory where Triton's cache is located and downloads are stored + during the build. By default, this is the user's home directory. It + can be changed anytime. + +- Pass `--no-build-isolation` to `pip install` to make nop builds faster. + Without this, every invocation of `pip install` uses a different symlink to + cmake, and this forces ninja to rebuild most of the `.a` files. + +- VSCcode IntelliSense has some difficulty figuring out how to build Triton's C++ + (probably because, in our build, users don't invoke cmake directly, but + instead use setup.py). Teach vscode how to compile Triton as follows. + + - Do a local build. 
Run command `pip install -e python` + - Get the full path to the `compile_commands.json` file produced by the build: + `find python/build -name 'compile_commands.json' | xargs readlink -f`. + You might get a full path similar to `/Users/{username}/triton/python/build/cmake.macosx-11.1-arm64-cpython-3.12/compile_commands.json` + - In vscode, install the + [C/C++ + extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode.cpptools), + then open the command palette (`Shift + Command + P` on Mac, or `Shift + + Ctrl + P` on Windows/Linux) and open `C/C++: Edit Configurations (UI)`. + - Open "Advanced Settings" and paste the full path to + `compile_commands.json` into the "Compile Commands" textbox. + +# Running tests + +There currently isn't a turnkey way to run all the Triton tests, but you can +follow the following recipe. + +```shell +scripts/test-triton.sh +``` + +# Tips for hacking + +For detailed instructions on how to debug Triton's frontend, please refer to this [tutorial](https://triton-lang.org/main/programming-guide/chapter-3/debugging.html). The following includes additional tips for hacking on Triton's backend. + +**Helpful environment variables** + +- `MLIR_ENABLE_DUMP=1` dumps the IR before every MLIR pass Triton runs, for all + kernels. Use `MLIR_ENABLE_DUMP=kernelName` to dump for a specific kernel only. + - Triton cache can interfere with the dump. In cases where `MLIR_ENABLE_DUMP=1` does not work, try cleaning your triton cache: `rm -r ~/.triton/cache/*` +- `LLVM_IR_ENABLE_DUMP=1` dumps the IR before every pass run over the LLVM IR. +- `TRITON_INTERPRET=1` uses the Triton interpreter instead of running on the + GPU. You can insert Python breakpoints in your kernel code! +- `TRITON_ENABLE_LLVM_DEBUG=1` passes `-debug` to LLVM, printing a lot of + debugging information to stdout. If this is too noisy, run with just + `TRITON_LLVM_DEBUG_ONLY` instead to limit the output. + + An alternative way to reduce output noisiness is running with + `LLVM_IR_ENABLE_DUMP=1`, extract the IR before the LLVM pass of interest, and + then run LLVM's `opt` standalone, perhaps passing `-debug-only=foo` on the + command line. +- `TRITON_LLVM_DEBUG_ONLY=` is the equivalent of LLVM's + `-debug-only` command-line option. This limits the LLVM debug output to + specific pass or component names (which are specified using `#define + DEBUG_TYPE` throughout LLVM and Triton) in order to allow the debug output to + be less noisy. `TRITON_LLVM_DEBUG_ONLY` allows for one or more comma + separated values to be specified (eg + `TRITON_LLVM_DEBUG_ONLY="tritongpu-remove-layout-conversions` or + `TRITON_LLVM_DEBUG_ONLY="tritongpu-remove-layout-conversions,regalloc"`). +- `USE_IR_LOC={ttir,ttgir}` reparses the IR such that the location information + will be the line number of the IR file with that particular extension, + instead of line number of the python file. This can provide a direct mapping + from the IR to llir/ptx. When used with performance tools, it can provide a + breakdown on IR instructions. +- `TRITON_PRINT_AUTOTUNING=1` prints out the best autotuning config and total time + spent for each kernel after autotuning is complete. +- `DISABLE_LLVM_OPT` will disable llvm optimizations for make_llir and make_ptx + if its value is true when parsing as Bool. Otherwise, it will be parsed as a list + of flags to disable llvm optimizations. 
One usage case is
+  `DISABLE_LLVM_OPT="disable-lsr"`.
+  Loop strength reduction is known to cause up to 10% performance changes for
+  certain kernels with register pressure.
+- `TRITON_ALWAYS_COMPILE=1` forces kernels to be compiled regardless of cache hits.
+- `MLIR_ENABLE_TIMING` dumps the timing information for each MLIR pass.
+- `LLVM_ENABLE_TIMING` dumps the timing information for each LLVM pass.
+- `TRITON_DEFAULT_FP_FUSION` overrides the default behavior of allowing fp fusion (mul+add->fma).
+- `MLIR_ENABLE_REMARK` enables the performance warnings that are emitted as remarks.
+
+# Usage Guide
+
+## Code Modifications
+Intel® XPU Backend for Triton\* requires a special version of PyTorch that can be built from sources or installed from nightly wheels.
+
+1. Add `import torch` for XPU support.
+2. Put tensors and models on the XPU by calling `to('xpu')`.
+
+This repository contains modified [tutorials](https://github.com/intel/intel-xpu-backend-for-triton/tree/main/python/tutorials) that must be used with Intel® XPU Backend for Triton\*.
+
+The following examples show the required modifications to user code.
+
+### Example 1 : Triton Kernel
+
+This example is a modified version of the [Vector Add](https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html#vector-addition) Triton kernel. Please refer to [Vector Add](https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html#vector-addition) for detailed comments and an illustration of the code semantics.
+
+Compared to the original tutorial, the code below makes the following modifications:
+
+```Python
+import torch
+import triton
+import triton.language as tl
+
+
+@triton.jit
+def add_kernel(
+    x_ptr,
+    y_ptr,
+    output_ptr,
+    n_elements,
+    BLOCK_SIZE: tl.constexpr,
+):
+    pid = tl.program_id(axis=0)
+    block_start = pid * BLOCK_SIZE
+    offsets = block_start + tl.arange(0, BLOCK_SIZE)
+    mask = offsets < n_elements
+    x = tl.load(x_ptr + offsets, mask=mask)
+    y = tl.load(y_ptr + offsets, mask=mask)
+    output = x + y
+    tl.store(output_ptr + offsets, output, mask=mask)
+
+def add(x: torch.Tensor, y: torch.Tensor):
+    # Put the output tensor on the XPU
+    output = torch.empty_like(x).xpu()
+    assert x.is_xpu and y.is_xpu and output.is_xpu
+    n_elements = output.numel()
+    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
+    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
+
+    return output
+
+# For manual_seed, use the XPU-specific API
+torch.xpu.manual_seed(0)
+size = 512
+# Tensors need to be placed on the XPU
+x = torch.rand(size, device='xpu')
+y = torch.rand(size, device='xpu')
+output_torch = x + y
+output_triton = add(x, y)
+print(output_torch)
+print(output_triton)
+print(
+    f'The maximum difference between torch and triton is '
+    f'{torch.max(torch.abs(output_torch - output_triton))}'
+)
+```
+
+### Example 2 : End-to-End Model
+Triton is transparent for end-to-end models: simply use `torch.compile`, which uses `inductor` as its default backend. Inductor automatically generates Triton kernels and the model benefits from them, as in the following example.
+ +```Python +import torch +from torch._dynamo.testing import rand_strided + +from torch.nn import * +class simpleModel(torch.nn.Module): + def __init__(self): + super().__init__() + # tensors inside model should be on xpu + self.y = rand_strided((32, 8), (8, 1), device='xpu:0', dtype=torch.float32) + + def forward(self, x): + z = x + self.y + return z + +# tensors passed to the model should be on xpu +x = rand_strided((32, 8), (8, 1), device='xpu:0', dtype=torch.float32) +xpu_model = simpleModel() +# Call torch.compile for optimization +optimized_mod = torch.compile(xpu_model) + +graph_result = optimized_mod(x) +``` + +## Performance Analysis Guide + +There are several ways of doing performance analysis. +We recommend using `torch.profiler` for end-to-end performance analysis and using Intel® VTune™ Profiler for more detailed kernel analysis. +Note that the user needs to explicitly set `TRITON_XPU_PROFILE=1` when the user needs to enable kernel profiling. + +```Bash +export TRITON_XPU_PROFILE=1 +``` + +# Contributing + +Community contributions are more than welcome, whether it be to fix bugs or to add new features at [github](https://github.com/intel/intel-xpu-backend-for-triton). For more detailed instructions, please visit our [contributor's guide](https://github.com/intel/intel-xpu-backend-for-triton/blob/main/CONTRIBUTING.md). + +## License + +_MIT License_. As found in [LICENSE](https://github.com/intel/intel-xpu-backend-for-triton/blob/main/LICENSE) file. + + +## Security + +See Intel's [Security Center](https://www.intel.com/content/www/us/en/security-center/default.html) +for information on how to report a potential security issue or vulnerability. + +See also: [Security Policy](https://github.com/intel/intel-xpu-backend-for-triton/blob/main/SECURITY.md). diff --git a/README.md b/README.md index a8bbe2c2e1..18e46403e7 100644 --- a/README.md +++ b/README.md @@ -1,98 +1,60 @@ -[![Build and test](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/build-test.yml/badge.svg?branch=main)](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/build-test.yml) -[![Triton wheels](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/nightly-wheels.yml/badge.svg?branch=main)](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/nightly-wheels.yml) -[![Conda test](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/conda-build-test.yml/badge.svg?branch=main)](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/conda-build-test.yml) +
+ Triton logo +
-# Intel® XPU Backend for Triton\* +The Triton Conference is happening again on September 17th, 2024 in Fremont (CA)! -This is the development repository of Intel® XPU Backend for Triton\*, a new [Triton](https://github.com/triton-lang/triton/) backend for Intel GPUs. Intel® XPU Backend for Triton\* is a out of tree backend module for [Triton](https://github.com/triton-lang/triton/blob/main/CONTRIBUTING.md) used to provide best-in-class performance and productivity on any Intel GPUs for [PyTorch](https://github.com/triton-lang/triton/blob/main/CONTRIBUTING.md) and standalone usage. +If you are interested in attending, please fill up [this form](https://docs.google.com/forms/d/e/1FAIpQLSecHC1lkalcm0h3JDUbspekDX5bmBvMxgVTLaK3e-61bzDDbg/viewform). -# Compatibility - -* Operating systems: - * [Ubuntu 22.04](http://releases.ubuntu.com/22.04) -* GPU Cards: - * [Intel® Data Center GPU Max Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html) - * [Intel® Data Center Flex Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/flex-series.html) - * [Intel Arc A770](https://www.intel.com/content/www/us/en/products/sku/229151/intel-arc-a770-graphics-16gb/specifications.html) -* GPU Drivers: - * Latest [Long Term Support (LTS) Release](https://dgpu-docs.intel.com/driver/installation.html) - * Latest [Rolling Release](https://dgpu-docs.intel.com/driver/installation-rolling.html) -* Toolchain: - * Latest [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html) +| **`Documentation`** | **`Nightly Wheels`** | +|-------------------- | -------------------- | +| [![Documentation](https://github.com/triton-lang/triton/actions/workflows/documentation.yml/badge.svg)](https://triton-lang.org/) | [![Wheels](https://github.com/triton-lang/triton/actions/workflows/wheels.yml/badge.svg?branch=release/2.0.x)](https://github.com/triton-lang/triton/actions/workflows/wheels.yml) | -Note that Intel® XPU Backend for Triton\* is not compatible with Intel® Extension for PyTorch\* and Intel® oneAPI Base Toolkit\*. - -# Quick Installation +# Triton -## Prerequisites +This is the development repository of Triton, a language and compiler for writing highly efficient custom Deep-Learning primitives. The aim of Triton is to provide an open-source environment to write fast code at higher productivity than CUDA, but also with higher flexibility than other existing DSLs. -1. Latest [Rolling Release](https://dgpu-docs.intel.com/driver/installation-rolling.html) or [Long Term Support Release](https://dgpu-docs.intel.com/driver/installation.html) of GPU driver -2. Latest release of [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html) -3. Latest release of [Profiling Tools Interfaces for Intel GPU (PTI for GPU)](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html) +The foundations of this project are described in the following MAPL2019 publication: [Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations](http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf). Please consider citing this work if you use Triton! 
-## Install PyTorch and Triton from nightly wheels +The [official documentation](https://triton-lang.org) contains installation instructions and tutorials. See also these third-party [Triton puzzles](https://github.com/srush/Triton-Puzzles), which can all be run using the Triton interpreter -- no GPU required. -Currently, Intel® XPU Backend for Triton\* requires a special version of PyTorch and both can be installed from nightly wheels. -Navigate to the [nightly wheels workflow](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/nightly-wheels.yml), -select the most recent successful run on the top of the page and download an artifact for the corresponding Python version. -Extract the archive and in the extracted directory execute: - -```shell -pip install torch-*.whl triton-*.whl -``` +# Quick Installation -Before using Intel® XPU Backend for Triton\* you need to initialize the toolchain. -The default location is `/opt/intel/oneapi` (if installed as a `root` user) or `~/intel/oneapi` (if installed as a regular user). +You can install the latest stable release of Triton from pip: ```shell -# replace /opt/intel/oneapi with the actual location of PyTorch Prerequisites for Intel GPUs -source /opt/intel/oneapi/setvars.sh +pip install triton ``` -# Install from source - -## Prerequisites - -1. Latest [Rolling Release](https://dgpu-docs.intel.com/driver/installation-rolling.html) or [Long Term Support Release](https://dgpu-docs.intel.com/driver/installation.html) of GPU driver -2. Latest release of [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html) -3. Latest release of [Profiling Tools Interfaces for Intel GPU (PTI for GPU)](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html) - -## Compile PyTorch and Triton from source - -Currently, Intel® XPU Backend for Triton\* requires a special version of PyTorch and both need to be compiled at the same time. +Binary wheels are available for CPython 3.8-3.12 and PyPy 3.8-3.9. -Before compiling PyTorch and Intel® XPU Backend for Triton\* you need to initialize the toolchain. -The default location is `/opt/intel/oneapi` (if installed as a `root` user) or `~/intel/oneapi` (if installed as a regular user). 
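+
+A quick way to confirm which build you ended up with (a hypothetical check, not part of the official instructions) is to print the installed version before moving on to the nightly option below:
+
+```python
+# Print the version of the Triton wheel that pip installed.
+import triton
+print(triton.__version__)
+```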
+And the latest nightly release: ```shell -# replace /opt/intel/oneapi with the actual location of PyTorch Prerequisites for Intel GPUs -source /opt/intel/oneapi/setvars.sh +pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly ``` -Clone this repository: +# Install from source ```shell -git clone https://github.com/intel/intel-xpu-backend-for-triton.git -cd intel-xpu-backend-for-triton -``` - -To avoid potential conflicts with installed packages it is recommended to create and activate a new Python virtual environment: +git clone https://github.com/triton-lang/triton.git; +cd triton; -```shell -python -m venv .venv --prompt triton -source .venv/bin/activate +pip install ninja cmake wheel pybind11; # build-time dependencies +pip install -e python ``` -Compile and install PyTorch: +Or with a virtualenv: ```shell -scripts/install-pytorch.sh --source -``` +git clone https://github.com/triton-lang/triton.git; +cd triton; -Compile and install Intel® XPU Backend for Triton\*: +python -m venv .venv --prompt triton; +source .venv/bin/activate; -```shell -scripts/compile-triton.sh +pip install ninja cmake wheel pybind11; # build-time dependencies +pip install -e python ``` # Building with a custom LLVM @@ -103,17 +65,36 @@ downloads a prebuilt LLVM, but you can also build LLVM from source and use that. LLVM does not have a stable API, so the Triton build will not work at an arbitrary LLVM version. -1. Find the version of LLVM that Triton builds against. -Check `cmake/llvm-hash.txt` to see the current version. +1. Find the version of LLVM that Triton builds against. Check +`cmake/llvm-hash.txt` to see the current version. For example, if it says: + 49af6502c6dcb4a7f7520178bd14df396f78240c -2. Checkout LLVM at this revision to the directory `llvm`, -which must be in the same directory as `intel-xpu-backend-for-triton`: + This means that the version of Triton you have builds against + [LLVM](https://github.com/llvm/llvm-project) 49af6502. -3. In the directory `intel-xpu-backend-for-triton`, build Triton with custom LLVM: +2. `git checkout` LLVM at this revision. Optionally, make additional + modifications to LLVM. - ```shell - ./scripts/compile-triton.sh --llvm --triton - ``` +3. [Build LLVM](https://llvm.org/docs/CMake.html). For example, you might run + + $ cd $HOME/llvm-project # your clone of LLVM. + $ mkdir build + $ cd build + $ cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON ../llvm -DLLVM_ENABLE_PROJECTS="mlir;llvm" -DLLVM_TARGETS_TO_BUILD="host;NVPTX;AMDGPU" + $ ninja + +4. Grab a snack, this will take a while. + +5. Build Triton as above, but set the following environment variables. + + # Modify as appropriate to point to your LLVM build. + $ export LLVM_BUILD_DIR=$HOME/llvm-project/build + + $ cd + $ LLVM_INCLUDE_DIRS=$LLVM_BUILD_DIR/include \ + LLVM_LIBRARY_DIR=$LLVM_BUILD_DIR/lib \ + LLVM_SYSPATH=$LLVM_BUILD_DIR \ + pip install -e python # Tips for building @@ -131,7 +112,7 @@ which must be in the same directory as `intel-xpu-backend-for-triton`: Without this, every invocation of `pip install` uses a different symlink to cmake, and this forces ninja to rebuild most of the `.a` files. -- VSCcode IntelliSense has some difficulty figuring out how to build Triton's C++ +- vscode intellisense has some difficulty figuring out how to build Triton's C++ (probably because, in our build, users don't invoke cmake directly, but instead use setup.py). Teach vscode how to compile Triton as follows. 
@@ -153,7 +134,36 @@ There currently isn't a turnkey way to run all the Triton tests, but you can follow the following recipe. ```shell -scripts/test-triton.sh +# One-time setup. Note we have to reinstall local Triton because torch +# overwrites it with the public version. +$ pip install scipy numpy torch pytest lit pandas matplotlib && pip install -e python + +# Run Python tests using your local GPU. +$ python3 -m pytest python/test/unit + +# Move to builddir. Fill in <...> with the full path, e.g. +# `cmake.linux-x86_64-cpython-3.11`. +$ cd python/build/cmake<...> + +# Run C++ unit tests. +$ ctest -j32 + +# Run lit tests. +$ lit test +``` + +You may find it helpful to make a symlink to the builddir and tell your local +git to ignore it. + +```shell +$ ln -s python/build/cmake<...> build +$ echo build >> .git/info/exclude +``` + +Then you can e.g. rebuild and run lit with the following command. + +```shell +$ ninja -C build && ( cd build ; lit test ) ``` # Tips for hacking @@ -203,124 +213,27 @@ For detailed instructions on how to debug Triton's frontend, please refer to thi - `TRITON_DEFAULT_FP_FUSION` overrides the default behavior of allowing fp fusion (mul+add->fma). - `MLIR_ENABLE_REMARK` enables the performance warnings that are emitted as remarks. -# Usage Guide - -## Code Modifications -Intel® XPU Backend for Triton\* requires a special version of PyTorch that can be built from sources or installed from nightly wheels. - -1. Add `import torch` for xpu support. -2. Put the tensor and models to XPU by calling `to('xpu')`. - -This repository contains modified [tutorials](python/tutorials) that must be used with Intel® XPU Backend for Triton\*. - -The following examples show modifications for the user code. - -### Example 1 : Triton Kernel - -This example is a modified version of [Vector Add](https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html#vector-addition) triton kernel. Please refer to [Vector Add](https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html#vector-addition) for detailed comments and illustration about the code semantics. - -Comparing to the original code, the following code modifies: - -```Python -import torch -import triton -import triton.language as tl - - -@triton.jit -def add_kernel( - x_ptr, - y_ptr, - output_ptr, - n_elements, - BLOCK_SIZE: tl.constexpr, -): - pid = tl.program_id(axis=0) - block_start = pid * BLOCK_SIZE - offsets = block_start + tl.arange(0, BLOCK_SIZE) - mask = offsets < n_elements - x = tl.load(x_ptr + offsets, mask=mask) - y = tl.load(y_ptr + offsets, mask=mask) - output = x + y - tl.store(output_ptr + offsets, output, mask=mask) - -def add(x: torch.Tensor, y: torch.Tensor): - # Put the tensor to xpu - output = torch.empty_like(x).xpu() - assert x.is_xpu and y.is_xpu and output.is_xpu - n_elements = output.numel() - grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),) - add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024) - - return output - -# For manual_seed, needs to use API for XPU -torch.xpu.manual_seed(0) -size = 512 -# For tensors, needs to be put on XPU -x = torch.rand(size, device='xpu') -y = torch.rand(size, device='xpu') -output_torch = x + y -output_triton = add(x, y) -print(output_torch) -print(output_triton) -print( - f'The maximum difference between torch and triton is ' - f'{torch.max(torch.abs(output_torch - output_triton))}' -) -``` - -### Example 2 : End-to-End Model -Triton is transparent for end-to-end models. 
One could easily use `torch.compile` with `inductor` as backend by default. It will automatically generates triton kernel and gets benefit from it. - -```Python -import torch -from torch._dynamo.testing import rand_strided - -from torch.nn import * -class simpleModel(torch.nn.Module): - def __init__(self): - super().__init__() - # tensors inside model should be on xpu - self.y = rand_strided((32, 8), (8, 1), device='xpu:0', dtype=torch.float32) - - def forward(self, x): - z = x + self.y - return z - -# tensors passed to the model should be on xpu -x = rand_strided((32, 8), (8, 1), device='xpu:0', dtype=torch.float32) -xpu_model = simpleModel() -# Call torch.compile for optimization -optimized_mod = torch.compile(xpu_model) - -graph_result = optimized_mod(x) -``` - -## Performance Analysis Guide - -There are several ways of doing performance analysis. We recommend using `torch.profiler` for end-to-end performance analysis and using Intel® VTune™ Profiler for more detailed kernel analysis. We provide a comprehensive guide for those two: -1. [end_to_end_tests#profiling settings](docs/test_docs/end_to_end_tests.md#profiling-settings) section for using `torch.profiler`. -2. [VTune Profiling Guide](docs/VTune_Profiling.md) for kernel analysis. +# Changelog -Note that the user needs to explicitly set `TRITON_XPU_PROFILE=1` when the user needs to enable kernel profiling. +Version 2.0 is out! New features include: -```Bash -export TRITON_XPU_PROFILE=1 -``` +- Many, many bug fixes +- Performance improvements +- Backend rewritten to use MLIR +- Support for kernels that contain back-to-back matmuls (e.g., flash attention) # Contributing -Community contributions are more than welcome, whether it be to fix bugs or to add new features at [github](https://github.com/intel/intel-xpu-backend-for-triton). For more detailed instructions, please visit our [contributor's guide](CONTRIBUTING.md). - -## License +Community contributions are more than welcome, whether it be to fix bugs or to add new features at [github](https://github.com/triton-lang/triton/). For more detailed instructions, please visit our [contributor's guide](CONTRIBUTING.md). -_MIT License_. As found in [LICENSE](https://github.com/intel/intel-xpu-backend-for-triton/blob/main/LICENSE) file. +# Compatibility +Supported Platforms: -## Security +- Linux -See Intel's [Security Center](https://www.intel.com/content/www/us/en/security-center/default.html) -for information on how to report a potential security issue or vulnerability. +Supported Hardware: -See also: [Security Policy](security.md) +- NVIDIA GPUs (Compute Capability 8.0+) +- AMD GPUs (ROCm 5.2+) +- Under development: CPUs