diff --git a/.github/README.md b/.github/README.md
new file mode 100644
index 0000000000..0fc64b2764
--- /dev/null
+++ b/.github/README.md
@@ -0,0 +1,324 @@
+[![Build and test](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/build-test.yml/badge.svg?branch=main)](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/build-test.yml)
+[![Triton wheels](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/nightly-wheels.yml/badge.svg?branch=main)](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/nightly-wheels.yml)
+
+# Intel® XPU Backend for Triton\*
+
+This is the development repository of Intel® XPU Backend for Triton\*, a new [Triton](https://github.com/triton-lang/triton) backend for Intel GPUs.
+Intel® XPU Backend for Triton\* is an out-of-tree backend module for [Triton](https://github.com/triton-lang/triton) that provides best-in-class performance and productivity on Intel GPUs, both for [PyTorch](https://github.com/pytorch/pytorch) and for standalone usage.
+
+# Compatibility
+
+* Operating systems:
+  * [Ubuntu 22.04](http://releases.ubuntu.com/22.04)
+* GPU Cards:
+  * [Intel® Data Center GPU Max Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html)
+  * [Intel® Data Center Flex Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/flex-series.html)
+  * [Intel Arc A770](https://www.intel.com/content/www/us/en/products/sku/229151/intel-arc-a770-graphics-16gb/specifications.html)
+* GPU Drivers:
+  * Latest [Long Term Support (LTS) Release](https://dgpu-docs.intel.com/driver/installation.html)
+  * Latest [Rolling Release](https://dgpu-docs.intel.com/driver/installation-rolling.html)
+* Toolchain:
+  * Latest [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html)
+
+Note that Intel® XPU Backend for Triton\* is not compatible with Intel® Extension for PyTorch\* and Intel® oneAPI Base Toolkit\*.
+
+# Quick Installation
+
+## Prerequisites
+
+1. Latest [Rolling Release](https://dgpu-docs.intel.com/driver/installation-rolling.html) or [Long Term Support Release](https://dgpu-docs.intel.com/driver/installation.html) of GPU driver
+2. Latest release of [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html)
+3. Latest release of [Profiling Tools Interfaces for Intel GPU (PTI for GPU)](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html)
+
+## Install PyTorch and Triton from nightly wheels
+
+Currently, Intel® XPU Backend for Triton\* requires a special version of PyTorch; both can be installed from nightly wheels.
+Navigate to the [nightly wheels workflow](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/nightly-wheels.yml),
+select the most recent successful run at the top of the page, and download the artifact for the corresponding Python version.
+Extract the archive and, in the extracted directory, execute:
+
+```shell
+pip install torch-*.whl triton-*.whl
+```
+
+Before using Intel® XPU Backend for Triton\* you need to initialize the toolchain.
+The default location is `/opt/intel/oneapi` (if installed as the `root` user) or `~/intel/oneapi` (if installed as a regular user).
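+
+Once the toolchain has been initialized with the `setvars.sh` command shown below, a quick sanity check can confirm that the wheels were installed correctly. This is a minimal sketch, not part of the official instructions; it assumes the special PyTorch build exposes a `torch.xpu` device API analogous to `torch.cuda`:
+
+```python
+# Hypothetical post-install check: confirm the wheels import and an XPU device is visible.
+import torch
+import triton
+
+print(torch.__version__, triton.__version__)
+print(torch.xpu.is_available())        # expected to print True on a working setup
+print(torch.xpu.get_device_name(0))    # name of the first Intel GPU, if available
+```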
+ +```shell +# replace /opt/intel/oneapi with the actual location of PyTorch Prerequisites for Intel GPUs +source /opt/intel/oneapi/setvars.sh +``` + +# Install from source + +## Prerequisites + +1. Latest [Rolling Release](https://dgpu-docs.intel.com/driver/installation-rolling.html) or [Long Term Support Release](https://dgpu-docs.intel.com/driver/installation.html) of GPU driver +2. Latest release of [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html) +3. Latest release of [Profiling Tools Interfaces for Intel GPU (PTI for GPU)](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html) + +## Compile PyTorch and Triton from source + +Currently, Intel® XPU Backend for Triton\* requires a special version of PyTorch and both need to be compiled at the same time. + +Before compiling PyTorch and Intel® XPU Backend for Triton\* you need to initialize the toolchain. +The default location is `/opt/intel/oneapi` (if installed as a `root` user) or `~/intel/oneapi` (if installed as a regular user). + +```shell +# replace /opt/intel/oneapi with the actual location of PyTorch Prerequisites for Intel GPUs +source /opt/intel/oneapi/setvars.sh +``` + +Clone this repository: + +```shell +git clone https://github.com/intel/intel-xpu-backend-for-triton.git +cd intel-xpu-backend-for-triton +``` + +To avoid potential conflicts with installed packages it is recommended to create and activate a new Python virtual environment: + +```shell +python -m venv .venv --prompt triton +source .venv/bin/activate +``` + +Compile and install PyTorch: + +```shell +scripts/install-pytorch.sh --source +``` + +Compile and install Intel® XPU Backend for Triton\*: + +```shell +scripts/compile-triton.sh +``` + +# Building with a custom LLVM + +Triton uses LLVM to generate code for GPUs and CPUs. Normally, the Triton build +downloads a prebuilt LLVM, but you can also build LLVM from source and use that. + +LLVM does not have a stable API, so the Triton build will not work at an +arbitrary LLVM version. + +1. Find the version of LLVM that Triton builds against. +Check `cmake/llvm-hash.txt` to see the current version. + +2. Checkout LLVM at this revision to the directory `llvm`, +which must be in the same directory as `intel-xpu-backend-for-triton`: + +3. In the directory `intel-xpu-backend-for-triton`, build Triton with custom LLVM: + + ```shell + ./scripts/compile-triton.sh --llvm --triton + ``` + +# Tips for building + +- Set `TRITON_BUILD_WITH_CLANG_LLD=true` as an environment variable to use clang + and lld. lld in particular results in faster builds. + +- Set `TRITON_BUILD_WITH_CCACHE=true` to build with ccache. + +- Set `TRITON_HOME=/some/path` to change the location of the `.triton` + directory where Triton's cache is located and downloads are stored + during the build. By default, this is the user's home directory. It + can be changed anytime. + +- Pass `--no-build-isolation` to `pip install` to make nop builds faster. + Without this, every invocation of `pip install` uses a different symlink to + cmake, and this forces ninja to rebuild most of the `.a` files. + +- VSCcode IntelliSense has some difficulty figuring out how to build Triton's C++ + (probably because, in our build, users don't invoke cmake directly, but + instead use setup.py). Teach vscode how to compile Triton as follows. + + - Do a local build. 
Run command `pip install -e python` + - Get the full path to the `compile_commands.json` file produced by the build: + `find python/build -name 'compile_commands.json' | xargs readlink -f`. + You might get a full path similar to `/Users/{username}/triton/python/build/cmake.macosx-11.1-arm64-cpython-3.12/compile_commands.json` + - In vscode, install the + [C/C++ + extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode.cpptools), + then open the command palette (`Shift + Command + P` on Mac, or `Shift + + Ctrl + P` on Windows/Linux) and open `C/C++: Edit Configurations (UI)`. + - Open "Advanced Settings" and paste the full path to + `compile_commands.json` into the "Compile Commands" textbox. + +# Running tests + +There currently isn't a turnkey way to run all the Triton tests, but you can +follow the following recipe. + +```shell +scripts/test-triton.sh +``` + +# Tips for hacking + +For detailed instructions on how to debug Triton's frontend, please refer to this [tutorial](https://triton-lang.org/main/programming-guide/chapter-3/debugging.html). The following includes additional tips for hacking on Triton's backend. + +**Helpful environment variables** + +- `MLIR_ENABLE_DUMP=1` dumps the IR before every MLIR pass Triton runs, for all + kernels. Use `MLIR_ENABLE_DUMP=kernelName` to dump for a specific kernel only. + - Triton cache can interfere with the dump. In cases where `MLIR_ENABLE_DUMP=1` does not work, try cleaning your triton cache: `rm -r ~/.triton/cache/*` +- `LLVM_IR_ENABLE_DUMP=1` dumps the IR before every pass run over the LLVM IR. +- `TRITON_INTERPRET=1` uses the Triton interpreter instead of running on the + GPU. You can insert Python breakpoints in your kernel code! +- `TRITON_ENABLE_LLVM_DEBUG=1` passes `-debug` to LLVM, printing a lot of + debugging information to stdout. If this is too noisy, run with just + `TRITON_LLVM_DEBUG_ONLY` instead to limit the output. + + An alternative way to reduce output noisiness is running with + `LLVM_IR_ENABLE_DUMP=1`, extract the IR before the LLVM pass of interest, and + then run LLVM's `opt` standalone, perhaps passing `-debug-only=foo` on the + command line. +- `TRITON_LLVM_DEBUG_ONLY=` is the equivalent of LLVM's + `-debug-only` command-line option. This limits the LLVM debug output to + specific pass or component names (which are specified using `#define + DEBUG_TYPE` throughout LLVM and Triton) in order to allow the debug output to + be less noisy. `TRITON_LLVM_DEBUG_ONLY` allows for one or more comma + separated values to be specified (eg + `TRITON_LLVM_DEBUG_ONLY="tritongpu-remove-layout-conversions` or + `TRITON_LLVM_DEBUG_ONLY="tritongpu-remove-layout-conversions,regalloc"`). +- `USE_IR_LOC={ttir,ttgir}` reparses the IR such that the location information + will be the line number of the IR file with that particular extension, + instead of line number of the python file. This can provide a direct mapping + from the IR to llir/ptx. When used with performance tools, it can provide a + breakdown on IR instructions. +- `TRITON_PRINT_AUTOTUNING=1` prints out the best autotuning config and total time + spent for each kernel after autotuning is complete. +- `DISABLE_LLVM_OPT` will disable llvm optimizations for make_llir and make_ptx + if its value is true when parsing as Bool. Otherwise, it will be parsed as a list + of flags to disable llvm optimizations. 
One usage case is
+  `DISABLE_LLVM_OPT="disable-lsr"`.
+  Loop strength reduction is known to cause up to 10% performance changes for
+  certain kernels with register pressure.
+- `TRITON_ALWAYS_COMPILE=1` forces kernels to be compiled regardless of cache hits.
+- `MLIR_ENABLE_TIMING` dumps the timing information for each MLIR pass.
+- `LLVM_ENABLE_TIMING` dumps the timing information for each LLVM pass.
+- `TRITON_DEFAULT_FP_FUSION` overrides the default behavior of allowing fp fusion (mul+add->fma).
+- `MLIR_ENABLE_REMARK` enables the performance warnings that are emitted as remarks.
+
+# Usage Guide
+
+## Code Modifications
+Intel® XPU Backend for Triton\* requires a special version of PyTorch that can be built from sources or installed from nightly wheels.
+
+1. Add `import torch` for XPU support.
+2. Put tensors and models on the XPU by calling `to('xpu')`.
+
+This repository contains modified [tutorials](https://github.com/intel/intel-xpu-backend-for-triton/tree/main/python/tutorials) that must be used with Intel® XPU Backend for Triton\*.
+
+The following examples show the required modifications to user code.
+
+### Example 1 : Triton Kernel
+
+This example is a modified version of the [Vector Add](https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html#vector-addition) Triton kernel. Please refer to [Vector Add](https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html#vector-addition) for detailed comments and an illustration of the code semantics.
+
+Compared to the original tutorial, the code below makes the following modifications:
+
+```Python
+import torch
+import triton
+import triton.language as tl
+
+
+@triton.jit
+def add_kernel(
+    x_ptr,
+    y_ptr,
+    output_ptr,
+    n_elements,
+    BLOCK_SIZE: tl.constexpr,
+):
+    pid = tl.program_id(axis=0)
+    block_start = pid * BLOCK_SIZE
+    offsets = block_start + tl.arange(0, BLOCK_SIZE)
+    mask = offsets < n_elements
+    x = tl.load(x_ptr + offsets, mask=mask)
+    y = tl.load(y_ptr + offsets, mask=mask)
+    output = x + y
+    tl.store(output_ptr + offsets, output, mask=mask)
+
+def add(x: torch.Tensor, y: torch.Tensor):
+    # Put the output tensor on the XPU
+    output = torch.empty_like(x).xpu()
+    assert x.is_xpu and y.is_xpu and output.is_xpu
+    n_elements = output.numel()
+    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
+    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
+
+    return output
+
+# For manual_seed, use the XPU-specific API
+torch.xpu.manual_seed(0)
+size = 512
+# Tensors need to be placed on the XPU
+x = torch.rand(size, device='xpu')
+y = torch.rand(size, device='xpu')
+output_torch = x + y
+output_triton = add(x, y)
+print(output_torch)
+print(output_triton)
+print(
+    f'The maximum difference between torch and triton is '
+    f'{torch.max(torch.abs(output_torch - output_triton))}'
+)
+```
+
+### Example 2 : End-to-End Model
+Triton is transparent for end-to-end models: simply use `torch.compile`, which uses `inductor` as its default backend. Inductor automatically generates Triton kernels and the model benefits from them, as in the following example.
+ +```Python +import torch +from torch._dynamo.testing import rand_strided + +from torch.nn import * +class simpleModel(torch.nn.Module): + def __init__(self): + super().__init__() + # tensors inside model should be on xpu + self.y = rand_strided((32, 8), (8, 1), device='xpu:0', dtype=torch.float32) + + def forward(self, x): + z = x + self.y + return z + +# tensors passed to the model should be on xpu +x = rand_strided((32, 8), (8, 1), device='xpu:0', dtype=torch.float32) +xpu_model = simpleModel() +# Call torch.compile for optimization +optimized_mod = torch.compile(xpu_model) + +graph_result = optimized_mod(x) +``` + +## Performance Analysis Guide + +There are several ways of doing performance analysis. +We recommend using `torch.profiler` for end-to-end performance analysis and using Intel® VTune™ Profiler for more detailed kernel analysis. +Note that the user needs to explicitly set `TRITON_XPU_PROFILE=1` when the user needs to enable kernel profiling. + +```Bash +export TRITON_XPU_PROFILE=1 +``` + +# Contributing + +Community contributions are more than welcome, whether it be to fix bugs or to add new features at [github](https://github.com/intel/intel-xpu-backend-for-triton). For more detailed instructions, please visit our [contributor's guide](https://github.com/intel/intel-xpu-backend-for-triton/blob/main/CONTRIBUTING.md). + +## License + +_MIT License_. As found in [LICENSE](https://github.com/intel/intel-xpu-backend-for-triton/blob/main/LICENSE) file. + + +## Security + +See Intel's [Security Center](https://www.intel.com/content/www/us/en/security-center/default.html) +for information on how to report a potential security issue or vulnerability. + +See also: [Security Policy](https://github.com/intel/intel-xpu-backend-for-triton/blob/main/SECURITY.md). diff --git a/README.md b/README.md index a8bbe2c2e1..18e46403e7 100644 --- a/README.md +++ b/README.md @@ -1,98 +1,60 @@ -[![Build and test](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/build-test.yml/badge.svg?branch=main)](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/build-test.yml) -[![Triton wheels](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/nightly-wheels.yml/badge.svg?branch=main)](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/nightly-wheels.yml) -[![Conda test](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/conda-build-test.yml/badge.svg?branch=main)](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/conda-build-test.yml) +
+ Triton logo +
-# Intel® XPU Backend for Triton\* +The Triton Conference is happening again on September 17th, 2024 in Fremont (CA)! -This is the development repository of Intel® XPU Backend for Triton\*, a new [Triton](https://github.com/triton-lang/triton/) backend for Intel GPUs. Intel® XPU Backend for Triton\* is a out of tree backend module for [Triton](https://github.com/triton-lang/triton/blob/main/CONTRIBUTING.md) used to provide best-in-class performance and productivity on any Intel GPUs for [PyTorch](https://github.com/triton-lang/triton/blob/main/CONTRIBUTING.md) and standalone usage. +If you are interested in attending, please fill up [this form](https://docs.google.com/forms/d/e/1FAIpQLSecHC1lkalcm0h3JDUbspekDX5bmBvMxgVTLaK3e-61bzDDbg/viewform). -# Compatibility - -* Operating systems: - * [Ubuntu 22.04](http://releases.ubuntu.com/22.04) -* GPU Cards: - * [Intel® Data Center GPU Max Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html) - * [Intel® Data Center Flex Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/flex-series.html) - * [Intel Arc A770](https://www.intel.com/content/www/us/en/products/sku/229151/intel-arc-a770-graphics-16gb/specifications.html) -* GPU Drivers: - * Latest [Long Term Support (LTS) Release](https://dgpu-docs.intel.com/driver/installation.html) - * Latest [Rolling Release](https://dgpu-docs.intel.com/driver/installation-rolling.html) -* Toolchain: - * Latest [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html) +| **`Documentation`** | **`Nightly Wheels`** | +|-------------------- | -------------------- | +| [![Documentation](https://github.com/triton-lang/triton/actions/workflows/documentation.yml/badge.svg)](https://triton-lang.org/) | [![Wheels](https://github.com/triton-lang/triton/actions/workflows/wheels.yml/badge.svg?branch=release/2.0.x)](https://github.com/triton-lang/triton/actions/workflows/wheels.yml) | -Note that Intel® XPU Backend for Triton\* is not compatible with Intel® Extension for PyTorch\* and Intel® oneAPI Base Toolkit\*. - -# Quick Installation +# Triton -## Prerequisites +This is the development repository of Triton, a language and compiler for writing highly efficient custom Deep-Learning primitives. The aim of Triton is to provide an open-source environment to write fast code at higher productivity than CUDA, but also with higher flexibility than other existing DSLs. -1. Latest [Rolling Release](https://dgpu-docs.intel.com/driver/installation-rolling.html) or [Long Term Support Release](https://dgpu-docs.intel.com/driver/installation.html) of GPU driver -2. Latest release of [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html) -3. Latest release of [Profiling Tools Interfaces for Intel GPU (PTI for GPU)](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html) +The foundations of this project are described in the following MAPL2019 publication: [Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations](http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf). Please consider citing this work if you use Triton! 
-## Install PyTorch and Triton from nightly wheels +The [official documentation](https://triton-lang.org) contains installation instructions and tutorials. See also these third-party [Triton puzzles](https://github.com/srush/Triton-Puzzles), which can all be run using the Triton interpreter -- no GPU required. -Currently, Intel® XPU Backend for Triton\* requires a special version of PyTorch and both can be installed from nightly wheels. -Navigate to the [nightly wheels workflow](https://github.com/intel/intel-xpu-backend-for-triton/actions/workflows/nightly-wheels.yml), -select the most recent successful run on the top of the page and download an artifact for the corresponding Python version. -Extract the archive and in the extracted directory execute: - -```shell -pip install torch-*.whl triton-*.whl -``` +# Quick Installation -Before using Intel® XPU Backend for Triton\* you need to initialize the toolchain. -The default location is `/opt/intel/oneapi` (if installed as a `root` user) or `~/intel/oneapi` (if installed as a regular user). +You can install the latest stable release of Triton from pip: ```shell -# replace /opt/intel/oneapi with the actual location of PyTorch Prerequisites for Intel GPUs -source /opt/intel/oneapi/setvars.sh +pip install triton ``` -# Install from source - -## Prerequisites - -1. Latest [Rolling Release](https://dgpu-docs.intel.com/driver/installation-rolling.html) or [Long Term Support Release](https://dgpu-docs.intel.com/driver/installation.html) of GPU driver -2. Latest release of [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html) -3. Latest release of [Profiling Tools Interfaces for Intel GPU (PTI for GPU)](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html) - -## Compile PyTorch and Triton from source - -Currently, Intel® XPU Backend for Triton\* requires a special version of PyTorch and both need to be compiled at the same time. +Binary wheels are available for CPython 3.8-3.12 and PyPy 3.8-3.9. -Before compiling PyTorch and Intel® XPU Backend for Triton\* you need to initialize the toolchain. -The default location is `/opt/intel/oneapi` (if installed as a `root` user) or `~/intel/oneapi` (if installed as a regular user). 
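+
+A quick way to confirm which build you ended up with (a hypothetical check, not part of the official instructions) is to print the installed version before moving on to the nightly option below:
+
+```python
+# Print the version of the Triton wheel that pip installed.
+import triton
+print(triton.__version__)
+```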
+And the latest nightly release: ```shell -# replace /opt/intel/oneapi with the actual location of PyTorch Prerequisites for Intel GPUs -source /opt/intel/oneapi/setvars.sh +pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly ``` -Clone this repository: +# Install from source ```shell -git clone https://github.com/intel/intel-xpu-backend-for-triton.git -cd intel-xpu-backend-for-triton -``` - -To avoid potential conflicts with installed packages it is recommended to create and activate a new Python virtual environment: +git clone https://github.com/triton-lang/triton.git; +cd triton; -```shell -python -m venv .venv --prompt triton -source .venv/bin/activate +pip install ninja cmake wheel pybind11; # build-time dependencies +pip install -e python ``` -Compile and install PyTorch: +Or with a virtualenv: ```shell -scripts/install-pytorch.sh --source -``` +git clone https://github.com/triton-lang/triton.git; +cd triton; -Compile and install Intel® XPU Backend for Triton\*: +python -m venv .venv --prompt triton; +source .venv/bin/activate; -```shell -scripts/compile-triton.sh +pip install ninja cmake wheel pybind11; # build-time dependencies +pip install -e python ``` # Building with a custom LLVM @@ -103,17 +65,36 @@ downloads a prebuilt LLVM, but you can also build LLVM from source and use that. LLVM does not have a stable API, so the Triton build will not work at an arbitrary LLVM version. -1. Find the version of LLVM that Triton builds against. -Check `cmake/llvm-hash.txt` to see the current version. +1. Find the version of LLVM that Triton builds against. Check +`cmake/llvm-hash.txt` to see the current version. For example, if it says: + 49af6502c6dcb4a7f7520178bd14df396f78240c -2. Checkout LLVM at this revision to the directory `llvm`, -which must be in the same directory as `intel-xpu-backend-for-triton`: + This means that the version of Triton you have builds against + [LLVM](https://github.com/llvm/llvm-project) 49af6502. -3. In the directory `intel-xpu-backend-for-triton`, build Triton with custom LLVM: +2. `git checkout` LLVM at this revision. Optionally, make additional + modifications to LLVM. - ```shell - ./scripts/compile-triton.sh --llvm --triton - ``` +3. [Build LLVM](https://llvm.org/docs/CMake.html). For example, you might run + + $ cd $HOME/llvm-project # your clone of LLVM. + $ mkdir build + $ cd build + $ cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON ../llvm -DLLVM_ENABLE_PROJECTS="mlir;llvm" -DLLVM_TARGETS_TO_BUILD="host;NVPTX;AMDGPU" + $ ninja + +4. Grab a snack, this will take a while. + +5. Build Triton as above, but set the following environment variables. + + # Modify as appropriate to point to your LLVM build. + $ export LLVM_BUILD_DIR=$HOME/llvm-project/build + + $ cd + $ LLVM_INCLUDE_DIRS=$LLVM_BUILD_DIR/include \ + LLVM_LIBRARY_DIR=$LLVM_BUILD_DIR/lib \ + LLVM_SYSPATH=$LLVM_BUILD_DIR \ + pip install -e python # Tips for building @@ -131,7 +112,7 @@ which must be in the same directory as `intel-xpu-backend-for-triton`: Without this, every invocation of `pip install` uses a different symlink to cmake, and this forces ninja to rebuild most of the `.a` files. -- VSCcode IntelliSense has some difficulty figuring out how to build Triton's C++ +- vscode intellisense has some difficulty figuring out how to build Triton's C++ (probably because, in our build, users don't invoke cmake directly, but instead use setup.py). Teach vscode how to compile Triton as follows. 
@@ -153,7 +134,36 @@ There currently isn't a turnkey way to run all the Triton tests, but you can follow the following recipe. ```shell -scripts/test-triton.sh +# One-time setup. Note we have to reinstall local Triton because torch +# overwrites it with the public version. +$ pip install scipy numpy torch pytest lit pandas matplotlib && pip install -e python + +# Run Python tests using your local GPU. +$ python3 -m pytest python/test/unit + +# Move to builddir. Fill in <...> with the full path, e.g. +# `cmake.linux-x86_64-cpython-3.11`. +$ cd python/build/cmake<...> + +# Run C++ unit tests. +$ ctest -j32 + +# Run lit tests. +$ lit test +``` + +You may find it helpful to make a symlink to the builddir and tell your local +git to ignore it. + +```shell +$ ln -s python/build/cmake<...> build +$ echo build >> .git/info/exclude +``` + +Then you can e.g. rebuild and run lit with the following command. + +```shell +$ ninja -C build && ( cd build ; lit test ) ``` # Tips for hacking @@ -203,124 +213,27 @@ For detailed instructions on how to debug Triton's frontend, please refer to thi - `TRITON_DEFAULT_FP_FUSION` overrides the default behavior of allowing fp fusion (mul+add->fma). - `MLIR_ENABLE_REMARK` enables the performance warnings that are emitted as remarks. -# Usage Guide - -## Code Modifications -Intel® XPU Backend for Triton\* requires a special version of PyTorch that can be built from sources or installed from nightly wheels. - -1. Add `import torch` for xpu support. -2. Put the tensor and models to XPU by calling `to('xpu')`. - -This repository contains modified [tutorials](python/tutorials) that must be used with Intel® XPU Backend for Triton\*. - -The following examples show modifications for the user code. - -### Example 1 : Triton Kernel - -This example is a modified version of [Vector Add](https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html#vector-addition) triton kernel. Please refer to [Vector Add](https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html#vector-addition) for detailed comments and illustration about the code semantics. - -Comparing to the original code, the following code modifies: - -```Python -import torch -import triton -import triton.language as tl - - -@triton.jit -def add_kernel( - x_ptr, - y_ptr, - output_ptr, - n_elements, - BLOCK_SIZE: tl.constexpr, -): - pid = tl.program_id(axis=0) - block_start = pid * BLOCK_SIZE - offsets = block_start + tl.arange(0, BLOCK_SIZE) - mask = offsets < n_elements - x = tl.load(x_ptr + offsets, mask=mask) - y = tl.load(y_ptr + offsets, mask=mask) - output = x + y - tl.store(output_ptr + offsets, output, mask=mask) - -def add(x: torch.Tensor, y: torch.Tensor): - # Put the tensor to xpu - output = torch.empty_like(x).xpu() - assert x.is_xpu and y.is_xpu and output.is_xpu - n_elements = output.numel() - grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),) - add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024) - - return output - -# For manual_seed, needs to use API for XPU -torch.xpu.manual_seed(0) -size = 512 -# For tensors, needs to be put on XPU -x = torch.rand(size, device='xpu') -y = torch.rand(size, device='xpu') -output_torch = x + y -output_triton = add(x, y) -print(output_torch) -print(output_triton) -print( - f'The maximum difference between torch and triton is ' - f'{torch.max(torch.abs(output_torch - output_triton))}' -) -``` - -### Example 2 : End-to-End Model -Triton is transparent for end-to-end models. 
One could easily use `torch.compile` with `inductor` as backend by default. It will automatically generates triton kernel and gets benefit from it. - -```Python -import torch -from torch._dynamo.testing import rand_strided - -from torch.nn import * -class simpleModel(torch.nn.Module): - def __init__(self): - super().__init__() - # tensors inside model should be on xpu - self.y = rand_strided((32, 8), (8, 1), device='xpu:0', dtype=torch.float32) - - def forward(self, x): - z = x + self.y - return z - -# tensors passed to the model should be on xpu -x = rand_strided((32, 8), (8, 1), device='xpu:0', dtype=torch.float32) -xpu_model = simpleModel() -# Call torch.compile for optimization -optimized_mod = torch.compile(xpu_model) - -graph_result = optimized_mod(x) -``` - -## Performance Analysis Guide - -There are several ways of doing performance analysis. We recommend using `torch.profiler` for end-to-end performance analysis and using Intel® VTune™ Profiler for more detailed kernel analysis. We provide a comprehensive guide for those two: -1. [end_to_end_tests#profiling settings](docs/test_docs/end_to_end_tests.md#profiling-settings) section for using `torch.profiler`. -2. [VTune Profiling Guide](docs/VTune_Profiling.md) for kernel analysis. +# Changelog -Note that the user needs to explicitly set `TRITON_XPU_PROFILE=1` when the user needs to enable kernel profiling. +Version 2.0 is out! New features include: -```Bash -export TRITON_XPU_PROFILE=1 -``` +- Many, many bug fixes +- Performance improvements +- Backend rewritten to use MLIR +- Support for kernels that contain back-to-back matmuls (e.g., flash attention) # Contributing -Community contributions are more than welcome, whether it be to fix bugs or to add new features at [github](https://github.com/intel/intel-xpu-backend-for-triton). For more detailed instructions, please visit our [contributor's guide](CONTRIBUTING.md). - -## License +Community contributions are more than welcome, whether it be to fix bugs or to add new features at [github](https://github.com/triton-lang/triton/). For more detailed instructions, please visit our [contributor's guide](CONTRIBUTING.md). -_MIT License_. As found in [LICENSE](https://github.com/intel/intel-xpu-backend-for-triton/blob/main/LICENSE) file. +# Compatibility +Supported Platforms: -## Security +- Linux -See Intel's [Security Center](https://www.intel.com/content/www/us/en/security-center/default.html) -for information on how to report a potential security issue or vulnerability. +Supported Hardware: -See also: [Security Policy](security.md) +- NVIDIA GPUs (Compute Capability 8.0+) +- AMD GPUs (ROCm 5.2+) +- Under development: CPUs