
[PT2.7][Torch.compile] Performance analysis and optimization #1004

Open
riverliuintel opened this issue Oct 22, 2024 · 5 comments
@riverliuintel

🚀 The feature, motivation and pitch

Analyze Triton kernel performance data and report findings to Triton XPU.

  1. Re-collect reasonable, competitive GPU performance data.
  2. Use the TorchInductor built-in benchmark tool to detect slower XPU Triton kernels.
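For step 2, Inductor can emit a timing harness into its generated kernels via its config system. A minimal sketch, assuming the `TORCHINDUCTOR_BENCHMARK_KERNEL` environment variable maps to `torch._inductor.config.benchmark_kernel` (the script name is a placeholder):

```shell
# Ask Inductor to benchmark each generated Triton kernel so slow XPU
# kernels can be spotted; run_model.py stands in for the real workload.
TORCHINDUCTOR_BENCHMARK_KERNEL=1 python run_model.py
```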

Alternatives

No response

Additional context

No response

@jianyizh

jianyizh commented Nov 11, 2024

scatter op issue: intel/intel-xpu-backend-for-triton#2665
In general, should we always use fp32 instead of fp16/bf16 for atomic-related ops, regardless of the accuracy change?
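The accuracy concern behind fp32 atomics can be illustrated without a GPU. This is not the Triton kernel itself, just a numpy sketch of why a low-precision accumulator loses updates: once the running sum's ULP exceeds the addend, further fp16 adds round to nothing, while an fp32 accumulator keeps them.

```python
import numpy as np

# Many small contributions, as a scatter/atomic-add reduction would see.
vals = np.full(10_000, np.float16(0.001))

acc16 = np.float16(0.0)
for v in vals:
    acc16 = np.float16(acc16 + v)  # fp16 accumulator: stalls well below the true sum

acc32 = np.float32(0.0)
for v in vals:
    acc32 = acc32 + np.float32(v)  # fp32 accumulator: close to the expected ~10.0

print(float(acc16), float(acc32))
```

This is why accumulating atomics in fp32 and casting the final result back to fp16/bf16 usually changes accuracy far less than it appears to.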

@jianyizh

jianyizh commented Nov 21, 2024

layout issues

  1. When the number of convs is small, Inductor disables layout optimization. We have to force it on with TORCHINDUCTOR_FORCE_LAYOUT_OPT; otherwise we may hit inefficient kernels like cat_layernorm in this issue: Some triton kernels gernerated by inductors have low efficiency on PVC 1550 compare to A100 intel-xpu-backend-for-triton#2229
  2. For both XPU and CUDA, when there are more nodes between convs, unnecessary transposes appear. For example: conv (channels-last) + fused bias and leaky_relu (to channels-first) + avg_pool (to channels-last) + conv. It seems Inductor does not propagate layout; fusing bias and activation into the conv would mitigate this.
  3. We currently do not enable the graph-freezing feature. It should help channels-last inference.
  4. For models like pytorch_unet that contain convs with in-channels > out-channels, Inductor chooses channels-first. Conv takes 36.5 ms of the ~75 ms end-to-end fp16 training time; after forcing channels-last, conv takes 15.5 ms, but batch norm becomes much slower. Whether we should use channels-last in this case depends on: Channel last batch norm have bad performance intel-xpu-backend-for-triton#3001
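The workaround in item 1 is just an environment variable. A minimal sketch of the invocation (train.py is a placeholder for the actual workload):

```shell
# Force Inductor's layout optimization on even when its conv-count
# heuristic would disable it, avoiding kernels like cat_layernorm above.
TORCHINDUCTOR_FORCE_LAYOUT_OPT=1 python train.py
```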

@jianyizh

RNN-related ops: #1109
We should fuse these small ops using oneDNN instead of relying on torch.compile for them.

@jianyizh

pad mm: #1129
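The idea behind mm padding is that zero-padding the shared K dimension to a hardware-friendly multiple leaves the product unchanged, since the extra rows/columns contribute nothing. A minimal numpy sketch of that invariant; the `align` value and helper name are illustrative, not Inductor's actual pad_mm pass:

```python
import numpy as np

def padded_matmul(a: np.ndarray, b: np.ndarray, align: int = 16) -> np.ndarray:
    """Zero-pad the shared K dimension to a multiple of `align` before
    multiplying. The zero padding adds nothing to each dot product, so
    the result equals a @ b exactly."""
    k = a.shape[1]
    pad = (-k) % align
    a_p = np.pad(a, ((0, 0), (0, pad)))  # pad columns of a
    b_p = np.pad(b, ((0, pad), (0, 0)))  # pad rows of b
    return a_p @ b_p

# Awkward K=13 gets padded up to 16; numerics are unchanged.
a = np.random.rand(5, 13).astype(np.float32)
b = np.random.rand(13, 7).astype(np.float32)
```

On real hardware the win comes from the padded shapes hitting faster matmul code paths, at the cost of a little extra memory traffic.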

@riverliuintel riverliuintel changed the title [PT2.6][Torch.complie] Performance analysis and optimization [PT2.7][Torch.complie] Performance analysis and optimization Dec 10, 2024