
[PT2.7][Torch.compile] Performance analysis and optimization #1004

Open
riverliuintel opened this issue Oct 22, 2024 · 5 comments
@riverliuintel

🚀 The feature, motivation and pitch

Analyze Triton kernel performance data and report findings to Triton XPU.

  1. Re-collect reasonable, competitive GPU performance data.
  2. Use the TorchInductor built-in benchmark tool to detect slower XPU Triton kernels.
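For step 2, Inductor can emit a timing harness into its generated kernels via its config system. A minimal sketch, assuming the `TORCHINDUCTOR_BENCHMARK_KERNEL` environment variable maps to `torch._inductor.config.benchmark_kernel` (the script name is a placeholder):

```shell
# Ask Inductor to benchmark each generated Triton kernel so slow XPU
# kernels can be spotted; run_model.py stands in for the real workload.
TORCHINDUCTOR_BENCHMARK_KERNEL=1 python run_model.py
```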

Alternatives

No response

Additional context

No response

@jianyizh

jianyizh commented Nov 11, 2024

scatter op issue: intel/intel-xpu-backend-for-triton#2665
In general, should we always use fp32 instead of fp16/bf16 for atomic-related ops, regardless of the accuracy change?
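The accuracy concern behind fp32 atomics can be illustrated without a GPU. This is not the Triton kernel itself, just a numpy sketch of why a low-precision accumulator loses updates: once the running sum's ULP exceeds the addend, further fp16 adds round to nothing, while an fp32 accumulator keeps them.

```python
import numpy as np

# Many small contributions, as a scatter/atomic-add reduction would see.
vals = np.full(10_000, np.float16(0.001))

acc16 = np.float16(0.0)
for v in vals:
    acc16 = np.float16(acc16 + v)  # fp16 accumulator: stalls well below the true sum

acc32 = np.float32(0.0)
for v in vals:
    acc32 = acc32 + np.float32(v)  # fp32 accumulator: close to the expected ~10.0

print(float(acc16), float(acc32))
```

This is why accumulating atomics in fp32 and casting the final result back to fp16/bf16 usually changes accuracy far less than it appears to.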

@jianyizh

jianyizh commented Nov 21, 2024

layout issues

  1. When the number of convs is small, Inductor disables layout optimization. We have to force it on with TORCHINDUCTOR_FORCE_LAYOUT_OPT; otherwise we may hit inefficient kernels like cat_layernorm in this issue: Some triton kernels gernerated by inductors have low efficiency on PVC 1550 compare to A100 intel-xpu-backend-for-triton#2229
  2. For both XPU and CUDA, when there are more nodes between convs, unnecessary transposes appear. For example: conv (channels-last) + fused bias and leaky_relu (to channels-first) + avg_pool (to channels-last) + conv. It seems Inductor does not propagate layout; fusing bias and activation into the conv would mitigate this.
  3. We currently do not enable the graph-freezing feature. It should help channels-last inference.
  4. For models like pytorch_unet that contain convs with in-channels > out-channels, Inductor chooses channels-first. Conv takes 36.5 ms of the ~75 ms end-to-end fp16 training time; after forcing channels-last, conv takes 15.5 ms, but batch norm becomes much slower. Whether we should use channels-last in this case depends on: Channel last batch norm have bad performance intel-xpu-backend-for-triton#3001
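The workaround in item 1 is just an environment variable. A minimal sketch of the invocation (train.py is a placeholder for the actual workload):

```shell
# Force Inductor's layout optimization on even when its conv-count
# heuristic would disable it, avoiding kernels like cat_layernorm above.
TORCHINDUCTOR_FORCE_LAYOUT_OPT=1 python train.py
```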

@jianyizh

RNN-related ops: #1109
We should fuse these small ops using oneDNN instead of relying on torch.compile for them.

@jianyizh

pad mm: #1129
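The idea behind mm padding is that zero-padding the shared K dimension to a hardware-friendly multiple leaves the product unchanged, since the extra rows/columns contribute nothing. A minimal numpy sketch of that invariant; the `align` value and helper name are illustrative, not Inductor's actual pad_mm pass:

```python
import numpy as np

def padded_matmul(a: np.ndarray, b: np.ndarray, align: int = 16) -> np.ndarray:
    """Zero-pad the shared K dimension to a multiple of `align` before
    multiplying. The zero padding adds nothing to each dot product, so
    the result equals a @ b exactly."""
    k = a.shape[1]
    pad = (-k) % align
    a_p = np.pad(a, ((0, 0), (0, pad)))  # pad columns of a
    b_p = np.pad(b, ((0, pad), (0, 0)))  # pad rows of b
    return a_p @ b_p

# Awkward K=13 gets padded up to 16; numerics are unchanged.
a = np.random.rand(5, 13).astype(np.float32)
b = np.random.rand(13, 7).astype(np.float32)
```

On real hardware the win comes from the padded shapes hitting faster matmul code paths, at the cost of a little extra memory traffic.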

@riverliuintel riverliuintel changed the title [PT2.6][Torch.complie] Performance analysis and optimization [PT2.7][Torch.complie] Performance analysis and optimization Dec 10, 2024