Extremely slow fp8 conv2d wgrad operation #103

jimgao1 · 2024-08-27T18:26:52Z

Describe the bug
fp8 e4m3 wgrad seems to be extremely slow compared to both FP32 and FP16, often 50x to 100x slower.

I have attached the profiling results in this Google spreadsheet.

I have tested a variety of problem sizes. For each size I have measured fp16 wgrad and fp8 wgrad with a number of different variants (wrt the IO/intermediate/compute data types).

Expected behavior
We expect fp8 wgrad operators to be at least as fast (if not faster) than its fp16 and fp32 counterparts.

System Environment (please complete the following information):

cudnn_frontend version: v1.6.1 (commit 2533f5e)
cudnn_backend version: v9.3.0
GPU arch: H100
cuda runtime version: 12.2
cuda driver version: 535.161.08
host compiler: g++
OS: Ubuntu 22.04.4 LTS

API logs

Both frontend and backend logs are attached in this gist.

To Reproduce
Compile and run the benchmarking script.
Command I used to compile is:

/usr/local/cuda/bin/nvcc -I/home/ybgao/third_party/cudnn-frontend/include -std=c++20 -gencode=arch=compute_90,code=sm_90 -lcudnn -o main main.cu

Additional context
This issue references this post on nvidia forums.

The text was updated successfully, but these errors were encountered:

yanqinz2 · 2024-08-27T19:30:12Z

Hi jimgao1,

I noticed that you are using NHWC layout for both dy and X tensor for fp8 wgrad, which is a low performance configuration (https://docs.nvidia.com/deeplearning/cudnn/latest/developer/graph-api.html#supported-graph-patterns : "All tensors can be in either NHWC or CHWN layout. In general, both dy and X tensors provide the best performance when they are in a CHWN layout.").
Could you try to set dy and X tensors to have CHWN layout?

jimgao1 · 2024-08-27T20:13:18Z

Hi @yanqinz2, thanks for the suggestion! I tried setting the layout to CHWN but the performance remains to be suboptimal. Included below is my change as well as the profiling results for CHWN.

Code change:

    auto X = graph->tensor(fe::graph::Tensor_attributes()
                    .set_name("image")
                    .set_dim({n, k, p, q})
		    // .set_stride({k * p * q, 1, k * p, k})
                    .set_stride({1, n * p * q, n * q, n})
                    .set_data_type(io_type));

    auto DY = graph->tensor(fe::graph::Tensor_attributes()
                    .set_name("grad")
                    .set_dim({n, c, h, w})
                    // .set_stride({h * w * c, 1, w * c, c})
                    .set_stride({1, h * w * n, w * n, n})
                    .set_data_type(io_type));

Measurements for n = 64, h = 56, w = 56, k = 64, c = 64, r = 3, s = 3, stride = 1, padding = 1:

config	runtime
fp16	52.16us
fp8 float accum	1.41ms
fp8 half accum	1.41ms
fp8 half accum fast math	1.14ms

jimgao1 · 2024-09-18T17:17:56Z

May I know if this issue is related to the:

For some ConvolutionBwdFilter depthwise convolutional workloads, cuDNN may generate racecheck hazard warnings. This issue exists in previous v9 releases and will be fixed in a future release.

item in the cuDNN release notes?

yanqinz2 · 2024-09-18T19:22:36Z

It is not related with the item in the release note. It is actually an issue with heuristics mode A. We are actively working on this one.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extremely slow fp8 conv2d wgrad operation #103

Extremely slow fp8 conv2d wgrad operation #103

jimgao1 commented Aug 27, 2024 •

edited

Loading

yanqinz2 commented Aug 27, 2024

jimgao1 commented Aug 27, 2024

jimgao1 commented Sep 18, 2024

yanqinz2 commented Sep 18, 2024

Extremely slow fp8 conv2d wgrad operation #103

Extremely slow fp8 conv2d wgrad operation #103

Comments

jimgao1 commented Aug 27, 2024 • edited Loading

yanqinz2 commented Aug 27, 2024

jimgao1 commented Aug 27, 2024

jimgao1 commented Sep 18, 2024

yanqinz2 commented Sep 18, 2024

jimgao1 commented Aug 27, 2024 •

edited

Loading