
GEMM performance when M/N == 1 is much slower than theoretical #1321

Closed
IMbackK opened this issue Nov 6, 2024 · 2 comments
Comments


IMbackK commented Nov 6, 2024

Similar to ROCm/rocBLAS#1425

GEMM operations that have degenerated to GEMV are a special case that is pretty common, especially since client libraries like PyTorch dispatch GEMV operations as GEMM.

hipblaslt's performance in this case is far below what the available memory bandwidth suggests is theoretically possible; for instance, on an MI100 we get a mere 240 Gflops:

hipblaslt-bench -f matmul -r s -m 1 -n 16192 -k 16192 -i 50
[0]:transA,transB,grouped_gemm,batch_count,m,n,k,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,ldd,stride_d,a_type,b_type,c_type,d_type,compute_type,scaleA,scaleB,scaleC,scaleD,amaxD,activation_type,bias_vector,bias_type,hipblaslt-Gflops,hipblaslt-GB/s,us
    N,N,0,1,1,16192,16192,1,1,16192,0,16192,262180864,1,16192,1,16192,f32_r,f32_r,f32_r,f32_r,f32_r,0,0,0,0,0,none,0,f32_r,239.928,446.955,2185.5
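As a sanity check on these numbers (my arithmetic, not part of the original output): for an m = 1 GEMM the flop count is 2·m·n·k and the traffic is dominated by reading the k×n B matrix, so the reported Gflops and GB/s columns can be reproduced from the 2185.5 µs runtime. Note that hipblaslt-bench's "GB/s" column appears to actually be GiB/s:

```python
m, n, k = 1, 16192, 16192
t = 2185.5e-6  # seconds, from the hipblaslt-bench run above

# flop count of a GEMM: 2 * m * n * k
gflops = 2 * m * n * k / t / 1e9  # ~239.93, matching the reported 239.928

# f32 traffic: A (m*k) + B (k*n) + D (m*n), 4 bytes per element
bytes_moved = 4 * (m * k + k * n + m * n)
gibps = bytes_moved / t / 2**30  # ~446.96, matching the reported "446.955 GB/s"

print(gflops, gibps)
```

So even taken at face value, ~447 GiB/s is well under half of the MI100's nominal ~1.2 TB/s of HBM2 bandwidth.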

The equivalent GEMV call to rocblas is significantly faster:

rocblas-bench -f gemv -r s -m 16192 -n 16192 --lda 16192 -i 50
transA,M,N,alpha,lda,incx,beta,incy,rocblas-Gflops,rocblas-GB/s,us
N,16192,16192,1,16192,1,0,1, 450.232, 900.519, 1164.72
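Comparing the two runs directly (again my arithmetic, not from the report): the GEMV path does the same work in 1164.72 µs instead of 2185.5 µs, so the hipBLASLt GEMM path is nearly a factor of two slower here:

```python
gemm_us = 2185.5   # hipblaslt-bench GEMM with m = 1, from above
gemv_us = 1164.72  # rocblas-bench GEMV, from above
speedup = gemm_us / gemv_us  # ~1.88x in favor of rocBLAS GEMV
print(speedup)
```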

While it is of course possible to check for this case and dispatch to GEMV in client code, doing so at every operation that could potentially degenerate into GEMV is impractical, or even impossible when it would have to be done inside a third-party library (like PyTorch).
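For illustration, the check itself is trivial; the problem is that every GEMM call site would need it. A minimal sketch of the dispatch logic in Python/NumPy (the name `gemm_like` is hypothetical, and NumPy stands in for the two BLAS backends):

```python
import numpy as np

def gemm_like(A, B):
    """Dispatch a GEMM to a GEMV when one output dimension is 1.

    NumPy stands in for hipBLASLt (matmul) and rocBLAS (matrix-vector);
    in a real client the two branches would call different libraries.
    """
    m, k = A.shape
    _, n = B.shape
    if m == 1:
        # C (1 x n) = a^T B: a GEMV on B^T with A's single row as the vector
        return (B.T @ A[0]).reshape(1, n)
    if n == 1:
        # C (m x 1) = A b: a plain GEMV
        return (A @ B[:, 0]).reshape(m, 1)
    return A @ B  # a genuine GEMM

rng = np.random.default_rng(0)
A = rng.standard_normal((1, 8))
B = rng.standard_normal((8, 5))
assert np.allclose(gemm_like(A, B), A @ B)
```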

Thus it would be much better if hipBLASLt (an annoying naming inconsistency there, by the way: why is it not rocBLASLt, with hipBLASLt being just a header?) either contained a special kernel for m/n == 1 or cross-dispatched to rocBLAS's GEMV when possible.

PS: as of cb7a949 the issue template is now completely broken and no longer shows up.

@ppanchad-amd

Hi @IMbackK. Internal ticket has been created to investigate your issue. Thanks!

@jamesxu2

Hi @IMbackK , thanks for reporting this. I reached out internally and I'm told that the hipBLASLt solution pool for MI100 is simply smaller. There are a small number of tuned kernels for MI100 which are provided only for functional enablement, and you should expect to see suboptimal performance on those devices.

[...] cross dispatch to rocblas's GEMV when possible

Regarding cross-dispatch to rocBLAS (/Tensile), I think you should just keep an eye on the existing discussion (ROCm/rocBLAS#1425) on this.

PS: as of cb7a949 the issue template is now completely broken and no longer shows up.

The issue template should also be fixed. Thanks for reporting it.
