
GEMM performance when M/N == 1 is much slower than theoretical #1321

Closed
IMbackK opened this issue Nov 6, 2024 · 2 comments
Comments


IMbackK commented Nov 6, 2024

Similar to ROCm/rocBLAS#1425

GEMM operations that have degenerated to GEMV are a special case that is pretty common, especially since client libraries like PyTorch dispatch GEMV operations as GEMM.

hipblaslt's performance in this case is far below what the available memory bandwidth suggests is theoretically possible; for instance, on an MI100 we get a mere 240 Gflops:

hipblaslt-bench -f matmul -r s -m 1 -n 16192 -k 16192 -i 50
[0]:transA,transB,grouped_gemm,batch_count,m,n,k,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,ldd,stride_d,a_type,b_type,c_type,d_type,compute_type,scaleA,scaleB,scaleC,scaleD,amaxD,activation_type,bias_vector,bias_type,hipblaslt-Gflops,hipblaslt-GB/s,us
    N,N,0,1,1,16192,16192,1,1,16192,0,16192,262180864,1,16192,1,16192,f32_r,f32_r,f32_r,f32_r,f32_r,0,0,0,0,0,none,0,f32_r,239.928,446.955,2185.5
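As a sanity check on these numbers (my arithmetic, not part of the original output): for an m = 1 GEMM the flop count is 2·m·n·k and the traffic is dominated by reading the k×n B matrix, so the reported Gflops and GB/s columns can be reproduced from the 2185.5 µs runtime. Note that hipblaslt-bench's "GB/s" column appears to actually be GiB/s:

```python
m, n, k = 1, 16192, 16192
t = 2185.5e-6  # seconds, from the hipblaslt-bench run above

# flop count of a GEMM: 2 * m * n * k
gflops = 2 * m * n * k / t / 1e9  # ~239.93, matching the reported 239.928

# f32 traffic: A (m*k) + B (k*n) + D (m*n), 4 bytes per element
bytes_moved = 4 * (m * k + k * n + m * n)
gibps = bytes_moved / t / 2**30  # ~446.96, matching the reported "446.955 GB/s"

print(gflops, gibps)
```

So even taken at face value, ~447 GiB/s is well under half of the MI100's nominal ~1.2 TB/s of HBM2 bandwidth.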

The equivalent GEMV call to rocblas is significantly faster:

rocblas-bench -f gemv -r s -m 16192 -n 16192 --lda 16192 -i 50
transA,M,N,alpha,lda,incx,beta,incy,rocblas-Gflops,rocblas-GB/s,us
N,16192,16192,1,16192,1,0,1, 450.232, 900.519, 1164.72
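Comparing the two runs directly (again my arithmetic, not from the report): the GEMV path does the same work in 1164.72 µs instead of 2185.5 µs, so the hipBLASLt GEMM path is nearly a factor of two slower here:

```python
gemm_us = 2185.5   # hipblaslt-bench GEMM with m = 1, from above
gemv_us = 1164.72  # rocblas-bench GEMV, from above
speedup = gemm_us / gemv_us  # ~1.88x in favor of rocBLAS GEMV
print(speedup)
```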

While it is of course possible to check for this case and dispatch to GEMV in client code, doing so at every operation that could potentially degenerate into GEMV is impractical, or even impossible when it would have to be done inside a third-party library (like PyTorch).
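For illustration, the check itself is trivial; the problem is that every GEMM call site would need it. A minimal sketch of the dispatch logic in Python/NumPy (the name `gemm_like` is hypothetical, and NumPy stands in for the two BLAS backends):

```python
import numpy as np

def gemm_like(A, B):
    """Dispatch a GEMM to a GEMV when one output dimension is 1.

    NumPy stands in for hipBLASLt (matmul) and rocBLAS (matrix-vector);
    in a real client the two branches would call different libraries.
    """
    m, k = A.shape
    _, n = B.shape
    if m == 1:
        # C (1 x n) = a^T B: a GEMV on B^T with A's single row as the vector
        return (B.T @ A[0]).reshape(1, n)
    if n == 1:
        # C (m x 1) = A b: a plain GEMV
        return (A @ B[:, 0]).reshape(m, 1)
    return A @ B  # a genuine GEMM

rng = np.random.default_rng(0)
A = rng.standard_normal((1, 8))
B = rng.standard_normal((8, 5))
assert np.allclose(gemm_like(A, B), A @ B)
```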

Thus it would be much better if hipBLASLt (an annoying naming inconsistency there, by the way: why is it not rocBLASLt, with hipBLASLt being just a header?) either contained a special kernel for m/n == 1 or cross-dispatched to rocBLAS's GEMV when possible.

PS: as of cb7a949 the issue template is now completely broken and no longer shows up.

@ppanchad-amd

Hi @IMbackK. Internal ticket has been created to investigate your issue. Thanks!

@jamesxu2

Hi @IMbackK , thanks for reporting this. I reached out internally and I'm told that the hipBLASLt solution pool for MI100 is simply smaller. There are a small number of tuned kernels for MI100 which are provided only for functional enablement, and you should expect to see suboptimal performance on those devices.

[...] cross dispatch to rocblas's GEMV when possible

Regarding cross-dispatch to rocBLAS (/Tensile), I think you should just keep an eye on the existing discussion (ROCm/rocBLAS#1425) on this.

PS: as of cb7a949 the issue template is now completely broken and no longer shows up.

The issue template should also be fixed. Thanks for reporting it.
