Similar to ROCm/rocBLAS#1425.
GEMM operations that have degenerated to GEMV are a fairly common special case, especially since client libraries like PyTorch dispatch GEMV operations as GEMM.
hipBLASLt's performance in this case is much worse than the available memory bandwidth would suggest is theoretically possible; on an MI100, for instance, we get a mere 240 GFLOPS. The equivalent GEMV call to rocBLAS is significantly faster.
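As a rough back-of-envelope check (my own numbers, assuming FP32 and MI100's ~1.2 TB/s peak HBM2 bandwidth): a GEMV over an m×k matrix performs about 2·m·k flops while streaming about 4·m·k bytes of the matrix, i.e. roughly 0.5 flop per byte. That puts the bandwidth-bound ceiling at roughly 0.5 × 1.2 TB/s ≈ 600 GFLOPS, so 240 GFLOPS leaves well over half of the achievable memory bandwidth unused.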
While it is of course possible to check for this case and dispatch to GEMV in client code, doing so at every operation that could potentially degenerate into GEMV is impractical, or even impossible when the check would have to live in a third-party library (like PyTorch).
It would therefore be much better if hipBLASLt (an annoying naming inconsistency there, by the way: why is it not rocBLASLt, with hipBLASLt being just a header?) either contained a special kernel for m/n == 1 or cross-dispatched to rocBLAS's GEMV when possible.
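To illustrate what the client-side workaround looks like, here is a minimal sketch (the gemm_or_gemv helper is hypothetical and mine, not part of any library; it assumes FP32, column-major layout and no transposes, and uses rocblas_sgemm as the fall-through GEMM path just to keep the example self-contained):

```c
// Hypothetical helper (not part of hipBLASLt or rocBLAS): route degenerate
// GEMM calls to GEMV. Sketch only: FP32, column-major, no transposes.
#include <rocblas/rocblas.h> // <rocblas.h> on older ROCm releases

rocblas_status gemm_or_gemv(rocblas_handle handle,
                            rocblas_int m, rocblas_int n, rocblas_int k,
                            const float* alpha,
                            const float* A, rocblas_int lda,
                            const float* B, rocblas_int ldb,
                            const float* beta,
                            float* C, rocblas_int ldc)
{
    if (n == 1)
    {
        // C(m x 1) = alpha * A(m x k) * B(k x 1) + beta * C(m x 1)  ->  plain GEMV
        return rocblas_sgemv(handle, rocblas_operation_none, m, k,
                             alpha, A, lda, B, 1, beta, C, 1);
    }
    if (m == 1)
    {
        // C(1 x n) = alpha * A(1 x k) * B(k x n) + beta * C(1 x n), which is
        // C^T = alpha * B^T * A^T + beta * C^T  ->  transposed GEMV
        return rocblas_sgemv(handle, rocblas_operation_transpose, k, n,
                             alpha, B, ldb, A, lda, beta, C, ldc);
    }
    // General case: stay on the normal GEMM path.
    return rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                         m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
}
```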
PS: as of cb7a949 the issue template is now completely broken and no longer shows up.
Hi @IMbackK, thanks for reporting this. I reached out internally and I'm told that the hipBLASLt solution pool for MI100 is simply smaller. There are a small number of tuned kernels for MI100 which are provided only for functional enablement, and you should expect to see suboptimal performance on those devices.
[...] cross dispatch to rocblas's GEMV when possible
Regarding cross-dispatch to rocBLAS (/Tensile), I think you should just keep an eye on the existing discussion on this (ROCm/rocBLAS#1425).
PS: as of cb7a949 the issue template is now completely broken and no longer shows up.
The issue template should also be fixed. Thanks for reporting it.