[Specification] oneMKL lapack to allow asynchronous functions #589
Comments
Hi @JackAKirk, thanks for the RFC. The oneMKL LAPACK team has had an ongoing discussion on the issue you raise which I'll summarize here. We agree with your assessment of the blocking nature of using exceptions for computation errors and find it entirely reasonable to replace them with info variables (or arrays in the batch case).
SYCL does not allow exceptions to be thrown in kernel scope. We're only aware of the possibility of throwing asynchronous exceptions from host_tasks, which limits their usefulness.
Exception handling of computation errors is not the only blocker for asynchronous behavior. As we understand it, SYCL provides host_task for scheduling CPU tasks with device tasks. A limitation of host_task is that it is undefined behavior to capture queues or events, so even if a kernel updates an info variable it is not possible to asynchronously schedule a task conditioned on the outcome of a prior kernel within the SYCL framework. Furthermore, several oneMKL LAPACK functions do not lend themselves to performant GPU-only implementations and so perform some critical sections on the CPU. While the GPU portions are bound to the context provided by the SYCL queue, the CPU portions generally assume they have unfettered access to CPU resources. For these routines the benefit of asynchronicity is unclear to us.
Thanks for the quick reply!
oneMKL is a library and does not have to use only the existing SYCL 2020 specification. In fact we have already solved this issue for the two backends that it affects via the enqueue_native_command DPC++ extension: please see #572. As I understand it this completely resolves the issue you raise here.
Sure, I understand that certain functions (and/or certain backends) may not be able to take advantage of this. However, the cusolver and rocsolver backends have a large number of functions to which such limitations do not currently apply; it also sounds like the Intel backends have at least a few cases that could take advantage of such an improved interface? And I expect that future generations of Intel implementations will improve on the current situation?
Glad to hear the host_task issues have been worked around, at least for some backends. We support this change; do you plan on driving the spec update over on https://github.com/uxlfoundation/oneAPI-spec?
@Ruyk could I work on this? These linear algebra operators are used in PyTorch, and they are already hooked up to Intel Python's NumPy implementation: https://github.com/IntelPython/dpnp
Thanks for the issue, Jack. We won't have time to work on this at Codeplay, but external contributions to improve this are welcome!
Summary
Linear algebra operators in oneMKL LAPACK that report a computation error (for example, matrix operations such as inversion via getri, which may not have a solution) return this error via an exception ([oneapi::mkl::lapack::computation_error](https://oneapi-spec.uxlfoundation.org/specifications/oneapi/latest/elements/onemkl/source/architecture/architecture#onemkl-lapack-exception-computation-error)). To achieve this there is an implementation constraint that functions such as getri are synchronous, since they generally don't know the error code until completion. This means that even if a programmer inputs a matrix that does have a valid solution for the given operation (e.g. a non-singular matrix for an inverse operation), the user is forced to have all work wait on the return of this synchronous operation to check for an error code that is irrelevant. This affects a large proportion (maybe most?) of oneMKL LAPACK's most computationally intensive functions. Any workload using these functions will be severely bottlenecked with respect to asynchronous performance. However, native libraries such as cusolver (which oneMKL uses) can return this "computation error" information via a return value that is returned asynchronously. Therefore a change to the oneMKL specification would fix this issue.
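To make the contrast concrete, here is a minimal sketch in plain standard C++ (deliberately not SYCL and not the oneMKL API). The names `lu_factor` and `lu_factor_async` are hypothetical, and `std::future` is only a stand-in for whatever mechanism a real interface would use to deliver the info value. It assumes a LAPACK-style convention where `info == 0` means success and `info == k > 0` means the k-th pivot is exactly zero. The point is that the call returns immediately and the caller chooses when (or whether) to inspect the error code.

```cpp
#include <cmath>
#include <cstdint>
#include <future>
#include <utility>
#include <vector>

// Hypothetical helper: in-place LU factorization with partial pivoting of a
// row-major n x n matrix. Returns a LAPACK-style info code instead of throwing:
//   0     -> factorization completed without a zero pivot
//   k > 0 -> pivot k (1-based) is exactly zero, i.e. the matrix is singular
std::int64_t lu_factor(std::vector<double>& a, std::int64_t n) {
    std::int64_t info = 0;
    for (std::int64_t k = 0; k < n; ++k) {
        // Pick the row with the largest magnitude entry in column k.
        std::int64_t p = k;
        for (std::int64_t i = k + 1; i < n; ++i)
            if (std::fabs(a[i * n + k]) > std::fabs(a[p * n + k])) p = i;
        if (a[p * n + k] == 0.0) {
            if (info == 0) info = k + 1;  // remember the first zero pivot
            continue;                     // nothing to eliminate in this column
        }
        if (p != k)
            for (std::int64_t j = 0; j < n; ++j)
                std::swap(a[k * n + j], a[p * n + j]);
        for (std::int64_t i = k + 1; i < n; ++i) {
            a[i * n + k] /= a[k * n + k];
            for (std::int64_t j = k + 1; j < n; ++j)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
    }
    return info;
}

// Hypothetical asynchronous wrapper: returns immediately and delivers the info
// code later. The caller must keep `a` alive and untouched until the future
// has been waited on.
std::future<std::int64_t> lu_factor_async(std::vector<double>& a, std::int64_t n) {
    return std::async(std::launch::async, [&a, n] { return lu_factor(a, n); });
}
```

A caller could launch several factorizations back to back, continue enqueueing other work, and only call `.get()` on the returned futures at the point where the error code actually matters. With an exception-based interface, each call has to complete before the next one can be issued.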
Problem statement
Provide asynchronous oneMKL interfaces for linear algebra operators that currently return "computation error" exceptions.
Details
oneMKL will need to remove the [oneapi::mkl::lapack::computation_error](https://oneapi-spec.uxlfoundation.org/specifications/oneapi/latest/elements/onemkl/source/architecture/architecture#onemkl-lapack-exception-computation-error) exception, and replace it with either: