[SYCL][CUDA] Implement sycl_ext_oneapi_peer_access extension #8303
Conversation
This patch moves the CUDA context from the PI context to the PI device, and switches to always using the primary context.

CUDA contexts differ from SYCL contexts in that they are tied to a single device and must be active on a thread for most calls to the CUDA driver API. As shown in intel#8124 and intel#7526, the current mapping of CUDA context to PI context causes issues for device-based entry points that still need to call the CUDA APIs. We have workarounds for that, but they are a bit hacky, inefficient, and have a lot of edge-case issues. The peer-to-peer interface proposed in intel#6104 is also device based, but enabling peer-to-peer for CUDA is done on the CUDA contexts, so the current mapping would make it difficult to implement.

This patch solves most of these issues by decoupling the CUDA context from the SYCL context and simply managing the CUDA contexts in the devices; it also changes the CUDA context management to always use the primary context. This approach has a number of advantages:

* Use of the primary context is recommended by Nvidia.
* It simplifies the CUDA context management in the plugin.
* A CUDA context is available in device-based entry points.
* It is likely more efficient in the general case, with fewer opportunities to accidentally cause costly CUDA context switches.
* Interactions with CUDA runtime applications become easier and likely more efficient.
* It is easier to expose P2P capabilities.
* It is easier to support multiple devices in a SYCL context.

It does have a few drawbacks compared to the previous approach:

* It drops support for `make_context` interop, since there is no sensible "native handle" to pass in (`get_native` is still supported fine).
* Users lose the opportunity to separate their work into different CUDA contexts. It is unclear whether there is any actual use case for this; it seems very uncommon in CUDA codebases to have multiple CUDA contexts for a single CUDA device in the same process.

So overall I believe this should be a net benefit in general, and we could revisit if we run into an edge case that needs more fine-grained CUDA context management.
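For illustration only (not the actual plugin code), the per-device primary-context pattern described above looks roughly like this with the CUDA driver API; the `cuda_device` struct and its members are hypothetical, and error checking is omitted:

```cpp
#include <cuda.h>

// Hypothetical sketch of per-device primary-context management
// (not the actual PI plugin code); error checking omitted.
struct cuda_device {
  CUdevice device_;
  CUcontext primary_ctx_ = nullptr;

  explicit cuda_device(CUdevice dev) : device_(dev) {
    // Retain the device's primary context once, as recommended by Nvidia,
    // instead of creating a fresh context per SYCL context.
    cuDevicePrimaryCtxRetain(&primary_ctx_, device_);
  }

  ~cuda_device() {
    // Balance the retain above.
    cuDevicePrimaryCtxRelease(device_);
  }

  // Device-based entry points always have a context they can activate,
  // avoiding the old workarounds and accidental context switches.
  void set_current() const { cuCtxSetCurrent(primary_ctx_); }
};
```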
Older versions of gcc struggle with attributes on namespaces
This should be resolved. We want our extensions to be fully implemented on all backends. If this part of the API cannot be implemented on CUDA, we should remove it from the extension spec. However, I thought it could be implemented by simply keeping track in software whether P2P has been enabled for each device. Don't we need that anyway in order to diagnose errors correctly?
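For what it's worth, a minimal sketch of that software-side bookkeeping could be an ordered set of directed (device, peer) pairs; this is hypothetical and not taken from the patch:

```cpp
#include <set>
#include <utility>

// Hypothetical bookkeeping (not from the patch): record which directed
// (device, peer) pairs currently have P2P enabled, so double-enables and
// disables of never-enabled pairs can be diagnosed on any backend.
class peer_access_tracker {
  std::set<std::pair<int, int>> enabled_; // (device_id, peer_id), directed

public:
  bool enable(int dev, int peer) {
    // Returns false if P2P was already enabled for this pair.
    return enabled_.emplace(dev, peer).second;
  }
  bool disable(int dev, int peer) {
    // Returns false if P2P was not enabled for this pair.
    return enabled_.erase({dev, peer}) != 0;
  }
  bool is_enabled(int dev, int peer) const {
    return enabled_.count({dev, peer}) != 0;
  }
};
```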
@JackAKirk I reported one of the issues. Is there some test program for me to execute? Thanks.
Yes, here are the two main use cases that I have been testing. You can turn the access on and off and observe how this affects the P2P usage with nsys, but obviously you need two devices.

Note one interesting thing: for the CUDA backend the kernel access is unidirectional, but the P2P copies are bidirectional. If I enable P2P access of device 1 from device 0, then I can do P2P copies both ways, but P2P kernel access works only in the direction I specified. I did not find any Nvidia documentation that explains this.

The P2P query function and how errors are handled are still subject to change in the specification.
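For reference, a stripped-down version of the kind of test described above might look like the following. This is a hedged sketch (not the exact llvm-test-suite test), assuming at least two GPU devices and using the device member functions from the extension proposal in intel#6104; result verification is omitted:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  std::vector<sycl::device> devs =
      sycl::device::get_devices(sycl::info::device_type::gpu);
  if (devs.size() < 2)
    return 0; // needs at least two GPUs

  sycl::queue q0{devs[0]}, q1{devs[1]};
  int *p0 = sycl::malloc_device<int>(1024, q0);
  int *p1 = sycl::malloc_device<int>(1024, q1);
  q1.fill(p1, 42, 1024).wait();

  // Direction matters for kernel access: enabling access to devs[1]'s
  // memory from devs[0] lets kernels on devs[0] dereference p1, while
  // (on CUDA, as noted above) P2P copies then work in both directions.
  if (devs[0].ext_oneapi_can_access_peer(devs[1])) {
    devs[0].ext_oneapi_enable_peer_access(devs[1]);

    // P2P memcpy from devs[1]'s allocation into devs[0]'s.
    q0.copy(p1, p0, 1024).wait();

    // Kernel on devs[0] reading devs[1]'s memory directly.
    q0.parallel_for(sycl::range<1>{1024},
                    [=](sycl::id<1> i) { p0[i] += p1[i]; })
        .wait();

    devs[0].ext_oneapi_disable_peer_access(devs[1]);
  }

  sycl::free(p0, q0);
  sycl::free(p1, q1);
  return 0;
}
```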
Thanks! I tried to run the modified program, but the result of the P2P memory copy is not right after P2P copy is enabled. Not sure if this is reproducible.
I've fixed it here: https://github.com/intel/llvm-test-suite/compare/intel...JackAKirk:llvm-test-suite:p2p_examples?expand=1
You are noting that https://en.cppreference.com/w/cpp/algorithm/copy
The updated p2p example in SYCL might be helpful for you.
@gmlueck: we will add support for the L0 backend in a follow-up patch.
+1 on L0 and UR common code.
It is only fully implemented on CUDA. For L0 and HIP the P2P query function returns false, and the enable/disable functions then return an error if they are called.
Is someone scheduled to do the remaining work soon? If yes, we can delay moving the spec until that happens. If there are no immediate plans, we should move the document in this PR so that CUDA users know the extension is available.
I don't know whether someone is scheduled to add the implementation for L0 soon. It should be very simple, but it will require a system with two or more GPUs for verification. I can move the document in this PR.
```cpp
std::ignore = param_value_size_ret;

die("piextPeerAccessGetInfo not "
    "implemented in L0");
```
Rather than `die`, shouldn't we return some sort of "false" status, indicating that P2P isn't available (yet)? That way we can document this extension as "supported", and we can enable end-to-end tests on all backends.

Same for the other backends.
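For illustration, a fallback along those lines might look like the sketch below. The `piextPeerAccessGetInfo` name comes from the snippet above, but its exact signature, the `pi_peer_attr` parameter type, and the error codes are assumptions, not quotes from the patch:

```cpp
// Hypothetical fallback (types and error codes from pi.h assumed): answer
// "no P2P support" instead of dying, so the extension can be documented
// as supported and end-to-end tests enabled on all backends.
pi_result piextPeerAccessGetInfo(pi_device, pi_device, pi_peer_attr,
                                 size_t param_value_size, void *param_value,
                                 size_t *param_value_size_ret) {
  const pi_bool supported = false; // this backend has no P2P path yet
  if (param_value) {
    if (param_value_size < sizeof(pi_bool))
      return PI_ERROR_INVALID_VALUE;
    *static_cast<pi_bool *>(param_value) = supported;
  }
  if (param_value_size_ret)
    *param_value_size_ret = sizeof(pi_bool);
  return PI_SUCCESS;
}
```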
Please make the following changes to the API specification:

- Update the "Status" section using the wording in the template.
- Add a section "Backend support status" noting that this extension is supported only for the CUDA backend. I'd suggest wording like:

  This extension is currently implemented in DPC++ for all devices and backends, however, only the CUDA backend allows peer to peer memory access. Other backends report `false` from the `ext_oneapi_can_access_peer` query.
Thanks for the suggestion, I've made these changes now.
This extension is currently implemented in DPC++ for all GPU devices and backends, however, only the CUDA backend allows peer to peer memory access. Other backends report false from the ext_oneapi_can_access_peer query.
```diff
-Other backends report false from the ext_oneapi_can_access_peer query.
+Other backends report false from the `ext_oneapi_can_access_peer` query.
```

Code font is better here.
Any more reviews for this?

@smaslov-intel can this be merged?
LGTM, @intel/llvm-gatekeepers would merge
This implements the current extension doc from intel#6104 in the CUDA backend only.

Fixes intel#7543.
Fixes intel#6749.

Signed-off-by: JackAKirk <[email protected]>
Co-authored-by: Nicolas Miller <[email protected]>
Co-authored-by: JackAKirk <[email protected]>
Co-authored-by: Steffen Larsen <[email protected]>