-
Notifications
You must be signed in to change notification settings - Fork 279
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ch4/ofi: Convert CUDA device id to handle for fi_mr_regattr #7156
base: main
Are you sure you want to change the base?
Conversation
test:mpich/ch4/gpu |
src/mpl/include/mpl_gpu_cuda.h
Outdated
@@ -22,4 +23,6 @@ typedef volatile int MPL_gpu_event_t; | |||
|
|||
#define MPL_GPU_DEV_AFFINITY_ENV "CUDA_VISIBLE_DEVICES" | |||
|
|||
#define MPL_gpu_device_id_to_handle(...) cuDeviceGet(__VA_ARGS__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than an MPL_gpu_
wrapper, maybe we can directly call cuDeviceGet
? The usage (in fi_mr_regattr
) is always inside the #ifdef MPL_HAVE_CUDA
anyway, which makes it OK to use CUDA-specific calls. If we still want the MPL abstraction, I think that we can name MPL_cuda_
for CUDA-specific functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. This version just calls cuDeviceGet
in ofi_impl.h
.
The abstraction of device id as an integer is a good abstraction above MPL. The opaqueness of MPL_gpu_device_handle_t, on the other hand, makes it useless, since the upperlayer can't do anything with it. For ZE, simply expose the internal device id. For new GPU runtimes that does support integer device ids, we can similarly do a map as in mpl_gpu_ze.c.
Libfabric docs say that the value of the cuda field in the regattr struct is the device handle gotten from cuDeviceGet, not the ordinal. Fixes pmodels#7148.
test:mpich/ch4/gpu/ofi |
Pull Request Description
Libfabric docs say that the value of the cuda field in the regattr
struct is the device handle gotten from cuDeviceGet, not the
ordinal. Fixes #7148.
This PR picks [842f4cf] from #7049 in order to avoid confusion between device ids and handles.
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your companies PR approval manager.