
multi-GPU and CUDA Stream Support #60

Open · wants to merge 32 commits into base: main
Conversation

nickjbrowning
Collaborator

OPSA supported.

@nickjbrowning nickjbrowning changed the title WIP for mulit-GPU and CUDA Streams. mulit-GPU and CUDA Stream Support May 2, 2024
@nickjbrowning nickjbrowning changed the title mulit-GPU and CUDA Stream Support WIP mulit-GPU and CUDA Stream Support May 2, 2024
@nickjbrowning nickjbrowning added the WIP work in progress label May 2, 2024
@nickjbrowning nickjbrowning removed the WIP work in progress label May 3, 2024
@nickjbrowning nickjbrowning changed the title WIP mulit-GPU and CUDA Stream Support mulit-GPU and CUDA Stream Support May 3, 2024
Collaborator

@frostedoyster frostedoyster left a comment


This all looks good to me. Is there a way we can test whether this is working (perhaps from torch)? I remember it wasn't that easy with sphericart. Perhaps we could launch 10 small mops operations on 10 different CUDA streams and see whether we get a speed-up. Would that make sense, @nickjbrowning?

#ifndef MOPS_CUDA_ENABLED
C10_THROW_ERROR(ValueError, "MOPS was not compiled with CUDA support " + A.device().str());
#else
c10::cuda::CUDAGuard deviceGuard{A.device()};
Collaborator


What does this deviceGuard do? I see that it's not being used explicitly.

Collaborator Author


It sets the current CUDA device to match A.device() for the enclosing scope, and restores the previous device when the guard is destroyed.

Collaborator Author

@nickjbrowning nickjbrowning May 10, 2024


There's no easy way that I can see to test from PyTorch whether a kernel has launched on a specific stream. We could probably do it with the CUDA API, but that seems a bit like overkill.

Comment on lines +1 to +5
#ifdef MOPS_CUDA_ENABLED
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>
#endif

Collaborator


I see that mops-torch/src/sap.cpp has not been modified past the headers (i.e. the stream is not actually taken into account), and the same is true for opsaw and sasaw. Is that correct?

Collaborator Author


I've fixed this for SAP. OPSAW and SASAW aren't implemented yet (SASAW is in a different branch), so I'll make them consistent when I get back to that.

@frostedoyster frostedoyster mentioned this pull request May 4, 2024