CUDA '2D' Algorithm Questions #14
Comments
Hi, the code in this repo has been superseded by https://github.com/lab-cosmo/mops, which contains the same operation as here as well as a couple of others, and is where all new development is happening. I was not personally involved in writing the CUDA code in this repo, but I'll try to find someone who understands it to answer your questions!
Hey Austin, I wasn't involved in this project, but I am pretty experienced with writing these ops, so I might be able to offer some insight. If you take a look at the function definition here (`sparse_accumulation/sparse_accumulation/cuda_extension/sparse_accumulation_cuda_kernel2D.cu`, lines 244 to 262 at commit `04068ed`),
you'll see that it takes as input:
and outputs:
The code does not additionally do an outer product over the features of X1 and the features of X2, which is what you've written in your question. I would say that the performance of this code mostly comes from this loop (`sparse_accumulation/sparse_accumulation/cuda_extension/sparse_accumulation_cuda_kernel2D.cu`, lines 96 to 109 at commit `04068ed`),
where it just keeps a per-thread running sum over the output index list and only writes out when the next index in the list differs from the current one. This avoids the need for shared memory and atomics.
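For concreteness, here is a minimal Python sketch of that accumulate-then-write pattern, written out serially for a single batch/feature slice the way one GPU thread would see it. The index lists are assumed to be sorted by output index, and all names here are illustrative rather than taken from the kernel.

```python
import numpy as np

def one_thread_sparse_accumulate(x1_row, x2_row, mu1, mu2, mu_out, C, m_out):
    """Sparse accumulation of products for a single batch/feature slice:
    x1_row: (m1,), x2_row: (m2,) -> (m_out,)."""
    out = np.zeros(m_out, dtype=x1_row.dtype)
    acc = 0.0  # per-thread running sum; lives in a register on the GPU
    for k in range(len(mu_out)):
        acc += C[k] * x1_row[mu1[k]] * x2_row[mu2[k]]
        # Write out only when the next output index differs from the current
        # one, so each output element is written exactly once and no atomics
        # or shared-memory reductions are needed.
        if k + 1 == len(mu_out) or mu_out[k + 1] != mu_out[k]:
            out[mu_out[k]] = acc
            acc = 0.0
    return out
```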
Okay, I wasn't sure if all the `nfeatures` necessarily had to be the same.
That's interesting. So is the `nfeatures` dimension any different from the batch dimension, assuming you're just doing the same CG contraction for every input?

```python
x1 = tensor(batch, nfeatures, m1)
out_1 = sparse_accumulate(x1, x2, ...)

x1 = x1.reshape(-1, 1, m1)
out_2 = sparse_accumulate(x1, x2, ...)
out_2 = out_2.reshape(batch, nfeatures, -1)
```

(a runnable version of this check is sketched below)

I'll try to get the repo running and confirm this, but it doesn't look like the "2D" kernel is the one that's actually used. It looks like the "non-2D" CUDA kernel is the one that is linked, and it currently has a lot commented out. I'll try to get things running and then come back.
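Here is a quick NumPy check of the equivalence I have in mind; `sparse_accumulate` below is a naive dense-loop stand-in for the operation, not this repo's actual API, and the shapes and index lists are made up for illustration.

```python
import numpy as np

def sparse_accumulate(x1, x2, mu1, mu2, mu_out, C, m_out):
    """Naive reference: out[..., mu_out[k]] += C[k] * x1[..., mu1[k]] * x2[..., mu2[k]]."""
    out = np.zeros(x1.shape[:-1] + (m_out,))
    for k in range(len(mu_out)):
        out[..., mu_out[k]] += C[k] * x1[..., mu1[k]] * x2[..., mu2[k]]
    return out

batch, nfeatures, m1, m2, m_out = 4, 8, 3, 5, 7
n_sparse = 20
rng = np.random.default_rng(0)
x1 = rng.standard_normal((batch, nfeatures, m1))
x2 = rng.standard_normal((batch, nfeatures, m2))
mu1 = rng.integers(0, m1, n_sparse)
mu2 = rng.integers(0, m2, n_sparse)
mu_out = rng.integers(0, m_out, n_sparse)
C = rng.standard_normal(n_sparse)

out_1 = sparse_accumulate(x1, x2, mu1, mu2, mu_out, C, m_out)
# Fold nfeatures into the batch dimension: same CG contraction for every slot.
out_2 = sparse_accumulate(x1.reshape(-1, 1, m1), x2.reshape(-1, 1, m2),
                          mu1, mu2, mu_out, C, m_out).reshape(batch, nfeatures, m_out)
assert np.allclose(out_1, out_2)
```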
Yeah, I noticed that. Do you know what the conclusion of this project was?
Also, I take it back; I see it now. Can you recommend a PyTorch / Python version on which this can successfully build? I'm getting import errors when I try to run the tests for CPU and GPU (which may very easily be my fault).
My major software versions are:
You should use https://github.com/lab-cosmo/mops for this; the algorithm corresponding to this repo is the sparse accumulation of products. This repo is no longer being worked on by us.
Hello team,
My name is Austin Glover, and I'm currently trying to create fast GPU implementations of the sparse contraction operation that is central to the tensor product. I just found your library and discovered that you've had a lot of good ideas about how to accelerate this operation. One of the ideas, in the `cuda_extension` folder, is `sparse_accumulation_cuda_kernel2D.cu`. If I understand correctly, the idea is to have each thread do the "outer product" over the "feature dimension" in the innermost loop. So if you had inputs like `(batch, in1_feature_dim, in1_irrep_dim)` and `(batch, in2_feature_dim, in2_irrep_dim)`, each thread would create one batch of `(in1_feature_dim, in2_feature_dim, output_irrep_dim)`. This is a clever way to increase the "arithmetic intensity" of the "sparse operation". I was curious whether you found this approach to be successful in providing a speedup?

This whole project is a really cool and clever approach to the problem!
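To make the question concrete, the operation I have in mind looks something like the NumPy sketch below: a sparse CG-style contraction combined with an outer product over the two feature dimensions. The function name, signature, and index-list layout are just illustrative; they are not taken from this repo.

```python
import numpy as np

def sparse_outer_accumulate(x1, x2, mu1, mu2, mu_out, C, out_irrep_dim):
    """x1: (batch, f1, m1), x2: (batch, f2, m2) -> (batch, f1, f2, out_irrep_dim)."""
    batch, f1, _ = x1.shape
    _, f2, _ = x2.shape
    out = np.zeros((batch, f1, f2, out_irrep_dim))
    for k in range(len(mu_out)):
        # Outer product over the feature dimensions for one sparse
        # (mu1, mu2, mu_out) triple, scaled by the CG coefficient C[k].
        out[..., mu_out[k]] += C[k] * np.einsum(
            "bi,bj->bij", x1[..., mu1[k]], x2[..., mu2[k]]
        )
    return out
```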