CUDA '2D' Algorithm Questions #14

asglover · 2024-07-11T19:58:27Z

Hello team,

My name is Austin Glover, and I'm currently trying to create fast GPU implementations of the sparse contraction operation that is central to the tensor product. I just found your library and discovered that you've had a lot of good ideas about how to accelerate this operation. One of the ideas in the cuda_extension folder is sparse_accumulation_cuda_kernel2D.cu. If I understand the idea correctly, the idea is to have each thread do the "outer product" of the "feature dimension" in the inner most loop. So if you had inputs like (batch, in1_feature_dim, in1_irrep_dim) and (batch, in2_feature_dim, in2_irrep_dim) each thread would create one batch of (in1_feature_dim, in2_feature_dim, output_irrep_dim). This is a clever way to increase the "arithmetic intensity" of the "sparse operation". I was curious if you found this approach to be successful in providing a speedup?

This whole project is a really cool and clever approach to the problem!

Austin

The text was updated successfully, but these errors were encountered:

Luthaf · 2024-07-12T08:00:26Z

Hi,

The code in this repo has been superseded by https://github.com/lab-cosmo/mops, containing the same operation as here as well as a couple of others, and is where all new development is happening.

I was not personally involved in writing the CUDA code in this repo, but I'll try to find someone how understands it to answer your questions!

nickjbrowning · 2024-07-12T08:24:40Z

Hey Austin,

I wasn't involved in this project but I am pretty experienced with writing these ops, so I might be able to offer some insight. If you take a look at the function definition here:

sparse_accumulation/sparse_accumulation/cuda_extension/sparse_accumulation_cuda_kernel2D.cu

Lines 244 to 262 in 04068ed

    
           /* Forward propagation function 
        
            * It computes the sparce accumulation assuming the last  
        
            * dimension being the active-sparse dimension.   
        
            *  
        
            * The operation can be summarized in this equation: 
        
            *  
        
            * output_tensor[isample,ifeatures,idx_output[m]] = X1[[isample,ifeatures,idx_1[m]]] * X2[[isample,ifeatures,idx_1[m]]] * multipliers[m] 
        
            * 
        
            * input:  
        
            *      X1[nsample,nfeatures,m1]: First input tensor 
        
            *      X2[nsample,nfeatures,m2]: Second input Tensor 
        
            *      idx_output: Tensor with the indeces of the third dimension of the output_tensor 
        
            *      output_size: third dimension of the output Tensor 
        
            *      idx_1: Tensor with the indeces of the third dimension of X1 
        
            *      idx_2: Tensor with the indeces of the third dimension of X2 
        
            *      multipliers: Tensor containing the multipliers for the sparse accumulations 
        
            * output:  
        
            *      output_tensor[nsample,nfeatures,output_size]: output Tensor 
        
            */

you'll see that it takes as input:

X1[nsample,nfeatures,m1]
X2[nsample,nfeatures,m2]

and outputs:

output_tensor[nsample,nfeatures,output_size]

The code does not additionally do an outer product over the features of X1 and features of X2, which is what you've written in your question.

I would say that the performance of this code mostly comes from this loop:

sparse_accumulation/sparse_accumulation/cuda_extension/sparse_accumulation_cuda_kernel2D.cu

Lines 96 to 109 in 04068ed

    
           for (int z = 0 ; z < nz ; ++z){ 
        
               z_output = buffer_idx_output[z]; 
        
               if (z_old != z_output) { 
        
                   output_final[z_old] = now; 
        
                   now = 0; 
        
                   z_old = z_output; 
        
               } 
        
               z_X1 = buffer_idx_X1[z]; 
        
               z_X2 = buffer_idx_X2[z]; 
        
               now += buffer_X1_final[z_X1] 
        
                                              * buffer_X2_final[z_X2] 
        
                                              * buffer_multipliers[z]; 
        
           }; 
        
           output_final[z_old] = now;

where it just keeps a per-thread running sum over the output index list, and only writes out when the next index in the list is different to the current. This prevents needing to use shared memory + atomics.

asglover · 2024-07-12T22:34:08Z

Hey Austin,

I wasn't involved in this project but I am pretty experienced with writing these ops, so I might be able to offer some insight. If you take a look at the function definition here:

sparse_accumulation/sparse_accumulation/cuda_extension/sparse_accumulation_cuda_kernel2D.cu

Lines 244 to 262 in 04068ed

/* Forward propagation function

* It computes the sparce accumulation assuming the last

* dimension being the active-sparse dimension.

*

* The operation can be summarized in this equation:

*

* output_tensor[isample,ifeatures,idx_output[m]] = X1[[isample,ifeatures,idx_1[m]]] * X2[[isample,ifeatures,idx_1[m]]] * multipliers[m]

*

* input:

* X1[nsample,nfeatures,m1]: First input tensor

* X2[nsample,nfeatures,m2]: Second input Tensor

* idx_output: Tensor with the indeces of the third dimension of the output_tensor

* output_size: third dimension of the output Tensor

* idx_1: Tensor with the indeces of the third dimension of X1

* idx_2: Tensor with the indeces of the third dimension of X2

* multipliers: Tensor containing the multipliers for the sparse accumulations

* output:

* output_tensor[nsample,nfeatures,output_size]: output Tensor

*/

you'll see that it takes as input:
X1[nsample,nfeatures,m1]
X2[nsample,nfeatures,m2]
and outputs:

output_tensor[nsample,nfeatures,output_size]

okay, I wasn't sure if all the nfeatures necessarily had to be the same.
Also on line 250 I think it should be:
X2[[isample,ifeatures,idx_2[m]]]
instead of
X2[[isample,ifeatures,idx_1[m]]]

The code does not additionally do an outer product over the features of X1 and features of X2, which is what you've written in your question.

That's interesting, so is the nfeatures dimension any different than the batch dimension? (assuming you're just the same CG contraction for every input)
Are out1 and out2 equivalent?

x1 = tensor(batch, nfeatures, m1)
x2 = tensor(batch, nfeatures, m2)

out_1 = sparse_accumulate(x1,x2, ... )

x1.reshape(-1, 1, m1)
x2.reshape(-1, 1, m2)

out_2 = sparse_accumulate(x1,x2, ...)

out2 = out2.reshape(batch, nfeatures, -1)

I'll try and get the repo running and confirm this, but it doesn't look like the "2D" kernel is the one implemented. It looks like the "non 2D" cuda kernel is the one that is linked, and it currently has a lot commented out. I'll try to get things running then come back

I would say that the performance of this code mostly comes from this loop:

sparse_accumulation/sparse_accumulation/cuda_extension/sparse_accumulation_cuda_kernel2D.cu

Lines 96 to 109 in 04068ed

for (int z = 0 ; z < nz ; ++z){

z_output = buffer_idx_output[z];

if (z_old != z_output) {

output_final[z_old] = now;

now = 0;

z_old = z_output;

}

z_X1 = buffer_idx_X1[z];

z_X2 = buffer_idx_X2[z];

now += buffer_X1_final[z_X1]

* buffer_X2_final[z_X2]

* buffer_multipliers[z];

};

output_final[z_old] = now;

where it just keeps a per-thread running sum over the output index list, and only writes out when the next index in the list is different to the current. This prevents needing to use shared memory + atomics.

Yeah, I noticed that the Sparse-Accumulation-of-Products in the mops use the shared memory atomics every time and was wondering if that was a performance driven change, or if it was just a simpler approach. It sounds like you're saying this mehtod check where you would do as much local accumulation in the thread as possible before writing out to shared memory is faster.

Do you know what the conclusion was of this project?

asglover · 2024-07-12T23:30:42Z

Also, I take it back I see that the 2D algorithm is referenced in the setup.py while the regular algorithm is referenced in the jit.py in the cuda_extension folder. I'm assuming that the 2D algorithm is the functional one, as the other file is largely commented out.

Can you recommended a PyTorch / Python version on which this can successfully build? I'm getting import errors when I try and run the tests for CPU and GPU. (which may very easily be my fault)

___________________________________________________ ERROR collecting tests/test_cpp_contiguous.py ____________________________________________________
ImportError while importing test module '/global/u1/a/aglover/sparse_accumulation/tests/test_cpp_contiguous.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/global/homes/a/aglover/.conda/envs/e3nn/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_cpp_contiguous.py:2: in <module>
    from sparse_accumulation.clebsch_gordan import get_real_clebsch_gordan, ClebschGordan
sparse_accumulation/__init__.py:1: in <module>
    from .unified_operation import accumulate
sparse_accumulation/unified_operation.py:2: in <module>
    import sparse_accumulation_active_dim_last_cpp
E   ImportError: /global/homes/a/aglover/.conda/envs/e3nn/lib/python3.9/site-packages/sparse_accumulation_active_dim_last_cpp.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKSs

____________________________________________________ ERROR collecting tests/test_cpp_jit_cuda.py _____________________________________________________
ImportError while importing test module '/global/u1/a/aglover/sparse_accumulation/tests/test_cpp_jit_cuda.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/global/homes/a/aglover/.conda/envs/e3nn/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_cpp_jit_cuda.py:8: in <module>
    from sparse_accumulation.clebsch_gordan import ClebschGordan, get_real_clebsch_gordan
sparse_accumulation/__init__.py:1: in <module>
    from .unified_operation import accumulate
sparse_accumulation/unified_operation.py:2: in <module>
    import sparse_accumulation_active_dim_last_cpp
E   ImportError: /global/homes/a/aglover/.conda/envs/e3nn/lib/python3.9/site-packages/sparse_accumulation_active_dim_last_cpp.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKSs

My major software versions are:

python                    3.9.19          h0755675_0_cpython    conda-forge
pytorch                   2.3.1           cuda120_py39h17b67e0_300    conda-forge

Luthaf · 2024-07-13T08:04:05Z

Can you recommended a PyTorch / Python version on which this can successfully build?

You should use https://github.com/lab-cosmo/mops for this, the algorithm corresponding to this repo is the sparse accumulation of products. This repo is no longer worked on by us.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUDA '2D' Algorithm Questions #14

CUDA '2D' Algorithm Questions #14

asglover commented Jul 11, 2024

Luthaf commented Jul 12, 2024

nickjbrowning commented Jul 12, 2024 •

edited

Loading

asglover commented Jul 12, 2024 •

edited

Loading

asglover commented Jul 12, 2024 •

edited

Loading

Luthaf commented Jul 13, 2024 •

edited

Loading

CUDA '2D' Algorithm Questions #14

CUDA '2D' Algorithm Questions #14

Comments

asglover commented Jul 11, 2024

Luthaf commented Jul 12, 2024

nickjbrowning commented Jul 12, 2024 • edited Loading

asglover commented Jul 12, 2024 • edited Loading

asglover commented Jul 12, 2024 • edited Loading

Luthaf commented Jul 13, 2024 • edited Loading

nickjbrowning commented Jul 12, 2024 •

edited

Loading

asglover commented Jul 12, 2024 •

edited

Loading

asglover commented Jul 12, 2024 •

edited

Loading

Luthaf commented Jul 13, 2024 •

edited

Loading