
Thread allocation issues with batched density compensation #52

Open
headmeister opened this issue Feb 7, 2022 · 11 comments
Labels: bug (Something isn't working)

Comments


headmeister commented Feb 7, 2022

Hello,
I encountered this problem on our computation server; it has a dual-socket setup with two CPUs, each having 64 cores (128 cores and 256 threads in total). I have a dynamic dataset with radial sampling, where there are 100 frames, each having 37 spokes (different trajectories). I wanted to compute the density compensation in a batched way so that it is faster. When run in batched form I get errors about failing to obtain resources from libgomp:

libgomp: Thread creation failed: Resource temporarily unavailable
This does not differ whether I run the compensation on a CPU (the trajectories are on a CPU device) or on a GPU (CUDA).

On the other hand, when I run the density compensation in a for loop (one 2D frame at a time), I get the results without an issue.

Also, when working with the forward/adjoint operators later on, they work in batched form just fine (although quite slowly). This problem is related only to the density compensation function. Might this be an issue with thread allocation specifically in that function?

Unfortunately, I am able to reproduce this problem only on this specific setup. I tried it on a different PC, and there the batched computation works just fine, but quite slowly even on a GPU. When I looked at the GPU usage it was very low, while the CPU usage was quite high. I first thought that there is some precomputation going on when the trajectories are not the same for each frame, although it might be related to the fact that too many threads are created in that case. I am not that familiar with this implementation. I am used to planning the NUFFT, as in for example gpuNUFFT or pynufft, and then applying the transforms; here it seems different, as no planning step is taken ahead of applying the forward/adjoint operations.
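
For reference, a minimal sketch of the two calling patterns I am comparing (the shapes are illustrative, and I am assuming calc_density_compensation_function accepts both a single (2, klength) trajectory and a batched (nbatch, 2, klength) one):

    import math
    import torch
    import torchkbnufft as tkbn

    num_frames, num_spokes, num_pts = 100, 37, 128
    im_size = (128, 128)

    # one radial trajectory per frame, scaled to [-pi, pi)
    ktraj = torch.rand(num_frames, 2, num_spokes * num_pts) * 2 * math.pi - math.pi

    # batched call -- this is the one that fails with the libgomp error
    dcomp_batched = tkbn.calc_density_compensation_function(ktraj=ktraj, im_size=im_size)

    # per-frame loop -- this runs without issues, just more slowly
    dcomp_frames = [
        tkbn.calc_density_compensation_function(ktraj=ktraj[i], im_size=im_size)
        for i in range(num_frames)
    ]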

mmuckley (Owner) commented Feb 7, 2022

Hello @headmeister, you're correct that there is no planning stage. For the FFT step torchkbnufft puts all of that on the PyTorch FFT functions.

Could you let me know what operating systems your two machines use and what version of PyTorch you have? It's been a very long time since I worked on the threading backend, but I do remember observing fairly different characteristics on Linux, macOS, and Windows.

For what it's worth, we use multi-threading largely because some of the subroutines that torchkbnufft calls do not have efficient multithreading for the specific problems we have with NUFFT. For these cases we manually chop up the trajectory ourselves and do thread management over the chopped-up trajectory. The rules for the distribution were tuned for a 2D radial trajectory, and testing has been on 8-core to 40-core systems.
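
To illustrate the pattern (this is a rough sketch of the fork/wait idea, not the library's actual code):

    import torch

    def interp_chunk(chunk: torch.Tensor) -> torch.Tensor:
        # stand-in for the per-chunk interpolation work
        return chunk * 2.0

    def fork_over_chunks(traj: torch.Tensor, num_forks: int) -> torch.Tensor:
        # chop the trajectory along the k-space dimension and dispatch each
        # piece as a TorchScript future, then gather the results in order
        chunks = torch.chunk(traj, num_forks, dim=-1)
        futures = [torch.jit.fork(interp_chunk, chunk) for chunk in chunks]
        return torch.cat([torch.jit.wait(future) for future in futures], dim=-1)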

headmeister (Author) commented Feb 7, 2022

The working setup was:
Ubuntu 20.04.3 LTS
PyTorch 1.10.2, tried with both currently supported versions of CUDA and without it (all worked)
512 GB RAM
AMD EPYC, 24 cores
RTX 2080 Ti

The non-functional setup was:
Ubuntu 20.04.3 LTS
PyTorch 1.10.2, tried with both currently supported versions of CUDA and without it (neither worked)
1 TB RAM
2x AMD EPYC, 64 cores each
2x NVIDIA A100

I also tried a Windows machine with an 8-core AMD Ryzen CPU, and there were no issues there either.

What was common to all of them, however, was that when processing a set with multiple trajectories across the batch dimension, the benefit of using a GPU was basically zero; it is CPU-bound for some reason. When working as in your performance check, that is, using a single trajectory for multiple input k-spaces, the GPU acceleration was very noticeable...
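
To make the two cases concrete, this is roughly what I mean (a sketch with made-up data; I am assuming the operators accept both a shared (2, klength) trajectory and a per-frame (nbatch, 2, klength) one):

    import math
    import torch
    import torchkbnufft as tkbn

    im_size = (128, 128)
    adj_ob = tkbn.KbNufftAdjoint(im_size=im_size)

    # 100 frames, 4 coils, 37 spokes x 128 points per frame
    kdata = torch.randn(100, 4, 37 * 128, dtype=torch.complex64)

    # case A: one trajectory shared by all frames -- GPU acceleration is clear here
    ktraj_shared = torch.rand(2, 37 * 128) * 2 * math.pi - math.pi
    images_shared = adj_ob(kdata, ktraj_shared)

    # case B: a different trajectory per frame -- this is the CPU-bound case
    ktraj_per_frame = torch.rand(100, 2, 37 * 128) * 2 * math.pi - math.pi
    images_per_frame = adj_ob(kdata, ktraj_per_frame)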

mmuckley (Owner) commented Feb 7, 2022

Okay so to summarize you have:

100 time points
37 spokes per time point

How many coils? And which version of torchkbnufft? And I see above you said this is 2D.

headmeister (Author) commented Feb 7, 2022

Yes, 2D acquisition,
4 coils (it's a preclinical Bruker machine).
The acquired data size is 128x4x37x100 (pts x coils x spokes x time points), sampled with a golden-angle 2D radial sequence.

The torchkbnufft version is the newest from pip, which should be 1.3.0.
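
For concreteness, the data gets rearranged into torchkbnufft's (batch, coil, klength) layout roughly like this (random data here, just to show the shapes):

    import torch

    # raw data: (pts, coils, spokes, frames) = (128, 4, 37, 100)
    kdata_raw = torch.randn(128, 4, 37, 100, dtype=torch.complex64)

    # frames become the batch dimension; (spokes, pts) flatten into klength
    kdata = kdata_raw.permute(3, 1, 2, 0).reshape(100, 4, 37 * 128)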

mmuckley (Owner) commented

Hello @headmeister, my understanding is we have two issues: the density compensation error and the slow batched NUFFTs.

For (1), this is an obscure error that I haven't encountered before. Have you tried reducing the number of available threads by setting something like OMP_NUM_THREADS=8?
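
For reference, something along these lines (the script name is just a placeholder):

    # from the shell, before starting Python:
    #   OMP_NUM_THREADS=8 python recon_script.py
    #
    # or from inside Python, which should have a similar effect on CPU-side threading:
    import torch

    torch.set_num_threads(8)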

For (2), I think a problem might be that you have many tiny problems, even tinier than we normally expect for dynamic imaging. The threads might not be getting enough work. I do not observe any differences in performance for any thread count when running on CPU, possibly because the overhead of creating and destroying threads is similar to the computation work. For GPU, I can actually get a 60% speedup by using 8 threads instead of 40, i.e., by setting OMP_NUM_THREADS=8.

All of my tests were on the current main version of torchkbnufft on Linux.

Let me know if any of these help you.

headmeister (Author) commented

Hello,
(1) I tried reducing the number of threads OpenMP uses, and nothing changed regarding the presence of the error. On the other hand, I did update the GPU drivers in the meantime, and the error changed to:

/torchkbnufft/_nufft/interp.py", line 533, in calc_coef_and_indices_fork_over_batches

    # collect the results
    results = [torch.jit.wait(future) for future in futures]
               ~~~~~~~~~~~~~~ <--- HERE
    coef = torch.cat([result[0] for result in results])
    arr_ind = torch.cat([result[1] for result in results])

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: CUDA driver error: invalid device context

Again this error applies to the DCF computation only, but I think it might be the same issue as before, only it is reported differently.

(2) Setting the number of threads did not have much impact on our system. But this may be due to the fact that the cores are generally weaker when there are many of them. So, while it may lower the thread creation/destruction overhead, the cores are too weak to compute fast enough.
I also redid my algorithm using the standard FFT (Cartesian data as input), and there the GPU usage rises to almost 100% while the CPU usage goes basically to zero. The acceleration over the CPU is around a factor of 10 for the whole algorithm, which matches what is usually reported.
As I see it, isn't there some handover between the CPU and GPU somewhere during the NUFFT computation, even though the data is on the GPU? This might take quite some time, especially when it is performed for each thread independently.

mmuckley (Owner) commented

The package shouldn't change the device of the tensors at all after creation - it should use the device of the tensors that you pass in. New tensors should be created on the target device. The only CPU-GPU communication is sending computation instructions to the GPU. You can see the logic for this in the interp function: https://github.com/mmuckley/torchkbnufft/blob/main/torchkbnufft/_nufft/interp.py. You could try dropping some print statements in there to see if any Tensor types are mismatched.
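
For example, a throwaway helper like this (not part of the package) could be dropped into those routines:

    import torch

    def report_devices(name: str, *tensors: torch.Tensor) -> None:
        # print and sanity-check that all tensors involved live on the same device
        devices = {t.device for t in tensors}
        print(f"{name}: devices = {devices}")
        assert len(devices) == 1, f"device mismatch in {name}: {devices}"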

For my 40-core system the cores are also a little slow. It is also a 2-socket system if I recall correctly. In terms of hardware the primary difference would be AMD - I don't have an AMD system to test on.

mmuckley added the bug label on Feb 14, 2022
wouterzwerink commented

(Quoting @headmeister's traceback above, ending in "RuntimeError: CUDA driver error: invalid device context".)

I get this same error during backward passes over a batched NUFFT, but in table_interp_adjoint.
The error disappears when using:

torch._C._jit_set_profiling_mode(False)

I do not quite understand why, but maybe this helps with finding the bug.
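
A minimal sketch of where the call goes, before any NUFFT objects are built (the KbNufft construction here is just a placeholder):

    import torch
    import torchkbnufft as tkbn

    # workaround: disable TorchScript profiling before constructing/using the NUFFT objects
    torch._C._jit_set_profiling_mode(False)

    nufft_ob = tkbn.KbNufft(im_size=(128, 128))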


mlaves commented May 25, 2023

@wouterzwerink I have the same error for batched inputs with varying sizes in torchkbnufft.KbNufft.

mmuckley (Owner) commented

@wouterzwerink @mlaves please open a separate issue - that error is not related to thread allocation.


mlaves commented May 25, 2023

@mmuckley Thanks, will do.
