Thread allocation issues with batched density compensation #52
Hello @headmeister, you're correct that there is no planning stage; for the FFT step we just use PyTorch's FFT. Could you let me know what operating systems your two machines use and what version of PyTorch you have? It's been a very long time since I worked on the threading backend, but I do remember observing fairly different characteristics on Linux, macOS, and Windows. For what it's worth, we use multi-threading largely because some of the interpolation subroutines are forked over the batch dimension.
Working setup was:

Non-functional setup was:

I also tried a Windows machine with an AMD Ryzen 8-core CPU, and there were no issues there either. What was common for all of them, however, was that when processing a set with multiple trajectories across the batch dimension, the benefit of using a GPU was basically zero; it is CPU bound for some reason. When working in the way as in your performance check, that is, using a single trajectory for multiple input k-spaces, the GPU acceleration was very noticeable...
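To make the difference concrete, the two usages I am comparing look roughly like this (just a sketch with made-up sizes and coil counts; tkbn.KbNufft is the forward NUFFT operator from the package):

```python
import math
import torch
import torchkbnufft as tkbn

device = torch.device("cuda")
im_size = (256, 256)  # made-up image size
nufft_ob = tkbn.KbNufft(im_size=im_size).to(device)

# 100 frames, 8 coils (made-up), complex images on the GPU.
image = torch.randn(100, 8, *im_size, dtype=torch.complex64, device=device)

# Fast case: one trajectory shared by all frames, shape (ndim, klength).
shared_ktraj = torch.rand(2, 37 * 256, device=device) * 2 * math.pi - math.pi
kdata_shared = nufft_ob(image, shared_ktraj)

# CPU-bound case: a different trajectory per frame, shape (nframes, ndim, klength).
batched_ktraj = torch.rand(100, 2, 37 * 256, device=device) * 2 * math.pi - math.pi
kdata_batched = nufft_ob(image, batched_ktraj)
```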
Okay, so to summarize, you have 100 time points. Is this a 2D acquisition? How many coils? And which version of torchkbnufft?
Yes, 2D acquisition. The torchkbnufft version is the newest from pip; that should be 1.3.0.
Hello @headmeister, my understanding is we have two issues: (1) the density compensation error and (2) the slow batched NUFFTs.

For (1), this is an obscure error that I haven't encountered before. Have you tried reducing the number of available threads?

For (2), I think a problem might be that you have many tiny problems - even more tiny than we normally expect for dynamic imaging. The threads might not be getting enough work. I do not observe any differences in performance for any thread count when running on CPU, possibly because the overhead of creating and destroying threads is similar to the computation work. For GPU, I can actually get a 60% speedup by using 8 threads instead of 40.

All of my tests were on the current version of the package. Let me know if any of these help you.
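If it helps, here is a minimal sketch of how the thread count could be reduced before calling the batched density compensation. The specific calls and sizes are assumptions on my side: torch.set_num_threads and torch.set_num_interop_threads are standard PyTorch settings, and which pool actually governs the forked interpolation may differ.

```python
import math
import torch
import torchkbnufft as tkbn

# Reduce the thread pools before any parallel work starts. Which pool matters
# is an assumption: set_num_threads controls intra-op parallelism, while
# set_num_interop_threads controls torch.jit.fork tasks.
torch.set_num_threads(8)
torch.set_num_interop_threads(8)  # must be set before any inter-op work runs

im_size = (256, 256)  # hypothetical image size
klength = 37 * 256    # hypothetical spokes * readout points

# Random trajectory in radians, shape (ndim, klength).
ktraj = torch.rand(2, klength) * 2 * math.pi - math.pi

dcomp = tkbn.calc_density_compensation_function(ktraj=ktraj, im_size=im_size)
```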
Hello,

(1) The DCF computation now fails with a differently reported error:

/torchkbnufft/_nufft/interp.py", line 533, in calc_coef_and_indices_fork_over_batches
RuntimeError: The following operation failed in the TorchScript interpreter.

Again, this error applies to the DCF computation only, but I think it might be the same issue as before, only it is reported differently.

(2) Setting the number of threads did not have much impact on our system. But this may be due to the fact that the cores are generally weaker when there are many of them. So, while it may lower the thread creation/destruction overhead, the cores are too weak to compute fast enough.
The package shouldn't change the device of the tensors at all after creation - it should use the device of the tensors that you pass in. New tensors should be created on the target device. The only CPU-GPU communication is sending computation instructions to the GPU. You can see the logic for this in the interp functions: https://github.com/mmuckley/torchkbnufft/blob/main/torchkbnufft/_nufft/interp.py. You could try dropping some print statements in there to see if any Tensor types are mismatched.

For my 40-core system the cores are also a little slow. It is also a 2-socket system if I recall correctly. In terms of hardware the primary difference would be AMD - I don't have an AMD system to test on.
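If it helps, a throwaway debugging helper along these lines (hypothetical, not part of the package) could be dropped into interp.py to surface a device mismatch:

```python
import torch

def report_devices(stage: str, **tensors) -> None:
    """Print the device of each tensor so CPU/GPU mismatches stand out."""
    devices = {name: str(t.device) for name, t in tensors.items() if torch.is_tensor(t)}
    if len(set(devices.values())) > 1:
        print(f"[{stage}] device mismatch: {devices}")

# Example call inside an interpolation routine (argument names are illustrative):
# report_devices("table_interp", image=image, omega=omega, table=tables[0])
```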
I get this same error during backwards passes over a batched NUFFT, but in table_interp_adjoint.
I do not quite understand why, but maybe this helps with finding the bug. |
@wouterzwerink I have the same error for batched inputs with varying sizes in |
@wouterzwerink @mlaves please open a separate issue - that error is not related to thread allocation. |
@mmuckley Thanks, will do. |
Hello,
I encountered this problem on our computation server; it has a dual-socket setup with two CPUs, each having 64 cores (together 128 cores and 256 threads). I have a dynamic dataset with radial sampling in which there are 100 frames, each having 37 spokes (different trajectories). I wanted to compute the density compensation in a batched way so that it is faster. When run in the batched form, I get errors about failing to obtain resources from libgomp:
libgomp: Thread creation failed: Resource temporarily unavailable
This does not differ whether I run the compensation on a CPU (the trajectories are on a CPU device) or on a GPU (CUDA).
On the other hand, when I run this density compensation in a for loop (one 2D frame at a time), I get the results without an issue.
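For illustration, a minimal sketch of what I mean (array sizes here are just placeholders; the batched call is the variant that fails on this server, the loop is the one that works):

```python
import math
import torch
import torchkbnufft as tkbn

im_size = (256, 256)
nframes, spokes, readout = 100, 37, 256

# A different radial trajectory per frame, shape (nframes, ndim, spokes * readout).
ktraj = torch.rand(nframes, 2, spokes * readout) * 2 * math.pi - math.pi

# Batched call that fails with "libgomp: Thread creation failed" here:
# dcomp = tkbn.calc_density_compensation_function(ktraj=ktraj, im_size=im_size)

# Per-frame loop that works fine; collect the per-frame weights into one tensor.
dcomp = torch.stack(
    [
        tkbn.calc_density_compensation_function(ktraj=ktraj[i], im_size=im_size)
        for i in range(nframes)
    ]
)
```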
Also, when working with the forward/adjoint operators later on, they work in a batched form just fine (although quite slowly). This problem is related only to the density compensation function. Might this be an issue with thread allocation specifically in this function?
Unfortunately, I am able to reproduce this problem only on this specific setup. I tried it on a different PC, and there the batch computation works just fine, but it is quite slow even on a GPU: when I looked at the GPU usage it was very low, while the CPU usage was quite high. I first thought that there is some precomputation going on when the trajectories are not the same for each frame, although it might be related to the fact that too many threads are being created in that case. I am not that familiar with this implementation. I am used to planning the NUFFT, as in for example gpuNUFFT or pynufft, and then applying the transforms, and here it seems different, as no planning step is taken ahead of applying the forward/adjoint operations.