CUDA out of memory error. #135

Open
eyalol opened this issue Feb 1, 2021 · 14 comments

@eyalol commented Feb 1, 2021

Cheers Hugues,
I've managed to modify the DataLoader to fit the DALES dataset; however, when training the network, I'm getting:

RuntimeError: CUDA out of memory. Tried to allocate 4.73 GiB (GPU 0; 15.78 GiB total capacity; 7.98 GiB already allocated; 3.54 GiB free; 11.08 GiB reserved in total by PyTorch)

I'm using a 16GB GPU, as required for Arjun's KPConv on DALES. As the one who built this network, could you tell me which hyperparameters I might change to solve this issue?

Thank you, and good day.
Eyal

@HuguesTHOMAS (Owner)

You can reduce the batch size a little, but I would not recommend using a value smaller than 4 or 5.
The other parameter that you can reduce is the input radius; equivalently, you can increase the first subsampling dl. The number of points in each input sphere is directly controlled by the ratio between the input radius and the subsampling dl.
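To make the knobs concrete, here is a minimal sketch, assuming a configuration object with the attribute names used in this thread (batch_num, in_radius, first_subsampling_dl); the starting values are illustrative, not the repository's defaults:

```python
# Illustrative configuration sketch: attribute names come from this
# thread; the numbers are example values, not official defaults.
class Config:
    batch_num = 6                # target batch size; not recommended below 4-5
    in_radius = 20.0             # radius of each input sphere (meters)
    first_subsampling_dl = 0.25  # grid size of the first subsampling (meters)

# To reduce memory, either shrink the input spheres...
Config.in_radius = 10.0
# ...or coarsen the first subsampling. Both lower the
# in_radius / first_subsampling_dl ratio, which bounds the
# number of points per input sphere.
Config.first_subsampling_dl = 0.5
```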

@eyalol (Author) commented Feb 1, 2021

> You can reduce the batch size a little, but I would not recommend using a value smaller than 4 or 5.
> The other parameter that you can reduce is the input radius; equivalently, you can increase the first subsampling dl. The number of points in each input sphere is directly controlled by the ratio between the input radius and the subsampling dl.

I don't think that's a good idea; I should probably modify the data loader to load the point cloud in tiles instead. Any advice on this topic would be wonderful.

Thanks for everything, Hugues.

@HuguesTHOMAS (Owner)

Why wouldn't it be a good idea? I don't see how tiles would solve your issue. Are you using my data loader, which picks random spheres in the point clouds? If so, I can definitely tell you that reducing the sphere radius is a good idea; I often found it improved performance.

@eyalol (Author) commented Feb 1, 2021

As of today, I've tried training on the entire DALES dataset on my 16GB GPU, which hit the CUDA memory error in the backward call. Afterwards I tried training on just one of the samples and still got the CUDA memory error, so I assume tuning hyperparameters wouldn't really make a difference for loading the entire set (correct me if I'm wrong).
So I can tell I have a severe memory issue, and splitting into tiles would probably solve it, just like patches in images. I plan on loading the tiles one by one to the GPU and running the network on each, instead of loading the whole dataset into memory. If I'm not mistaken, you even mentioned this idea in one of the issues regarding the DALES dataset.
I would like to hear what you think.

@eyalol (Author) commented Feb 3, 2021

Hello again, @HuguesTHOMAS! I've tried running the network on just one ply file of DALES, containing about 1 million points, with:
batch_num = 4
in_radius = 20
num_kernel_points = 15
conv_radius = 2.5
deform_radius = 6

And I still got the CUDA error on my 16GB GPU.
Does that make sense to you, or do you think something's wrong with my GPU?

Really appreciate the help,
Thanks,
Eyal

@HuguesTHOMAS (Owner)

What is the value of first_subsampling_dl?

@eyalol (Author) commented Feb 3, 2021

@HuguesTHOMAS it's 0.250

Best Regards,
Eyal

@HuguesTHOMAS (Owner)

OK, this is a fair value. Can you try reducing in_radius to 10.0 and see if there is still a CUDA OOM error?

Can you also print the error message that you get?
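As a back-of-envelope illustration (my arithmetic, not part of the thread): the number of subsampled points per sphere scales with the in_radius / first_subsampling_dl ratio, roughly squared for mostly-flat aerial scans like DALES and cubed for fully volumetric scenes, so halving in_radius cuts the per-sphere point count by about 4x to 8x:

```python
def points_per_sphere_estimate(in_radius, dl, dims=2):
    """Rough upper bound on subsampled points in one input sphere.
    dims=2 approximates a mostly-flat aerial scan (like DALES),
    dims=3 a fully volumetric scene. Constant factors omitted."""
    return (in_radius / dl) ** dims

before = points_per_sphere_estimate(20.0, 0.25)  # (20 / 0.25)^2 = 6400
after = points_per_sphere_estimate(10.0, 0.25)   # (10 / 0.25)^2 = 1600
print(before / after)  # 4.0 -> roughly 4x fewer points per sphere
```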

@eyalol (Author) commented Feb 12, 2021

@HuguesTHOMAS Sorry for the late response. I ran it now with in_radius = 10.0 as you suggested, and it works.
Now I should probably figure out how I'm going to split the data into tiles for the training process. Do you agree?
After all, I don't want the hardware to dictate which hyperparameters I can use.

@HuguesTHOMAS (Owner)

Hardware always dictates which hyperparameters you can use in deep learning. You can think of in_radius as the zoom of an image if you were doing 2D CNNs, and first_subsampling_dl as the resolution of the image. Obviously, you have to adapt these values to your hardware, or reduce the memory consumption by using a smaller network.

@eyalol (Author) commented Feb 12, 2021

You are absolutely right. I'm just confused: I've seen in Arjun's repo that he could train KPConv on the whole DALES dataset with the same hardware as mine, using the same hyperparameters, while I can't even train on a single example (out of ~30 training examples). Maybe he split the data into tiles, although I've looked into his code and didn't see anything like that.

So what I'm saying is: to overcome this gap, I'll split the data into tiles, as is often done in 2D deep learning, instead of letting the hardware dictate that I can train on just one example, or train a shallow network with a small in_radius that could give worse results.

@HuguesTHOMAS (Owner)

What I have trouble understanding is how you think splitting the data into tiles will help you. The input pipeline already picks small sub-spheres in the big DALES point clouds, which are the same as tiles except that they are not fixed in advance. With tiles you would do exactly the same thing, picking small sub-tiles in the big point clouds, and you would have the same memory issues if the tiles are too big; you would have to reduce the tile size exactly as you have to reduce the sphere radius now. Or maybe there is something I did not understand in what you are planning to do?
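For intuition, a minimal sketch of the random-sphere picking described above; this is a generic reconstruction using scikit-learn's KDTree, not the repository's actual sampler:

```python
import numpy as np
from sklearn.neighbors import KDTree

def random_sphere(points, tree, in_radius, rng):
    """Pick a random center in the cloud and return every point within
    in_radius of it: the 3D analogue of cropping a tile, except the
    'tile' is redrawn at a new location every time."""
    center = points[rng.integers(len(points))]
    idx = tree.query_radius(center[None, :], r=in_radius)[0]
    return points[idx]

# Usage: draw one training input from a big DALES-scale cloud.
rng = np.random.default_rng(0)
cloud = rng.random((1_000_000, 3)) * 100.0  # stand-in point cloud
tree = KDTree(cloud)                        # built once per cloud
sphere = random_sphere(cloud, tree, in_radius=10.0, rng=rng)
```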

@eyalol (Author) commented Feb 16, 2021

OK, I'll explain. If I'm not wrong, in many domains of deep learning (consider 2D vision tasks for a moment), images are cropped into patches and preprocessed on the CPU, then fed to the GPU one by one. Storage-wise, you "fill" the GPU with only one patch at a time; the rest is queued on the CPU.

So take that idea here: you load all the data on the CPU, then each time move one sub-sphere to the device with something like subsphere.to(device), run it through the net, and so forth for a set number of iterations. This way you only fill the GPU a little at a time, avoiding the CUDA error.
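A minimal PyTorch sketch of that loop, with placeholder stand-ins for the model and data (an illustration of the idea above, not code from this repository):

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-ins for illustration only; a real run would use the KPConv
# model and a DataLoader yielding preprocessed sub-spheres.
model = nn.Linear(3, 8).to(device)  # placeholder per-point classifier
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loader = [(torch.rand(1024, 3), torch.randint(8, (1024,)))
          for _ in range(4)]        # fake CPU-side sub-spheres

for points, labels in loader:
    points = points.to(device)  # only the current sub-sphere
    labels = labels.to(device)  # occupies GPU memory at any time
    logits = model(points)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```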

If I'm wrong here, I'd like an explanation; I'm not a deep learning expert yet, but an enthusiastic student :)
Best wishes!
Eyal

@HuguesTHOMAS (Owner)

OK, so if I understand correctly, what you are referring to is close to the concept of a minibatch, but instead of dividing a batch before sending it to the GPU, you divide your input along its spatial dimensions. I have not heard of many methods doing that during training, so I am not an expert, but I would say that this would raise several issues (border effects, backpropagation compatibility, slow computation...). It would also, in my opinion, be very hard to implement in the KPConv framework, as the convolutions are not as easily defined as in images, and the batches have variable size. You could end up spending one or two months implementing something like that and, in the end, find that it does not even improve performance.

For all these reasons, I would not recommend following this idea. That being said, it is your work and your project, so you can do whatever you want.

Best,
Hugues
