Virtual memory usage is too large #25

Open
PrototypeNx opened this issue Sep 9, 2021 · 4 comments

@PrototypeNx

Hello, the model looks very good, but I ran into some problems when trying to train it myself.
My training environment is Windows 10, torch 1.8.0 + CUDA 11, an RTX 3090, and 32 GB of RAM.
With the default config, training aborts with a CUDA out-of-memory error, but the message is misleading because plenty of CUDA memory is still free. Tracking hardware resources during startup, I saw the committed memory keep climbing before training actually began until it overflowed; in other words, a huge amount of virtual memory is requested before the first iteration. When I lower the parameters so that training runs normally, the actual memory usage is very small, yet the virtual memory usage is still large, although it no longer reaches the limit hit before training started. I have never seen such a large virtual memory overhead, so I wonder whether there is a memory leak during preloading or preprocessing, and whether the program could be optimized.
Thank you!

@imlixinyang
Owner

imlixinyang commented Sep 9, 2021

I don't know the actual reason for your situation, but in my view several points could cause the problem:

  1. The data_prefetcher in https://github.com/imlixinyang/HiSD/blob/main/core/utils.py, which speeds up the data loader.
  2. cudnn.benchmark in https://github.com/imlixinyang/HiSD/blob/main/core/train.py.
  3. In each iteration, the choice of modules in HiSD is different, unlike previous single-path frameworks.
  4. The latent code may not be buffered in the same memory, which could be improved with register_buffer (see the sketch after this list).
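Not the actual HiSD code, but a minimal sketch of points 2 and 4 under assumed names (the `Mapper` class and `latent_dim` are made up for illustration): disable cudnn.benchmark and keep the latent code in a buffer registered with `register_buffer`:

```python
import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn

# Point 2: cudnn.benchmark auto-tunes convolution algorithms and may allocate
# extra workspace while probing them; turning it off trades some speed for
# lower, more predictable memory use.
cudnn.benchmark = False

class Mapper(nn.Module):
    """Toy module for point 4: keep the latent code as a registered buffer so
    it is stored with the module (moved by .to()/.cuda(), saved in state_dict)
    instead of being re-allocated every iteration."""

    def __init__(self, latent_dim=256):
        super().__init__()
        # register_buffer stores a non-trainable tensor on the module.
        self.register_buffer("latent_code", torch.randn(1, latent_dim))

    def forward(self, batch_size):
        # Reuse the buffered latent code for the whole batch.
        return self.latent_code.expand(batch_size, -1)

mapper = Mapper()
z = mapper(8)  # shape (8, 256), no new allocation of the base code
```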

The 32 GB of memory (more than two 1080Ti cards) is enough for the config file (celeba-hq256.yaml), so I am surprised to hear that it raises an OUT OF MEMORY fault. I hope you reproduce the results successfully soon, and you are welcome to share more information or solutions here. I will try my best to help you.

@PrototypeNx
Author

Thank you for such a quick reply; I will check the points you mentioned.
Sorry, I may not have expressed it clearly: the training failure is more likely caused by virtual memory than by CUDA memory.
I use an RTX 3090 with 24 GB of CUDA memory and 32 GB of RAM. With the default celebA-HQ.yaml config, virtual memory usage keeps rising before training starts, reaching 64 GB and overflowing, while CUDA memory and RAM usage stay very low. So I had to set the batch_size to 4 to train at all; the virtual memory then occupies about 40 GB. If I add a few more training attributes, the virtual memory unfortunately overflows again.

@PrototypeNx
Author

I switched to Ubuntu on the same hardware and configuration, and training ran without any problems. I think the cause of the issue above is the different virtual memory allocation mechanism between Linux and Windows. Thank you again for your help!

@imlixinyang
Owner

Glad to hear that, and you're always welcome to post here if there are any further problems.
