GA3C source code has High CPU usage causing System freeze or crash #20
Comments
That's an interesting observation. I've tested the code on a Maxwell TITAN X myself and didn't observe such behavior. Could you please share the versions of your libraries (Python, TensorFlow, CUDA, ...)? My (blind) guess is that this is a problem with TensorFlow. It would also help if you shared your motherboard spec, since PCI-E is the bottleneck here. Two side notes:
|
It would also be interesting to know whether the number of agents is increasing during training. That could explain the increase in CPU usage.
On Mar 19, 2017, at 11:36 AM, Mohammad Babaeizadeh wrote:
Two side notes:
1. The low memory usage is due to the small model size. Note that neither A3C nor GA3C has any "experience memory", so they do not use GPU memory as storage; the only stored object is the model itself. I would be interested in your GPU utilization, though (check with the nvidia-smi command).
2. The current version of the code is single-GPU, so you currently cannot use more than one GPU.
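For anyone wanting to log the GPU utilization asked about in note 1 over a long run, here is a minimal sketch that polls nvidia-smi from Python. The helper names `parse_gpu_stats` and `query_gpus` are made up for illustration and are not part of GA3C; the sketch assumes nvidia-smi is on the PATH and that every queried field comes back as a number (some GPUs may report `[N/A]`, which this parser does not handle).

```python
import subprocess

# Fields passed to nvidia-smi's --query-gpu option.
QUERY_FIELDS = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output, one GPU per line."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (field.strip() for field in line.split(","))
        stats.append({
            "util_pct": int(util),        # GPU utilization in percent
            "mem_used_mib": int(used),    # memory in use, MiB
            "mem_total_mib": int(total),  # total memory, MiB
        })
    return stats


def query_gpus():
    """Run nvidia-smi once and return one dict per visible GPU."""
    output = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_stats(output)
```

Calling `query_gpus()` every few seconds from a separate script and appending the result to a log file would show whether the GPU stays mostly idle while CPU usage climbs.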
|
@ifrosio that's a very good point. @developeralgo8888 please try with |
Please find attached. I restarted the run, and it has started increasing as we go. |
With DYNAMIC_SETTINGS=False, the CPU usage remains stable, but there is a memory leak: memory keeps increasing until the system freezes. I have attached snapshots taken roughly 12 hours apart. |
The code runs fine but leaks CPU and memory and will crash your system. I am using the Glances monitoring tool (pip install glances). If you leave the code running for a long time, CPU context switches increase substantially, and CPU and memory usage keep climbing until the code hangs or crashes. CPU usage increased from 6.7% to 64% and memory from 10% to 79%, at which point the system froze. Meanwhile, the NVIDIA TITAN X (Maxwell, 12 GB) is only using about 300 MB out of 12 GB. So although most of the heavy lifting should be offloaded to the GPU, that does not seem to be happening. I have 8 x TITAN X Maxwell GPUs with 2 x Intel Xeon 2660 v3 (2 CPUs with 40 CPU cores in total) and 128 GB of DDR4 memory, and I can use any of them. I still get the same results: the CPU usage keeps increasing.
Any insights?
Other implementations, such as the original A3C and various hybrid (CPU and GPU) versions, seem to offload most of the heavy lifting to the GPU and cause no system freezes, unlike GA3C.
I am testing it on various amounts of data and games.
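To put numbers on the CPU, memory, and context-switch growth described above without an external monitor, a process can sample its own counters with Python's standard `resource` module. This is only a sketch: `sample_usage` and `growth` are hypothetical helpers, not GA3C code, the module is Unix-only, and `RUSAGE_SELF` covers just the calling process, so if the agents run as separate processes each one would need to log its own samples.

```python
import resource
import time


def sample_usage():
    """Snapshot CPU time, peak memory, and context switches for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "time": time.time(),
        "cpu_seconds": usage.ru_utime + usage.ru_stime,  # user + system CPU time
        "peak_rss": usage.ru_maxrss,  # peak resident set size (KiB on Linux, bytes on macOS)
        "voluntary_ctx_switches": usage.ru_nvcsw,
        "involuntary_ctx_switches": usage.ru_nivcsw,
    }


def growth(before, after):
    """Per-field difference between two samples taken by sample_usage()."""
    return {key: after[key] - before[key] for key in before}
```

Taking a sample every few minutes and printing `growth(previous, current)` would show whether peak memory and context switches rise steadily, which is the pattern a leak would produce.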