train baseline - memory consumption increases with iterations #157
Comments
Oh, that's bad! Are you actually seeing this fail as a result? Normally I see a RAM increase, but RLlib eventually clears it up somehow. I suspect this is more of an RLlib issue than an issue with this library. Would you mind reposting this in their GitHub issues?
Yes. Unfortunately, it typically fails on most longer runs. I'll repost on the RLlib GitHub as well. I'm currently getting this issue after simply cloning the current repo and running train_baseline.
Hi, that's really good to know! Thank you for updating us on this. I'll examine it as well when I get a chance, but I suspect it's an RLlib issue rather than something on our end. I don't think any memory persists across environment rollouts.
Sounds good. For now, could you share your current environment/setup, the one where RLlib frees the memory automatically, possibly as a Docker container?
When I run the train_baseline command in python3, the workers seem to consume more RAM with each iteration without freeing it.
On some runs, I have seen memory consumption grow by about 40 MB per iteration, which over, e.g., 10,000 iterations adds up to 400 GB. Since this is RAM, longer experiments become impossible to run.
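As a quick sanity check of that estimate, here is a minimal sketch that assumes the growth stays linear and uses decimal units (1 GB = 1000 MB). The function name `projected_memory_gb` is made up for illustration and is not part of this repo or RLlib.

```python
def projected_memory_gb(mb_per_iteration: float, iterations: int) -> float:
    """Project total memory growth in GB, assuming constant growth per iteration."""
    # Decimal units: 1 GB = 1000 MB (an assumption for this estimate).
    return mb_per_iteration * iterations / 1000.0


# Figures from the runs described above: 40 MB per iteration, 10,000 iterations.
print(projected_memory_gb(40, 10_000))  # 400.0
```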
Note that this seems to be separate from Ray's object store: upon termination, the object store was quite small (~20 MB), whereas each worker.__PolicyEvaluator() had ~2 GB allocated, with multiple such workers present.
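To narrow down where the growth comes from, something like the following could be used to log per-iteration heap growth inside a single process. This is only a sketch using Python's standard `tracemalloc` module, not an RLlib API: `measure_growth` and `leaky_step` are hypothetical names, the leak is simulated, and `tracemalloc` only sees Python-level allocations in the process where it runs, so it would have to be started inside each worker to say anything about the workers themselves.

```python
import tracemalloc


def measure_growth(step, iterations):
    """Call step() repeatedly and record the traced Python heap size (bytes) after each call."""
    tracemalloc.start()
    sizes = []
    for _ in range(iterations):
        step()
        current, _peak = tracemalloc.get_traced_memory()
        sizes.append(current)
    tracemalloc.stop()
    return sizes


# Simulated leak: every iteration keeps a reference to a new 1 MB buffer.
_leak = []


def leaky_step():
    _leak.append(bytearray(1_000_000))


sizes = measure_growth(leaky_step, 5)
# Difference between consecutive samples = growth attributable to one iteration.
growth = [after - before for before, after in zip(sizes, sizes[1:])]
print(growth)
```

If the per-iteration differences stay roughly constant and positive, memory is being retained across iterations; if they hover around zero, the growth is happening somewhere this process cannot see (another process or native allocations).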