Jobs fail due to memory usage #88
Comments
@rdvelazquez I've been looking through dask/dask-searchcv#33 and playing a bit with memory_profiler. Aiming to put up a WIP with some monitoring tools by the end of the week.
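As a note on the monitoring idea above, here is a minimal sketch of measuring peak memory during a fit with memory_profiler. The classifier and dataset are placeholders, not the project's actual pipeline.

```python
# Sketch: sample memory while a fit runs, using memory_profiler.
# The estimator/data below are placeholders for illustration only.
from memory_profiler import memory_usage
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=5000, n_features=200, random_state=0)
clf = SGDClassifier(random_state=0)

# memory_usage runs the callable and returns a list of memory samples in MiB,
# taken every `interval` seconds; the max approximates the peak of the fit.
samples = memory_usage((clf.fit, (X, y)), interval=0.1)
print("Peak memory during fit: {:.1f} MiB".format(max(samples)))
```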
I think the instances are currently 2 GB. What about increasing them to 8 GB? Paging @kurtwheeler, who I think can make the change.
We're actually at 8 GB already, for both instances. How much do we think we'll need? I can bump them up to 16 GB if really needed, although that's starting to get a bit expensive. A cheaper alternative could be to switch to one large ml-worker, and then just have a second, smaller instance run the second copy of the core-service for high availability. Do the ml-workers only run one job at a time? Is it possible that one of them is picking up more than one job at a time and this is causing the memory issues?
I can run the version of notebook 2 from the machine-learning repo on my PC, which has 8 GB of RAM, without issue (and no swapping). I wonder what the available RAM on the AWS instance is; is there a way to easily find that out? Also, I think the AWS instance's Docker image installs requirements.txt with pip rather than conda (not sure whether that affects memory overhead or classifier memory usage). Another potential difference may be how the data is downloaded in notebook 1 on AWS.
The ml-workers should only be running one job at a time, even when multiple requests come in at once. I think we've also reproduced this problem enough times that it's unlikely multiple requests were occurring at once each time.
I would think that pip vs. conda shouldn't make any difference in memory overhead, and even if it did, it would be negligible. I'm not 100% sure of that, but it probably shouldn't be the first thing we investigate. I double-checked the available RAM on the AWS instance and it does in fact appear to be 8 GB:
That memory is being shared between the
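For reference, one quick way to check total and available RAM from inside a running container is sketched below. This assumes psutil is installed, which is not necessarily part of the repo's requirements.

```python
# Sketch: report total and currently available RAM, assuming psutil is available.
import psutil

mem = psutil.virtual_memory()
print("Total RAM:     {:.1f} GB".format(mem.total / 1e9))
print("Available RAM: {:.1f} GB".format(mem.available / 1e9))
```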
Any updates on the memory issue? @patrick-miller is planning to migrate the revised machine-learning version of notebook 2 to the ml-workers repo (tagging #110). Maybe after the revised notebook is in production we can try a query with all samples to check whether it mitigates the memory issue and, if not, evaluate increasing the RAM to 16 GB... at least while we look for another solution?
Did we increase the RAM? Is this still an issue?
When fitting models on all disease types, it's common for the job to exceed its memory allotment and fail.
We can increase the instance size as a first step. If that becomes cost prohibitive, we can consider changes to our `dask-searchcv` configuration. Currently, we use the default `cache_cv=True`. This really speeds things up, so setting `cache_cv=False` would not be ideal.