Things as they are now:
The batch jobs (cloud workers) take their requirements from the user if defined, and from the version_config otherwise
Local workers check nothing if the user did not specify requirements; they simply hope to have enough resources
GPU model, architecture, and memory are not meaningfully used (only a string match when pulling)
Should do:
If the user does not specify any requirements, the defaults from version_config should be used
Do the lambdas even need the version_config then? To be checked, but apparently they only get the hardware and the timeout from it: hardware is solved by the point above, and the timeout could be passed as well (and be specifiable by the user, filterable, and sortable)
The batch puller pulls based on the maximum it can offer; based on the job's requirements, it overrides the job definition defaults with the right instance type, memory, and vCPU count (which does not necessarily match cpu_cores)
For this, the batch environment needs a mapping from instance type to (max cpu cores, gpu model, gpu architecture, gpu memory), and a function that returns the cheapest instance type satisfying the given requirements
Local workers work as they do now (the cluster might need a similar mapping as well)
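A minimal sketch of the two pieces described above: falling back to the version_config defaults, and picking the cheapest instance type that satisfies the requirements. All names here (`Requirements`, `InstanceSpec`, `INSTANCE_SPECS`, `with_defaults`, `cheapest_instance_type`) are hypothetical, and the specs and prices in the table are placeholders that would need to be checked against the actual AWS data.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass(frozen=True)
class Requirements:
    """Hypothetical requirement fields; None means "not specified"."""
    cpu_cores: Optional[int] = None
    memory_mib: Optional[int] = None
    gpu_model: Optional[str] = None
    gpu_architecture: Optional[str] = None
    gpu_memory_mib: Optional[int] = None


def with_defaults(user: Optional[Requirements], defaults: Requirements) -> Requirements:
    """Fill every field the user left unspecified from the version_config defaults."""
    if user is None:
        return defaults
    merged = {}
    for f in fields(Requirements):
        value = getattr(user, f.name)
        merged[f.name] = value if value is not None else getattr(defaults, f.name)
    return Requirements(**merged)


@dataclass(frozen=True)
class InstanceSpec:
    max_cpu_cores: int
    memory_mib: int
    gpu_model: str
    gpu_architecture: str
    gpu_memory_mib: int
    hourly_price: float  # placeholder value, only used for ordering


# Illustrative table only: specs and prices are assumptions, not verified data.
INSTANCE_SPECS: Dict[str, InstanceSpec] = {
    "g4dn.xlarge": InstanceSpec(4, 16384, "T4", "Turing", 16384, 0.5),
    "g5.xlarge": InstanceSpec(4, 16384, "A10G", "Ampere", 24576, 1.0),
    "p3.2xlarge": InstanceSpec(8, 62464, "V100", "Volta", 16384, 3.0),
}


def satisfies(spec: InstanceSpec, req: Requirements) -> bool:
    """True if the instance offers at least what the requirements ask for."""
    if req.cpu_cores is not None and spec.max_cpu_cores < req.cpu_cores:
        return False
    if req.memory_mib is not None and spec.memory_mib < req.memory_mib:
        return False
    if req.gpu_model is not None and spec.gpu_model != req.gpu_model:
        return False
    if req.gpu_architecture is not None and spec.gpu_architecture != req.gpu_architecture:
        return False
    if req.gpu_memory_mib is not None and spec.gpu_memory_mib < req.gpu_memory_mib:
        return False
    return True


def cheapest_instance_type(
    req: Requirements, specs: Dict[str, InstanceSpec] = INSTANCE_SPECS
) -> Optional[str]:
    """Return the cheapest instance type meeting the requirements, or None."""
    candidates = [name for name, spec in specs.items() if satisfies(spec, req)]
    if not candidates:
        return None
    return min(candidates, key=lambda name: specs[name].hourly_price)
```

With such a table, the puller could resolve the requirements first (`with_defaults(user_req, version_defaults)`) and then map them to an instance type; a `None` result would mean that no configured instance type can run the job.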
Also, do we want to have multiple types of GPUs running, i.e., multiple job definitions (and pulling) in AWS Batch?
At submit_job, one can specify the instance to use, ...
Haydnspass changed the title from "Hardware specs" to "Bind lambda puller to batch hardware specs" on Sep 15, 2023
There is a bit of confusion as to what "Hardware" means at the moment. Maybe we should also have default hardware per version.
Note that memory can be overridden when submitting a job: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/batch/client/submit_job.html
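As a rough sketch of what such an override could look like, the helper below builds the keyword arguments for `submit_job`, using the `resourceRequirements` entries of `containerOverrides` (types `VCPU`, `MEMORY`, `GPU`, with string values and memory in MiB). The helper name `build_submit_job_kwargs` and the job, queue, and definition names are hypothetical; the actual boto3 call is left commented out so the snippet runs without AWS credentials.

```python
from typing import Any, Dict, Optional


def build_submit_job_kwargs(
    job_name: str,
    job_queue: str,
    job_definition: str,
    *,
    vcpus: Optional[int] = None,
    memory_mib: Optional[int] = None,
    gpus: Optional[int] = None,
) -> Dict[str, Any]:
    """Build submit_job kwargs that override the job definition's resources."""
    resource_requirements = []
    if vcpus is not None:
        resource_requirements.append({"type": "VCPU", "value": str(vcpus)})
    if memory_mib is not None:
        resource_requirements.append({"type": "MEMORY", "value": str(memory_mib)})
    if gpus is not None:
        resource_requirements.append({"type": "GPU", "value": str(gpus)})

    kwargs: Dict[str, Any] = {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
    }
    # Only send overrides when something deviates from the job definition.
    if resource_requirements:
        kwargs["containerOverrides"] = {"resourceRequirements": resource_requirements}
    return kwargs


kwargs = build_submit_job_kwargs(
    "example-job", "example-queue", "example-job-definition",
    vcpus=4, memory_mib=16384, gpus=1,
)
# import boto3
# boto3.client("batch").submit_job(**kwargs)
```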