Bind lambda puller to batch hardware specs #37

Open
5 tasks
nolan1999 opened this issue Sep 13, 2023 · 2 comments

nolan1999 commented Sep 13, 2023

There is a bit of confusion as to what "Hardware" means at the moment:

  • the user can specify the hardware they need, so as to filter the workers
  • version_config defines how much hardware a version needs, as specs for the AWS job definition
  • the AWS job puller's config also has hardware specs, to pull jobs. This is kind of a duplicate of the ones in version_config...

Maybe we should also have default hardware per version.

Note that memory can be overridden when submitting a job: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/batch/client/submit_job.html
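
A minimal sketch of what such an override could look like with boto3's Batch `submit_job`, which accepts `containerOverrides.resourceRequirements` entries of type `VCPU`, `MEMORY` (in MiB), and `GPU`. The helper name `build_submit_job_kwargs` and its parameters are assumptions for illustration and are not part of this project:

```python
from typing import Any, Dict, Optional


def build_submit_job_kwargs(
    job_name: str,
    job_queue: str,
    job_definition: str,
    memory_mib: Optional[int] = None,
    vcpus: Optional[int] = None,
    gpus: Optional[int] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for boto3's Batch `submit_job` (hypothetical helper).

    Only the requirements that are set override the job definition defaults.
    """
    requirements = []
    if vcpus is not None:
        requirements.append({"type": "VCPU", "value": str(vcpus)})
    if memory_mib is not None:
        # Batch expects the memory value in MiB, passed as a string.
        requirements.append({"type": "MEMORY", "value": str(memory_mib)})
    if gpus is not None:
        requirements.append({"type": "GPU", "value": str(gpus)})

    kwargs: Dict[str, Any] = {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
    }
    if requirements:
        kwargs["containerOverrides"] = {"resourceRequirements": requirements}
    return kwargs


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# boto3.client("batch").submit_job(
#     **build_submit_job_kwargs("job", "queue", "jobdef", memory_mib=8192)
# )
```

Building the arguments in a pure function keeps the override logic testable without calling AWS.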

Things as they are now:

  • The user might specify hardware requirements
  • The batch jobs (cloud workers) get requirements from the user (if defined) or from the version_config (if not defined)
  • Local workers just hope to have enough hardware if the user did not specify anything
  • GPU model, architecture, memory are not meaningfully used (string match when pulling)

Should do:

  • If the user does not specify any requirements, the defaults from version_config should be used
  • Do the lambdas even need the version_config then? To be checked, but apparently they only get HW and timeout from it: HW is solved by the point above, and timeout could be passed as well (and be specifiable by the user, filterable, and sortable)
  • The batch puller pulls based on the maximum it can offer; based on the requirements, it overrides the job definition defaults with the right instance type, memory, and vcpu (which does not necessarily match cpu_cores...)
  • For this, the batch environment needs a mapper instance type => (max cpu cores, gpu model, gpu architecture, gpu memory), and a function that, based on the requirements, returns the cheapest instance type to use
  • Local workers work as they do now (the cluster might need a similar mapping as well)
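
The points above could be sketched roughly as follows. Everything here is an assumption for illustration: the instance type names, specs, and prices are placeholders rather than real AWS values, the requirement keys (`cpu_cores`, `memory`, `gpu_model`, `gpu_archi`, `gpu_mem`) are guessed, and the fallback to version_config defaults is done per key, which is only one possible reading of the first point.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class InstanceSpec:
    cpu_cores: int
    memory_mib: int
    gpu_model: Optional[str]
    gpu_archi: Optional[str]
    gpu_mem_mib: int
    price_per_hour: float


# Placeholder table: names, specs, and prices are illustrative only.
INSTANCE_SPECS: Dict[str, InstanceSpec] = {
    "cpu.small": InstanceSpec(4, 16_000, None, None, 0, 0.2),
    "gpu.small": InstanceSpec(4, 16_000, "model-a", "archi-a", 16_000, 0.5),
    "gpu.large": InstanceSpec(8, 32_000, "model-b", "archi-b", 24_000, 1.0),
}


def resolve_requirements(user_hw: Dict, version_defaults: Dict) -> Dict:
    """Per key, use the user's value if given, else the version default."""
    resolved = dict(version_defaults)
    resolved.update({k: v for k, v in user_hw.items() if v is not None})
    return resolved


def satisfies(spec: InstanceSpec, req: Dict) -> bool:
    """Check whether an instance type can serve the given requirements."""
    if spec.cpu_cores < (req.get("cpu_cores") or 0):
        return False
    if spec.memory_mib < (req.get("memory") or 0):
        return False
    if spec.gpu_mem_mib < (req.get("gpu_mem") or 0):
        return False
    for key in ("gpu_model", "gpu_archi"):
        wanted = req.get(key)
        if wanted is not None and getattr(spec, key) != wanted:
            return False
    return True


def cheapest_instance_type(
    req: Dict, specs: Dict[str, InstanceSpec] = INSTANCE_SPECS
) -> Optional[str]:
    """Return the cheapest instance type that satisfies req, or None."""
    candidates = [name for name, spec in specs.items() if satisfies(spec, req)]
    return min(candidates, key=lambda n: specs[n].price_per_hour, default=None)


req = resolve_requirements({"gpu_mem": 20_000}, {"cpu_cores": 2, "memory": 8_000})
print(cheapest_instance_type(req))
```

The chosen instance type, together with the resolved memory and vcpu, would then be what the batch puller uses to override the job definition defaults.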

nolan1999 commented Sep 13, 2023

Also, do we want to have multiple types of GPUs running, i.e., multiple job definitions (and pulling) in AWS Batch?
At submit_job, one can specify the instance to use, ...

Haydnspass changed the title from "Hardware specs" to "Bind lambda puller to batch hardware specs" on Sep 15, 2023


nolan1999 commented Dec 23, 2023

We will somehow need a mapper from gpu_archi to instances...
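
One possible shape for such a mapper, as a sketch: invert a table of instance type to GPU architecture into an index keyed by gpu_archi. The table contents and the function name `instances_by_gpu_archi` are hypothetical placeholders, not real instance data.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical table: instance type -> GPU architecture (None for CPU-only).
INSTANCE_GPU_ARCHI: Dict[str, Optional[str]] = {
    "cpu.small": None,
    "gpu.small": "archi-a",
    "gpu.medium": "archi-a",
    "gpu.large": "archi-b",
}


def instances_by_gpu_archi(
    table: Dict[str, Optional[str]]
) -> Dict[str, List[str]]:
    """Group instance types by GPU architecture, skipping CPU-only ones."""
    index: Dict[str, List[str]] = defaultdict(list)
    for instance_type, archi in table.items():
        if archi is not None:
            index[archi].append(instance_type)
    return dict(index)


print(instances_by_gpu_archi(INSTANCE_GPU_ARCHI))
```

Deriving the index from a single table avoids keeping two mappings in sync by hand.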
