Problem
Currently, the CANFAR Kubernetes cluster allows for the creation of jobs with the following underlying resources:
CPU [1, 16]
Memory [1, 192] GB
GPU [0, 28]
To better optimize cluster usage, we could employ one of the following strategies:
Cluster-Wide Reservation
By default, we can choose to reserve resources in proportion to the resources that remain unused.
For example, in a cluster with 100 CPUs and 1000 GB of memory, we can reserve 4 GB of memory for each vacant CPU. Here is a nominal example:
Cluster Utilization:
CPU: 70%
Memory: 100%
In a scenario like this, we always keep a stopgap of vacant cores × 4 GB of memory on reserve in the cluster. This allows us to run CPU-intensive jobs even while the cluster is otherwise starved for memory.
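Working through the numbers above: at 70% CPU utilization, 30 of the 100 CPUs are vacant, so 30 × 4 GB = 120 GB of memory stays on reserve; even when the rest of the memory is fully consumed, an incoming CPU-intensive job can still be scheduled against that reserve.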
Improved Job Reservation
Currently, when a job is spawned, it reserves a fixed, maximum amount of resources. For example, if we spawn a job with 4 CPUs and 64 GB of memory, those resources are removed from the cluster's available pool. In reality, the job may not need all of those resources at all times. We can improve the reservation strategy by allowing jobs to define a minimum and a maximum amount of resources they need, which would let the cluster utilize its resources better.
For a job like the one described below, the user is guaranteed 1 CPU and 16 GB of memory and is allowed to use a maximum of 4 CPUs and 64 GB of memory.
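As an illustration (not the exact manifest), a minimal Kubernetes Job sketch that expresses the minimum as requests and the maximum as limits could look like the following; the job name, container name, and image are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-burstable-job        # placeholder name
spec:
  template:
    spec:
      containers:
        - name: worker               # placeholder container name
          image: example/worker:latest   # placeholder image
          resources:
            requests:                # guaranteed minimum
              cpu: "1"
              memory: 16Gi
            limits:                  # allowed maximum
              cpu: "4"
              memory: 64Gi
      restartPolicy: Never
```

With this shape, the scheduler places the pod based on its requests, while the limits only cap how far it can burst when spare capacity is available on the node.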