Problem
Currently, the CANFAR Kubernetes cluster allows for the creation of jobs with the following underlying resources:
CPU [1, 16]
Memory [1, 192] GB
GPU [0, 28]
To better optimize cluster usage, we could employ one of the following strategies:
Cluster-Wide Reservation
By default, we can choose to reserve resources in proportion to the resources that remain unused.
For example, in a cluster with 100 CPUs and 1000 GB of memory, we can reserve 4 GB of memory for each vacant CPU. Here is a nominal example:
Cluster Utilization:
CPU: 70%
Memory: 100%
In a scenario like this, we always keep a stopgap of vacant cores × 4 GB of memory on reserve in the cluster. This allows us to run CPU-intensive jobs even while the cluster is otherwise starved for memory.
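Working through the numbers above: at 70% CPU utilization, 30 of the 100 CPUs are vacant, so 30 × 4 GB = 120 GB of memory stays on reserve; even when the rest of the memory is fully consumed, an incoming CPU-intensive job can still be scheduled against that reserve.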
Improved Job Reservation
Currently, when a job is spawned, it reserves a fixed, maximum amount of resources. For example, if we spawn a job with 4 CPUs and 64 GB of memory, those resources are removed from the cluster's available pool. In reality, the job may not need all of those resources at all times. We can improve the reservation strategy by allowing jobs to define a minimum and a maximum amount of resources they need, which would let the cluster utilize its resources better.
For a job like the one described below, the user is guaranteed 1 CPU and 16 GB of memory and is allowed to use a maximum of 4 CPUs and 64 GB of memory.
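As an illustration (not the exact manifest), a minimal Kubernetes Job sketch that expresses the minimum as requests and the maximum as limits could look like the following; the job name, container name, and image are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-burstable-job        # placeholder name
spec:
  template:
    spec:
      containers:
        - name: worker               # placeholder container name
          image: example/worker:latest   # placeholder image
          resources:
            requests:                # guaranteed minimum
              cpu: "1"
              memory: 16Gi
            limits:                  # allowed maximum
              cpu: "4"
              memory: 64Gi
      restartPolicy: Never
```

With this shape, the scheduler places the pod based on its requests, while the limits only cap how far it can burst when spare capacity is available on the node.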