Gpu config api #684

at88mph · 2024-08-15T17:21:30Z

Configurable GPU using gpu-count:<gpu-vendor>
- Set the NVIDIA_CUDA_MAJOR_VERSION environment variable in User Sessions by querying Kubernetes
Cleanup: use a single template so that each job launch file no longer needs to be modified every time
Added a local lookup of the SKAHA_SERVICE_ID environment variable so that integration tests can run


brianmajor · 2024-08-29T23:15:58Z

skaha/src/main/java/org/opencadc/skaha/session/PostAction.java

+
+        try {
+            final int majorNVIDIACUDAVersion = CommandExecutioner.getMajorNvidiaCudaGPUVersion();
+            jobLaunchString = setConfigValue(jobLaunchString, SOFTWARE_GPU_NVIDIA_CUDA_MAJOR_VERSION, Integer.toString(majorNVIDIACUDAVersion));


Users can already get the GPU version through their software, but this may still be useful; I'm not sure.

The general idea is to allow users to land on the right GPU (brand, version, gpu-core count). I think the ideas from ExecutionBroker are useful here and will help us align with that potential integration.

We currently expose the content of k8s-resources at /context, so this is just a static config reflecting the underlying capabilities of the cluster. Ideally, those values should come from the cluster itself. However, that is probably beyond the scope of this story, as is adding the 'brokering' part of client interaction.

So I think, for now at least, the story is simply to let users specify those three GPU conditions through API params. I haven't gone through this whole PR yet, but I'm guessing that a lot of that is already there. Let's chat about it tomorrow.

at88mph · 2024-08-30

Thanks. According to CADC-13476, we wanted the major CUDA version supplied so that scripts can look it up.
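To illustrate the lookup described above, here is a minimal sketch of how code running in a User Session might read the variable. Only the name NVIDIA_CUDA_MAJOR_VERSION comes from this PR; the class name, the method name, and the choice to treat a missing or non-numeric value as "unknown" are assumptions made for illustration.

```java
import java.util.Map;
import java.util.OptionalInt;

final class CudaVersionLookup {
    // Variable name taken from the PR description.
    static final String ENV_NAME = "NVIDIA_CUDA_MAJOR_VERSION";

    private CudaVersionLookup() {
    }

    /**
     * Returns the major CUDA version found in the given environment.
     * Assumption (not from the PR): an unset, blank, negative, or
     * non-numeric value is reported as empty rather than as an error.
     */
    static OptionalInt majorCudaVersion(final Map<String, String> env) {
        final String raw = env.get(ENV_NAME);
        if (raw == null || raw.isBlank()) {
            return OptionalInt.empty();
        }
        try {
            final int version = Integer.parseInt(raw.trim());
            return version >= 0 ? OptionalInt.of(version) : OptionalInt.empty();
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(final String[] args) {
        final OptionalInt version = majorCudaVersion(System.getenv());
        if (version.isPresent()) {
            System.out.println("CUDA major version: " + version.getAsInt());
        } else {
            System.out.println(ENV_NAME + " is not set or is not a number.");
        }
    }
}
```

Passing the environment in as a map, rather than calling System.getenv() inside the method, keeps the parsing easy to test without a GPU node.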

Also, there should be two (2) parameters specified: the gpu-type parameter and the gpus (count) parameter.
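As a rough sketch of what accepting those two parameters could look like, the class below parses gpu-type and gpus from a map of request parameters. The parameter names come from the comment above; the GpuRequest class, its validation rules (both parameters required together, count of at least 1), and the "nvidia" value are hypothetical and are not taken from the PR.

```java
import java.util.Map;

final class GpuRequest {
    // Parameter names taken from the discussion above.
    static final String TYPE_PARAM = "gpu-type";
    static final String COUNT_PARAM = "gpus";

    final String type;
    final int count;

    private GpuRequest(final String type, final int count) {
        this.type = type;
        this.count = count;
    }

    /**
     * Builds a request from raw parameters.
     * Assumption (not from the PR): both parameters must be present,
     * and the count must be an integer of at least 1.
     */
    static GpuRequest fromParams(final Map<String, String> params) {
        final String type = params.get(TYPE_PARAM);
        final String rawCount = params.get(COUNT_PARAM);
        if (type == null || type.isBlank()) {
            throw new IllegalArgumentException(TYPE_PARAM + " is required");
        }
        if (rawCount == null) {
            throw new IllegalArgumentException(COUNT_PARAM + " is required");
        }
        final int count;
        try {
            count = Integer.parseInt(rawCount.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(COUNT_PARAM + " must be an integer: " + rawCount, e);
        }
        if (count < 1) {
            throw new IllegalArgumentException(COUNT_PARAM + " must be at least 1");
        }
        return new GpuRequest(type.trim(), count);
    }

    public static void main(final String[] args) {
        // "nvidia" is an illustrative value only.
        final GpuRequest request = fromParams(Map.of(TYPE_PARAM, "nvidia", COUNT_PARAM, "2"));
        System.out.println(request.count + " x " + request.type);
    }
}
```

Rejecting bad input with IllegalArgumentException is one possible design; the real service may map invalid parameters to an HTTP error in a different way.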


…nto gpu-config-api

Conflicts:
- deployment/helm/skaha/Chart.yaml
- deployment/helm/skaha/skaha-config/launch-desktop.yaml
- deployment/helm/skaha/values.yaml
- skaha/VERSION
- skaha/src/intTest/java/org/opencadc/skaha/DesktopAppLifecycleTest.java
- skaha/src/intTest/java/org/opencadc/skaha/ExpiryTimeRenewalTest.java
- skaha/src/intTest/java/org/opencadc/skaha/ImagesTest.java
- skaha/src/intTest/java/org/opencadc/skaha/SessionLifecycleTest.java
- skaha/src/intTest/java/org/opencadc/skaha/SessionUtil.java
- skaha/src/main/java/org/opencadc/skaha/session/PostAction.java
- skaha/src/main/java/org/opencadc/skaha/session/SessionAction.java

at88mph added 15 commits August 2, 2024 14:50

First pass to allow gpu-type parameter

7f765d7

Merge branch 'fixes' into gpu-config-api

471603e

Merge branch 'fixes-init-users' into gpu-config-api

414fca7

Merge branch 'fixes-launch-config' into gpu-config-api

d20ec2b

Merge branch 'main' of https://github.com/opencadc/science-platform into gpu-config-api

7dd1918

Cleanup.

d94ee0f

Chart update. Depends on ephem-storage-config branch being merged.

3173446

Add command to get current CUDA driver version.

558751a

Consolidate helper templates and add GPU Version to environment.

b46455f

Merge branch 'main' of https://github.com/opencadc/science-platform into gpu-config-api

ecde569

Merge branch 'main' of https://github.com/opencadc/science-platform into gpu-config-api

82d4f5f

Conflicts:
- deployment/helm/skaha/Chart.yaml
- deployment/helm/skaha/templates/_helpers.tpl

Cleanup

5ac994c

Typo fix.

3508fce

Merge branch 'main' of https://github.com/opencadc/science-platform into gpu-config-api

f20f92a

Conflicts:
- deployment/helm/skaha/Chart.yaml
- deployment/helm/skaha/templates/_helpers.tpl

Image version update.

0adb12b

brianmajor reviewed Aug 29, 2024


brianmajor reviewed Aug 30, 2024


deployment/helm/skaha/skaha-config/launch-carta.yaml (outdated, resolved)

at88mph added 2 commits September 3, 2024 09:39

Merge branch 'main' of https://github.com/opencadc/science-platform into gpu-config-api

40ea9e2

Merge branch 'gpu-config-api' of https://github.com/at88mph/science-platform into gpu-config-api

fd46606

at88mph marked this pull request as ready for review September 3, 2024 16:45

at88mph added 2 commits September 9, 2024 09:46

Merge branch 'main' of https://github.com/opencadc/science-platform into gpu-config-api

ec56b61
