[SDK] Snapshot users' workspace into distributed TrainJob workload #2347

andreyvelich · 2024-12-10T00:18:45Z

What you would like to be added?

As we discussed earlier in #2324 (comment), we want to design an approach for snapshotting a user's workspace into a TrainJob (i.e. a distributed ML workload).
To achieve this, we plan to generate a unique TrainJob ID before submitting the job to the Kubernetes control plane.

During the KubeCon 2024 demo, we demonstrated how workspace snapshotting might work: https://youtu.be/Lgy4ir1AhYw?t=458.
In this demo, we pushed Python code files into S3 and then loaded them into TrainJob using initContainers.
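
The client side of such a flow could look roughly like the sketch below: pack the Python files of the workspace into a gzip-compressed tar archive and derive an object key from the TrainJob ID. This is not the code used in the demo. The function names, the `*.py` filter, and the `snapshots/<id>/workspace.tar.gz` key layout are all assumptions made for illustration, and the upload itself is left out.

```python
import io
import tarfile
from pathlib import Path


def snapshot_workspace(workspace: str, pattern: str = "*.py") -> bytes:
    """Hypothetical helper: pack matching workspace files into a tar.gz.

    Paths inside the archive are relative to the workspace root, so an
    initContainer could unpack them into any target directory.
    """
    root = Path(workspace)
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path in sorted(root.rglob(pattern)):
            if path.is_file():
                arcname = path.relative_to(root).as_posix()
                archive.add(str(path), arcname=arcname)
    return buffer.getvalue()


def snapshot_key(trainjob_id: str) -> str:
    """Hypothetical key layout: one snapshot per TrainJob ID."""
    return f"snapshots/{trainjob_id}/workspace.tar.gz"


if __name__ == "__main__":
    data = snapshot_workspace(".")
    print(snapshot_key("trainjob-example"), len(data), "bytes")
```

The resulting bytes could then be uploaded with any S3 client, and an initContainer would download and unpack the archive into a volume shared with the training containers.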

However, we can consider various approaches, for instance:

Using a distributed cache.
Using kubectl cp.
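
As a rough illustration of the `kubectl cp` option, the sketch below only builds the command line for copying a local workspace directory into a running pod; it does not execute anything. The helper name and the example pod, namespace, and path values are hypothetical. Note that `kubectl cp` requires the `tar` binary to be present in the target container image, and the pod must already be running, which differs from the initContainers approach.

```python
from typing import List, Optional


def build_kubectl_cp_command(
    workspace: str,
    namespace: str,
    pod: str,
    dest: str,
    container: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: argv for copying a workspace into a pod.

    Follows the form `kubectl cp <src> <namespace>/<pod>:<dest>`, with an
    optional `-c <container>` to select the target container.
    """
    command = ["kubectl", "cp", workspace, f"{namespace}/{pod}:{dest}"]
    if container is not None:
        command += ["-c", container]
    return command


if __name__ == "__main__":
    # All values below are placeholders, not real resources.
    argv = build_kubectl_cp_command(
        "./workspace", "default", "example-pod-0", "/workspace", "trainer"
    )
    print(" ".join(argv))
```

The returned list can be passed to `subprocess.run(argv, check=True)` once for each pod of the TrainJob.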

Why is this needed?

This should streamline the user experience for data scientists working with the Kubeflow Training Python SDK.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

andreyvelich · 2024-12-10T00:18:58Z

cc @shravan-achar @akshaychitneni

andreyvelich added kind/feature lifecycle/needs-triage labels Dec 10, 2024

andreyvelich added area/sdk and removed lifecycle/needs-triage labels Dec 10, 2024

andreyvelich mentioned this issue Dec 10, 2024

KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK #2324

Merged
