[SDK] Get the correct TrainJob components using get_job() API #2348

Open
andreyvelich opened this issue Dec 11, 2024 · 0 comments

andreyvelich commented Dec 11, 2024

What you would like to be added?

As we discussed, the get_job() API can currently return multiple Pods for a single TrainJob component, such as initializer or trainer-node-0: #2324 (comment). This can happen when Pods are re-created according to the Batch/Job restart policies.
As a result, users can see unexpected logs when using the Kubeflow Training SDK.

We should improve this API so that it shows users the correct TrainJob components.
For example, when we list all of the Pods, we can select, for each role (e.g. dataset-initializer), the most recently created Pod.
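
The selection described above could be sketched roughly as follows. This is only an illustration, not the SDK's implementation: `PodInfo` and `latest_pod_per_role` are hypothetical names, and the sketch assumes each Pod's role and creation timestamp have already been read from the listed Pods.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable


@dataclass(frozen=True)
class PodInfo:
    # Hypothetical stand-in for the Pod fields needed here (not an SDK type).
    name: str
    role: str  # e.g. "dataset-initializer" or "trainer-node-0"
    creation_timestamp: datetime


def latest_pod_per_role(pods: Iterable[PodInfo]) -> Dict[str, PodInfo]:
    """Keep only the most recently created Pod for each role."""
    latest: Dict[str, PodInfo] = {}
    for pod in pods:
        current = latest.get(pod.role)
        # Replace the stored Pod only if this one was created later.
        if current is None or pod.creation_timestamp > current.creation_timestamp:
            latest[pod.role] = pod
    return latest
```

With this approach, a Pod that was re-created after a restart replaces its older counterpart, so each component maps to exactly one Pod. If two Pods with the same role share a timestamp, the first one encountered is kept.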

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
