As we discussed in #2324 (comment), the get_job() API can currently return multiple Pods for a single TrainJob component, such as the initializer or trainer-node-0. This can happen when Pods are re-created according to the Batch/Job restart policy.
As a result, users can see unexpected logs while using the Kubeflow Training SDK.
We should improve this API to show the correct TrainJob components to users.
For example, when we list all of the Pods, we could keep only the most recently created Pod for each role (e.g. dataset-initializer).
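The selection described above could look roughly like the sketch below. It is only an illustration, not the SDK's implementation: PodInfo and latest_pod_per_role are hypothetical names, and the sketch assumes each Pod's role and creation timestamp have already been read from the listed Pods (how the role is derived, for instance from labels, is left open).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable


@dataclass(frozen=True)
class PodInfo:
    """Hypothetical stand-in for the Pod fields that matter here."""
    name: str
    role: str  # e.g. "dataset-initializer" or "trainer-node-0"
    creation_timestamp: datetime


def latest_pod_per_role(pods: Iterable[PodInfo]) -> Dict[str, PodInfo]:
    """Keep only the most recently created Pod for each role."""
    latest: Dict[str, PodInfo] = {}
    for pod in pods:
        current = latest.get(pod.role)
        # Replace the stored Pod only if this one was created later.
        if current is None or pod.creation_timestamp > current.creation_timestamp:
            latest[pod.role] = pod
    return latest


# Example data: the dataset-initializer Pod was re-created once.
pods = [
    PodInfo("init-abc", "dataset-initializer",
            datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)),
    PodInfo("init-def", "dataset-initializer",
            datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc)),
    PodInfo("trainer-xyz", "trainer-node-0",
            datetime(2024, 1, 1, 10, 6, tzinfo=timezone.utc)),
]
selected = latest_pod_per_role(pods)
```

With this approach, get_job() would report one Pod per component, and the older dataset-initializer Pod from the example would be dropped.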
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.