Removing assignment of service-account for launcher #1898
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Are the tests maybe a bit flaky? Tests that cover code I did not touch have failed; I think they are not related to my changes at all.
Hi @tenzen-y, thanks for your consideration regarding the flakiness of the tests. I cannot share that view. I invested quite a lot of time to make the tests work on my local machine with the changes of my pull request. For me the integration tests succeeded; those that failed did so because the underlying workloads that were executed (basic ML trainings) simply did not finish in time, because my machine is not powerful enough. Since you can see the pods running before they are terminated, you can assume the same thing happens on the GitHub runners. Also, the errors in the GitHub tests are not related to my changes to the MPI jobs; the problem is within a test for the reconciler, which I did not change at all. The test job is a TensorFlow (TF) job. Looking into the logs, it looks like a race condition between two tests: one of the tests seems to destroy the namespace, after which no updates can happen anymore. I do not see it as my responsibility to fix the test stability of your application. How can we continue here?
@rpemsel Thanks for the result of your investigation! Maybe this test failure is caused by this branch being behind the master branch (it is missing this commit). Can you rebase this PR? Thanks for your effort.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: rpemsel The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
I rebased on master, but except for the volcano gang scheduler, no other changes were introduced.
Tests are failing consistently.
Yes, they are; we could also say that the tests are failing deterministically. I have already explained why they are failing. By the way: if you look at all the other pull requests, they are failing as well.
Integration tests succeeded in #1905.
Yes, for one of 11 pull requests the tests are succeeding; I don't think all the others are wrong. Please run the tests locally on your system and you will see exactly the effects I described today. It would also be much easier to debug the tests if they printed the logs of the deployed Kubernetes workloads from time to time, so one would have an idea of what is going wrong. Then it might even be possible to diagnose a failure just by looking at the output of the CI/CD pipeline.
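As an illustration of that suggestion, here is a minimal sketch of such a log-dump step, assuming a standard client-go setup and a kubeconfig in the default location; the helper name dumpNamespaceLogs and the target namespace are hypothetical and this is not code from the repository or its CI:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// dumpNamespaceLogs prints the logs of every pod in the given namespace,
// e.g. as a post-failure step in CI, so the workload output shows up in
// the pipeline output without having to rerun the tests locally.
func dumpNamespaceLogs(ctx context.Context, client kubernetes.Interface, namespace string) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		raw, err := client.CoreV1().Pods(namespace).
			GetLogs(pod.Name, &corev1.PodLogOptions{}).Do(ctx).Raw()
		if err != nil {
			fmt.Printf("could not fetch logs for pod %s: %v\n", pod.Name, err)
			continue
		}
		fmt.Printf("--- logs for pod %s ---\n%s\n", pod.Name, raw)
	}
	return nil
}

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	if err := dumpNamespaceLogs(context.Background(), client, "default"); err != nil {
		panic(err)
	}
}
```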
I suspect that this change breaks the whole controller logic or the integration tests, since I faced the same error once I applied the same change as in this PR to #1905.
Also, integration tests succeeded in #1907.
I don't agree with this suspicion: if I run the e2e tests from the main branch locally, they still fail. Unfortunately I cannot run the tests on GitHub Actions myself. Again I am asking you: how can my change break the PyTorchJob and MXJob tests if it only touches MPIJobs? That does not make any sense to me. Do you want me to send over the details of how to run the tests locally, so you can reproduce my results?
Sorry, but the tests are flaky/inconsistent in the CI/CD pipeline. This time all I did was increase the timeouts for the integration tests. They succeed locally, but still fail in the CI/CD pipeline. Please validate for yourself.
Hi @rpemsel, can you please try to rebase your PR and remove the increased timeouts for the E2E tests?
Pull Request Test Coverage Report for Build 6231245371
💛 - Coveralls
I have rebased to the latest state of master and also reset the timeouts.
@@ -1035,7 +1035,6 @@ func (jc *MPIJobReconciler) newLauncher(mpiJob *kubeflowv1.MPIJob, kubectlDelive
 		jc.PodGroupControl.DecoratePodTemplateSpec(podSpec, mpiJob, rt)
 	}
 
-	podSpec.Spec.ServiceAccountName = launcherName
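For context on what the removed assignment means at runtime: when PodSpec.ServiceAccountName is left empty, Kubernetes runs the pod under the namespace's default ServiceAccount instead of the per-job launcher ServiceAccount. A minimal sketch of the pattern, with illustrative names only and not the operator's actual newLauncher code:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// launcherPodSpec illustrates the pattern the diff removes: the launcher pod
// is pinned to a ServiceAccount named after the launcher. With the assignment
// removed, ServiceAccountName stays empty and the pod falls back to the
// namespace's "default" ServiceAccount.
func launcherPodSpec(launcherName string, assignServiceAccount bool) corev1.PodSpec {
	spec := corev1.PodSpec{
		Containers: []corev1.Container{
			// Illustrative container; the real launcher runs mpirun / kubectl.
			{Name: "mpi-launcher", Image: "example/mpi-launcher:latest"},
		},
	}
	if assignServiceAccount {
		spec.ServiceAccountName = launcherName
	}
	return spec
}

func main() {
	fmt.Printf("with assignment:    %q\n", launcherPodSpec("example-mpijob-launcher", true).ServiceAccountName)
	fmt.Printf("without assignment: %q\n", launcherPodSpec("example-mpijob-launcher", false).ServiceAccountName)
}
```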
I just tried to build the Training Operator image with your change, and the MPIJob test failed for me:
pytest sdk/python/test/e2e/test_e2e_mpijob.py --log-cli-level=info --namespace=default -k "test_sdk_e2e"
I think we can't just remove the SA assignment for the MPIJob launcher. The MPIJob launcher requires the appropriate RBAC to exec into and access the MPIJob worker pods. That is why we attach the created ServiceAccount to the MPIJob launcher.
MPI Operator experts can comment on this: @alculquicondor @tenzen-y @terrytangyuan
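For context, here is a rough sketch of the kind of namespaced RBAC such a launcher ServiceAccount needs; the function name, role name, and exact rules are assumptions for illustration, not the operator's actual code:

```go
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// launcherRole sketches the permissions the launcher ServiceAccount is assumed
// to need so the launcher can find the worker pods and exec into them.
func launcherRole(namespace, launcherName string) *rbacv1.Role {
	return &rbacv1.Role{
		ObjectMeta: metav1.ObjectMeta{Name: launcherName, Namespace: namespace},
		Rules: []rbacv1.PolicyRule{
			{
				// Discover and read the worker pods.
				APIGroups: []string{""},
				Resources: []string{"pods"},
				Verbs:     []string{"get", "list", "watch"},
			},
			{
				// Allow `kubectl exec` into the worker pods.
				APIGroups: []string{""},
				Resources: []string{"pods/exec"},
				Verbs:     []string{"create"},
			},
		},
	}
}

func main() {
	role := launcherRole("default", "example-mpijob-launcher")
	fmt.Printf("Role %s grants %d rules in namespace %s\n", role.Name, len(role.Rules), role.Namespace)
}
```

Without a ServiceAccount bound to such a Role, the launcher runs with whatever the default ServiceAccount allows, which typically does not include pods/exec, which matches the observed MPIJob e2e failure.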
It doesn't sound right to remove the SA.
That said, we got rid of it in v2, because we don't use kubectl exec.
Correcting author
…/training-operator into launcher-service-account
…/training-operator into launcher-service-account
I took up your reviews and avoided removing the service account. Instead I created another solution, but I will need to open a new pull request, since there is a problem with the author I committed as, for whom I will not be able to sign the CLA.
What this PR does / why we need it:
Which issue(s) this PR fixes
This PR fixes the following issue: #1897
Checklist: