-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ci: Multi-tenancy for tests and garbage collection #9179
Conversation
f34e2d7
to
aae74a2
Compare
aae74a2
to
16f4454
Compare
65799c4
to
79fd28a
Compare
16f4454
to
c6dc27a
Compare
Hey @pablo-garay , this is an proposal how we could make our CI pipeline a bit more stable. We should/could isolate folders that are being written to, and garbage-collect them by The only caveat I can see is that templating our tests might become a little bit more tricky. But I don't think it would be a blocker. See #9177 for reference |
@@ -270,10 +270,15 @@ jobs: | |||
- name: Checkout repository | |||
uses: actions/checkout@v4 | |||
- run: | | |||
mkdir -p /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just make it be:
mkdir -p /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }};
python scripts/checkpoint_converters/convert_starcoder_hf_to_nemo.py \
--input_name_or_path /home/TestData/nlp/megatron_gpt/starcoder-ci-hf \
--output_path /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}
rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/megatron_starcoder_tp1_pp1.nemo;
rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for another Cleanup step. Make it all in just one script
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that would help us to keep a more unified test case template, but we would need a different (maybe scheduled?) workflow that cleans up directories that weren't cleaned up due to workflow cancellation.
not a big issue, what do you prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we would need a different (maybe scheduled?) workflow that cleans up directories that weren't cleaned up due to workflow cancellation.
Ah good point. Let's do as you have it here then. Scheduled workflow will also have same issue given other workflow can be running that test & it will be synchonization mess.
Leave it as it is now then :) and please kindly do this analogous change for all the other tests prefixed "L2_Community_LLM_Checkpoints_tests_"
I.e. all the tests that are affected by this:
rm -rvf /mnt/datadrive/TestData/nlp/megatron_llama/llama3-ci-hf/model_weights
rm -rvf /mnt/datadrive/TestData/nlp/megatron_llama/model_weights
rm -rvf /mnt/datadrive/TestData/nlp/megatron_llama/llama3-ci-hf/model_weights
rm -rvf /mnt/datadrive/TestData/nlp/megatron_gpt/falcon-ci-hf/model_weights
rm -rvf /mnt/datadrive/TestData/nlp/megatron_gpt/starcoder-ci-hf/model_weights
Fantastic proposal We need this for all tests whose name start with prefix: L2_Community_LLM_Checkpoints_tests_ |
bee87e7
to
ff4014d
Compare
24f2f3c
to
226245e
Compare
if: "always()" | ||
run: | | ||
rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/megatron_starcoder_tp1_pp1.nemo; | ||
rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}/ | ||
- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" | ||
if: "failure()" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just need to do for all the tests that I mentioned to you offline.
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
32950d4
to
3b0377c
Compare
run: | | ||
rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/megatron_starcoder_tp1_pp1.nemo; | ||
rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}/ | ||
- name: Cleanup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why there are two cleanups
Will merge & can do follow up PR to sort this out if/as needed
* ci: Multi-tenancy for tests and garbage collection Signed-off-by: Oliver Koenig <[email protected]> * add remaining testcases Signed-off-by: Oliver Koenig <[email protected]> --------- Signed-off-by: Oliver Koenig <[email protected]> Signed-off-by: Boxiang Wang <[email protected]>
* ci: Multi-tenancy for tests and garbage collection Signed-off-by: Oliver Koenig <[email protected]> * add remaining testcases Signed-off-by: Oliver Koenig <[email protected]> --------- Signed-off-by: Oliver Koenig <[email protected]> Signed-off-by: Jan Lasek <[email protected]>
* ci: Multi-tenancy for tests and garbage collection Signed-off-by: Oliver Koenig <[email protected]> * add remaining testcases Signed-off-by: Oliver Koenig <[email protected]> --------- Signed-off-by: Oliver Koenig <[email protected]>
What does this PR do ?
More robust CI pipeline by creating isolated directories for job-outputs.
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information