Use cgroups v2 pkg #23

Merged
merged 7 commits into from
Jan 1, 2024

Conversation

mahendrapaipuri
Owner

  • Use containerd's cgroups package to read metrics for cgroups v2, just as we already do for v1
  • Merge the NVIDIA collector into the SLURM collector
  • Update tests and add more test scenarios

* Use containerd cgroups pkg to read cgroups stats

* Add slurm_job to metric name to be consistent across collectors

* Add cpu and memory PSI metrics

* Walk through pids of cgroup only to get job details

Signed-off-by: Mahendra Paipuri <[email protected]>
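
For reference, cgroups v2 exposes PSI data in files such as `cpu.pressure` and `memory.pressure`, one line per kind (`some` or `full`). The PR reads these through containerd's cgroups package; the sketch below is only a standard-library illustration of the file format, and the `PSILine` type and `parsePSILine` function are hypothetical names that are not part of this PR.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// PSILine holds one line of a cgroup v2 pressure file (hypothetical type).
type PSILine struct {
	Kind   string // "some" or "full"
	Avg10  float64
	Avg60  float64
	Avg300 float64
	Total  uint64 // cumulative stall time in microseconds
}

// parsePSILine parses a line of the form
// "some avg10=0.00 avg60=0.00 avg300=0.00 total=0".
func parsePSILine(line string) (PSILine, error) {
	fields := strings.Fields(line)
	if len(fields) != 5 {
		return PSILine{}, fmt.Errorf("unexpected PSI line: %q", line)
	}
	out := PSILine{Kind: fields[0]}
	for _, f := range fields[1:] {
		key, val, ok := strings.Cut(f, "=")
		if !ok {
			return PSILine{}, fmt.Errorf("malformed field %q", f)
		}
		var err error
		switch key {
		case "avg10":
			out.Avg10, err = strconv.ParseFloat(val, 64)
		case "avg60":
			out.Avg60, err = strconv.ParseFloat(val, 64)
		case "avg300":
			out.Avg300, err = strconv.ParseFloat(val, 64)
		case "total":
			out.Total, err = strconv.ParseUint(val, 10, 64)
		default:
			err = fmt.Errorf("unknown key %q", key)
		}
		if err != nil {
			return PSILine{}, err
		}
	}
	return out, nil
}

func main() {
	l, err := parsePSILine("some avg10=1.50 avg60=0.75 avg300=0.10 total=123456")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", l)
}
```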
* Test when slurm prolog files are not present

* This tests if we are able to read procfs correctly and get env vars

* Update test fixtures

Signed-off-by: Mahendra Paipuri <[email protected]>
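
To illustrate the procfs read that this test covers: `/proc/<pid>/environ` holds a process's environment as NUL-separated `KEY=VALUE` entries. The sketch below is a minimal standard-library version, not the code from this PR; `parseEnviron` is a hypothetical helper, and looking up `SLURM_JOB_ID` is an assumption about which variable is of interest.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// parseEnviron splits the NUL-separated content of /proc/<pid>/environ
// into a map. Entries without "=" or with an empty key are skipped.
func parseEnviron(raw []byte) map[string]string {
	env := make(map[string]string)
	for _, entry := range bytes.Split(raw, []byte{0}) {
		key, val, ok := bytes.Cut(entry, []byte("="))
		if !ok || len(key) == 0 {
			continue
		}
		env[string(key)] = string(val)
	}
	return env
}

func main() {
	// Reading our own environ keeps the example runnable without a job.
	raw, err := os.ReadFile("/proc/self/environ")
	if err != nil {
		fmt.Println("procfs not readable:", err)
		return
	}
	env := parseEnviron(raw)
	// Assumption: the job ID is exported as SLURM_JOB_ID.
	fmt.Println("SLURM_JOB_ID =", env["SLURM_JOB_ID"])
}
```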
* Merge the GPU jobID map into the SLURM collector. This is a more logical organization than having a separate collector

* Don't report swap and PSI metrics by default. They can be enabled using a CLI flag

* Add a hidden flag to force cgroups version for testing

* Refactor certain receivers to be cleaner

* Add more unit tests to cover more scenarios

* Add test fixtures to be able to unit test

* Add more e2e test scenarios

Signed-off-by: Mahendra Paipuri <[email protected]>
* Map iteration order is unspecified in Go and not reproducible

* To ensure consistent behaviour, we use an int as the map key and iterate over the index range

* This avoids unit test failures, as the order of the gpuOrdinals slice matters in cmp

Signed-off-by: Mahendra Paipuri <[email protected]>
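
The ordering fix described in this commit can be sketched as follows. This is a minimal illustration that assumes contiguous integer keys starting at 0; `orderedValues` and the sample UUID strings are hypothetical and not taken from the PR.

```go
package main

import "fmt"

// orderedValues returns the values of m in ascending key order.
// It assumes the keys are the contiguous integers 0..len(m)-1, so
// indexing by a counter replaces the unordered `for k, v := range m`.
func orderedValues(m map[int]string) []string {
	out := make([]string, 0, len(m))
	for i := 0; i < len(m); i++ {
		if v, ok := m[i]; ok {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	gpus := map[int]string{2: "GPU-cc", 0: "GPU-aa", 1: "GPU-bb"}
	fmt.Println(orderedValues(gpus)) // prints [GPU-aa GPU-bb GPU-cc]
}
```

Because the result no longer depends on map iteration order, a slice built this way compares equal across runs, which is what a cmp-based unit test needs.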
@mahendrapaipuri mahendrapaipuri merged commit 7a2fe8a into main Jan 1, 2024
5 checks passed
@mahendrapaipuri mahendrapaipuri deleted the use_cgroups_v2_pkg branch January 1, 2024 12:36
@mahendrapaipuri mahendrapaipuri added enhancement New feature or request maintenance General maintenance labels Jan 1, 2024