Major refactor to improve performance of exporter #204

Merged 1 commit on Oct 29, 2024

mahendrapaipuri
Owner

  • Instead of each (sub-)collector discovering the resources it needs (cgroups, procs, etc.) individually, centralise the operation in the cgroup sub-collector.

  • The cgroup sub-collector discovers all active cgroups, their children, and the processes they contain, and passes this information to downstream sub-collectors. This lets different sub-collectors share the same information without collecting it again.

  • IMPORTANT: we learned that reading files in /proc is very expensive, as it involves the kernel taking a lot of spin locks. Previously, we fetched the relevant processes for the perf collector by traversing the /proc file system and reading the cgroup of each process to build this info. This turned out to be very expensive, so we now collect all the info from cgroups once, using the cgroup sub-collector, and pass it to downstream components.

  • Remove support for getting GPU ordinals from a file created by the SLURM prolog. The exporter needs quite a few privileges now, so there is no point in supporting this functionality. It is easier for operators to get ordinals from env vars, as it involves less configuration.

  • Correct Prometheus and Grafana config docs.

  • Remove SLURM prolog/epilog config files and systemd service file that will no longer be supported.

Signed-off-by: Mahendra Paipuri <[email protected]>
@mahendrapaipuri added the enhancement (New feature or request) and maintenance (General maintenance) labels on Oct 27, 2024
@mahendrapaipuri merged commit 099c6b7 into main on Oct 29, 2024
15 checks passed
@mahendrapaipuri deleted the refactor_proc_discoverer branch on October 29, 2024, 13:32