Major refactor to improve performance of exporter #204

mahendrapaipuri · 2024-10-27T09:06:11Z

Instead of each (sub-)collector discovering the resources it needs (cgroups, procs, etc.) individually, centralise the operation in the cgroup sub-collector.
The cgroup sub-collector discovers all active cgroups, their children and the processes contained within them, and passes this information to the downstream sub-collectors. This ensures that the same information is shared across sub-collectors without having to collect it again.
An IMPORTANT thing we learned is that reading files in /proc is very expensive, as it involves the kernel taking a lot of spin locks. Previously, we fetched the relevant processes for the perf collector by traversing the /proc file system, reading the cgroup of each process and building this info. This turned out to be very expensive, so we now collect all the info from cgroups once, using the cgroup sub-collector, and pass it to downstream components.
Remove support for getting GPU ordinals using a file created by the SLURM prolog. The exporter needs quite a few privileges now, so there is no point in supporting this functionality. It is easier for operators to get the ordinals from env vars, as it involves less configuration.
Correct Prometheus and Grafana config docs.
Remove SLURM prolog/epilog config files and systemd service file that will no longer be supported.

Signed-off-by: Mahendra Paipuri <[email protected]>

mahendrapaipuri added the enhancement and maintenance labels Oct 27, 2024

mahendrapaipuri merged commit 099c6b7 into main Oct 29, 2024
15 checks passed

mahendrapaipuri deleted the refactor_proc_discoverer branch October 29, 2024 13:32
