Major refactor to improve performance of exporter #204
Instead of each (sub-)collector discovering the resources it needs (cgroups, procs, etc.) individually, centralise discovery in the cgroup sub-collector.
The cgroup sub-collector discovers all active cgroups, their children and the processes they contain, and passes this information to the downstream sub-collectors. All sub-collectors then work from the same information without collecting it again.
IMPORTANT: we learned that reading files in /proc is very expensive because the kernel takes a lot of spin locks. Previously we fetched the relevant processes for the perf collector by traversing the /proc file system, reading the cgroup of each process and building this info from that. This turned out to be very expensive, so we now collect all the info from cgroups once, using the cgroup sub-collector, and pass it to the downstream components.
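The idea of discovering once and sharing the result can be sketched as below. This is only an illustration in Python, not the exporter's code: `CgroupInfo`, `discover_cgroups` and `perf_targets` are hypothetical names, and the sketch assumes a cgroup tree in which every cgroup directory exposes a `cgroup.procs` file listing its member PIDs.

```python
import os
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class CgroupInfo:
    """One discovered cgroup with its child cgroups and member PIDs."""
    path: str
    children: list = field(default_factory=list)
    pids: list = field(default_factory=list)


def discover_cgroups(root):
    """Walk the cgroup tree once and read each cgroup.procs file.

    Process membership comes from cgroup.procs, so nothing under
    /proc is opened during discovery.
    """
    found = {}
    for dirpath, dirnames, filenames in os.walk(root):
        info = CgroupInfo(path=dirpath)
        info.children = sorted(os.path.join(dirpath, d) for d in dirnames)
        if "cgroup.procs" in filenames:
            text = Path(dirpath, "cgroup.procs").read_text()
            info.pids = [int(tok) for tok in text.split()]
        found[dirpath] = info
    return found


def perf_targets(cgroups):
    """Hypothetical downstream consumer: PIDs to profile, per cgroup.

    It reuses the already discovered mapping instead of scanning again.
    """
    return {path: info.pids for path, info in cgroups.items() if info.pids}
```

A caller would run `discover_cgroups` once per scrape and hand the returned mapping to every downstream consumer such as `perf_targets`, so the file system is walked a single time.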
Remove support for getting GPU ordinals from a file created by the SLURM prolog. The exporter already needs quite a few privileges, so there is no point in supporting this functionality any longer. It is easier for operators to get the ordinals from env vars, as that involves less configuration.
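A minimal sketch of the env var approach, for illustration only. It assumes the ordinals arrive as a comma-separated value such as `0,2`. The variable names `SLURM_STEP_GPUS` and `SLURM_JOB_GPUS` are an assumption about which variables are consulted, and both function names are hypothetical. How the exporter obtains the job's environment is not covered here.

```python
def parse_gpu_ordinals(value):
    """Parse a comma-separated list such as "0,2,3" into sorted unique ints.

    Tokens that are not plain decimal numbers are ignored.
    """
    ordinals = set()
    for tok in value.split(","):
        tok = tok.strip()
        if tok.isdecimal():
            ordinals.add(int(tok))
    return sorted(ordinals)


def gpu_ordinals_from_env(env, names=("SLURM_STEP_GPUS", "SLURM_JOB_GPUS")):
    """Return ordinals from the first non-empty variable in `names`.

    `env` is a mapping of the job's environment variables; the default
    variable names are an assumption made for this sketch.
    """
    for name in names:
        if env.get(name):
            return parse_gpu_ordinals(env[name])
    return []
```

With this approach the operator does not have to deploy a prolog script or a shared file; the only input is the job's environment.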
Correct Prometheus and Grafana config docs.
Remove SLURM prolog/epilog config files and systemd service file that will no longer be supported.