nvidiasmi_exporter

Prometheus exporter for GPU statistics from nvidia-smi, written in Go. Supports multiple GPUs.

In addition to nvidia-smi info, it also exports information about processes using the GPU, the Docker containers those processes belong to, and PCIe error statistics.

This version is not dockerized because it requires access to the Docker daemon and the /sys filesystem.

Setup

$ make 
$ sudo make install

After that, systemd will start the exporter automatically on boot.
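
To verify the installation, check the service status and fetch a few metrics from the exporter. This is a minimal sanity check; the /metrics path follows the usual Prometheus exporter convention and is assumed here.

$ systemctl status nvidiasmi_exporter
$ curl -s http://localhost:9202/metrics | head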

Usage

By default, nvidiasmi_exporter listens on port 9202.
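
A minimal Prometheus scrape configuration for this exporter could look like the sketch below (illustrative only; adjust the target to wherever the exporter actually runs):

scrape_configs:
  - job_name: nvidiasmi
    static_configs:
      - targets: ['localhost:9202']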

CLI args (specify in /etc/systemd/system/nvidiasmi_exporter.service; see the example unit excerpt after this list):

--listen
    Address to listen on (default 9202).

--nvidia-smi-path
    Path to nvidia-smi (default /usr/bin/nvidia-smi).

--update-interval
    How often to run nvidia-smi (default 5s).

--test-file
    Run in test mode (read nvidia-smi XML output from the specified file).
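
As an illustration, the flags can be passed on the ExecStart line of the unit file. The excerpt below is a sketch only: the install path of the binary and the exact flag values are assumptions, and the unit file installed by make install may differ.

# /etc/systemd/system/nvidiasmi_exporter.service (excerpt, illustrative)
[Service]
ExecStart=/usr/local/bin/nvidiasmi_exporter --listen :9202 --nvidia-smi-path /usr/bin/nvidia-smi --update-interval 10s

After editing the unit file, reload systemd and restart the service (systemctl daemon-reload, then systemctl restart nvidiasmi_exporter).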

VRAM temperatures

To monitor VRAM temperature on RTX 3000 / 4000 series GPUs, compile and install https://github.com/500farm/gddr6 as described in its README. Check that /usr/local/bin/gddr6 exists, gives reasonable values, and does not crash the system.
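
A quick existence check before relying on the memory temperature metric (running gddr6 itself and interpreting its output is covered in its own README):

$ test -x /usr/local/bin/gddr6 && echo "gddr6 installed" || echo "gddr6 missing"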

Credits

Based on the work of Kristoph Junge and Michaël Ferrand.

Example output with annotations

### Driver info
nvidiasmi_info{attached_gpus="1",cuda_version="11.4",driver_version="470.63"} 1.0

### nvidia-smi info
nvidiasmi_pci_pcie_gen_max{gpu_id="46:00.0"} 3
nvidiasmi_pci_pcie_gen_current{gpu_id="46:00.0"} 3
nvidiasmi_pci_link_width_max_multiplicator{gpu_id="46:00.0"} 16
nvidiasmi_pci_link_width_current_multiplicator{gpu_id="46:00.0"} 16
nvidiasmi_pci_replay_counter{gpu_id="46:00.0"} 0
nvidiasmi_pci_replay_rollover_counter{gpu_id="46:00.0"} 0
nvidiasmi_pci_tx_util_bytes_per_second{gpu_id="46:00.0"} 5.9e+07
nvidiasmi_pci_rx_util_bytes_per_second{gpu_id="46:00.0"} 4.67e+08
nvidiasmi_fan_speed_percent{gpu_id="46:00.0"} 30
nvidiasmi_performance_state_int{gpu_id="46:00.0"} 2
nvidiasmi_fb_memory_usage_total_bytes{gpu_id="46:00.0"} 2.5446842368e+10
nvidiasmi_fb_memory_usage_used_bytes{gpu_id="46:00.0"} 2.4920457216e+10
nvidiasmi_fb_memory_usage_free_bytes{gpu_id="46:00.0"} 5.26385152e+08
nvidiasmi_bar1_memory_usage_total_bytes{gpu_id="46:00.0"} 2.68435456e+08
nvidiasmi_bar1_memory_usage_used_bytes{gpu_id="46:00.0"} 7.340032e+06
nvidiasmi_bar1_memory_usage_free_bytes{gpu_id="46:00.0"} 2.61095424e+08
nvidiasmi_utilization_gpu_percent{gpu_id="46:00.0"} 23
nvidiasmi_utilization_memory_percent{gpu_id="46:00.0"} 1
nvidiasmi_utilization_encoder_percent{gpu_id="46:00.0"} 0
nvidiasmi_utilization_decoder_percent{gpu_id="46:00.0"} 0
nvidiasmi_encoder_session_count{gpu_id="46:00.0"} 0
nvidiasmi_encoder_average_fps{gpu_id="46:00.0"} 0
nvidiasmi_encoder_average_latency{gpu_id="46:00.0"} 0
nvidiasmi_fbc_session_count{gpu_id="46:00.0"} 0
nvidiasmi_fbc_average_fps{gpu_id="46:00.0"} 0
nvidiasmi_fbc_average_latency{gpu_id="46:00.0"} 0
nvidiasmi_gpu_temp_celsius{gpu_id="46:00.0"} 32
nvidiasmi_gpu_temp_max_threshold_celsius{gpu_id="46:00.0"} 98
nvidiasmi_gpu_temp_slow_threshold_celsius{gpu_id="46:00.0"} 95
nvidiasmi_gpu_temp_max_gpu_threshold_celsius{gpu_id="46:00.0"} 93
nvidiasmi_gpu_target_temp_celsius{gpu_id="46:00.0"} 75
nvidiasmi_gpu_target_temp_min_celsius{gpu_id="46:00.0"} 0
nvidiasmi_gpu_target_temp_max_celsius{gpu_id="46:00.0"} 0
nvidiasmi_memory_temp_celsius{gpu_id="46:00.0"} 0
nvidiasmi_gpu_temp_max_mem_threshold_celsius{gpu_id="46:00.0"} 0
nvidiasmi_power_state_int{gpu_id="46:00.0"} 2
nvidiasmi_power_draw_watts{gpu_id="46:00.0"} 112.56999969482422
nvidiasmi_power_limit_watts{gpu_id="46:00.0"} 375
nvidiasmi_default_power_limit_watts{gpu_id="46:00.0"} 350
nvidiasmi_enforced_power_limit_watts{gpu_id="46:00.0"} 375
nvidiasmi_min_power_limit_watts{gpu_id="46:00.0"} 100
nvidiasmi_max_power_limit_watts{gpu_id="46:00.0"} 400
nvidiasmi_clock_graphics_hertz{gpu_id="46:00.0"} 1.695e+09
nvidiasmi_clock_graphics_max_hertz{gpu_id="46:00.0"} 2.1e+09
nvidiasmi_clock_sm_hertz{gpu_id="46:00.0"} 1.695e+09
nvidiasmi_clock_sm_max_hertz{gpu_id="46:00.0"} 2.1e+09
nvidiasmi_clock_mem_hertz{gpu_id="46:00.0"} 9.501e+09
nvidiasmi_clock_mem_max_hertz{gpu_id="46:00.0"} 9.751e+09
nvidiasmi_clock_video_hertz{gpu_id="46:00.0"} 1.515e+09
nvidiasmi_clock_video_max_hertz{gpu_id="46:00.0"} 1.95e+09
nvidiasmi_clock_policy_auto_boost{gpu_id="46:00.0"} 0
nvidiasmi_clock_policy_auto_boost_default{gpu_id="46:00.0"} 0
nvidiasmi_clocks_throttle_reason_gpu_idle{gpu_id="46:00.0"} 0
nvidiasmi_clocks_throttle_reason_applications_clocks_setting{gpu_id="46:00.0"} 0
nvidiasmi_clocks_throttle_reason_sw_power_cap{gpu_id="46:00.0"} 0
nvidiasmi_clocks_throttle_reason_hw_slowdown{gpu_id="46:00.0"} 0
nvidiasmi_clocks_throttle_reason_hw_thermal_slowdown{gpu_id="46:00.0"} 0
nvidiasmi_clocks_throttle_reason_hw_power_brake_slowdown{gpu_id="46:00.0"} 0
nvidiasmi_clocks_throttle_reason_sync_boost{gpu_id="46:00.0"} 0
nvidiasmi_clocks_throttle_reason_sw_thermal_slowdown{gpu_id="46:00.0"} 0
nvidiasmi_clocks_throttle_reason_display_clocks_setting{gpu_id="46:00.0"} 0

### GPU name and UUID
nvidiasmi_gpu_info{device="GA102 [GeForce RTX 3090]",gpu_id="46:00.0",gpu_name="NVIDIA GeForce RTX 3090",gpu_uuid="GPU-325bf28b-e1e1-0628-3678-06673bdb76fd",subsys_device="Device 147d",subsys_vendor="NVIDIA Corporation",vendor="NVIDIA Corporation"} 1.0

### PCIe AER counters
nvidiasmi_aer_counter{aer_type="fatal",gpu_id="46:00.0"} 0
nvidiasmi_aer_counter{aer_type="non-fatal",gpu_id="46:00.0"} 0
nvidiasmi_aer_counter{aer_type="correctable",gpu_id="46:00.0"} 0

### Process/container info
nvidiasmi_process_up{gpu_id="46:00.0",pid="3890000",process_type="C"} 1.0
nvidiasmi_process_used_memory_bytes{gpu_id="46:00.0",pid="3890000",process_type="C"} 2.4917311488e+10
nvidiasmi_process_start_timestamp{pid="3890000"} 1633035666.200000
nvidiasmi_process_container_start_timestamp{pid="3890000"} 1632969100.384305
nvidiasmi_process_info{container_id="/docker/e3b272ff36976664996c4c537816ba712d63725e7b5f6d121e875a1d738c4f4a",container_name="C.1329067",docker_image="pytorch/pytorch",pid="3890000",process_name="/usr/bin/python3.6"} 1.0
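
As a hedged illustration, the metrics above can be combined in PromQL, for example to track framebuffer memory usage per GPU or to flag hot GPUs (the 85 °C threshold is an example, not a recommendation):

# Framebuffer memory usage ratio per GPU
nvidiasmi_fb_memory_usage_used_bytes / nvidiasmi_fb_memory_usage_total_bytes

# GPUs currently above 85 °C
nvidiasmi_gpu_temp_celsius > 85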
