
nvidia_smi plugin error #1056

Open
Derek-zd opened this issue Sep 19, 2024 · 8 comments

Comments

@Derek-zd

Relevant config.toml

# interval = 15

# exec local command
# e.g. nvidia_smi_command = "nvidia-smi"
nvidia_smi_command = "nvidia-smi"

# exec remote command
# nvidia_smi_command = "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null SSH_USER@SSH_HOST nvidia-smi"

# Comma-separated list of the query fields.
# You can find out possible fields by running `nvidia-smi --help-query-gpus`.
# The value `AUTO` will automatically detect the fields to query.
query_field_names = "AUTO"

# query_timeout is used to set the query timeout, to avoid stalling data collection.
query_timeout = "5s"

Logs from categraf

Sep 19 16:23:52 zj-4090-59 categraf[79833]: 2024/09/19 16:23:52 metrics_agent.go:276: E! failed to init input: local.nvidia_smi error: unexpected query field: vgpu_driver_capability.heterogenous_multivGPU

System info

Ubuntu 22.04

Docker

No response

Steps to reproduce

1. Enable the nvidia_smi plugin.
2. Monitoring data is collected normally.
3. A GPU fails and drops off the bus. Typically the nvidia-smi command hangs and never returns, or fails with "Unable to determine the device handle for GPU 0000:CF:00.0: Unknown Error" (non-zero return code). Running nvidia-smi -L still lists both the healthy cards and the failed one.
   (screenshot 20240919-181818 omitted)
4. categraf stops reporting any GPU metrics, so alerting no longer works.

Expected behavior

When one GPU drops off, metrics for the remaining healthy GPUs should continue to be reported normally.

Actual behavior

No GPU-related metrics are reported at all.

Additional info

No response

@kongfei605
Collaborator

What about enabling nvidia_timeout?

@Derek-zd
Author

(screenshot img_v3_02et_5caedd98-552d-4db7-829d-bb3c3254286h omitted)
This is the error from running nvidia-smi --query-gpu, shown in red in the screenshot.

@Derek-zd
Author

Derek-zd commented Sep 20, 2024

What about enabling nvidia_timeout?

Do you mean query_timeout = "5s" in the config? That is already enabled.
When nvidia-smi hangs, it stays hung no matter how long you wait past the timeout. I can accept that case: the plugin simply cannot work, and there is nothing it can do.
But for the case in the screenshot I just posted, damn, the command the plugin calls does return, only with an error, and apparently that cannot be handled either. Seems like a dead end.

@kongfei605
Collaborator

That shouldn't happen; after the timeout, a kill command is invoked.

@Derek-zd
Author

That shouldn't happen; after the timeout, a kill command is invoked.

Killing it doesn't help. The next query hangs again, gets killed again, over and over in a loop. Ever since the GPU failed, no monitoring data has been reported.

@kongfei605
Collaborator

Then you should fix the fault. The source is down; do you expect the collector to repair it for you?

@Derek-zd
Author

What I mean is that monitoring collects no failure information, so there is no way to configure a corresponding alert.

@kongfei605
Collaborator

For the no-data case you can use functions like absent().
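A sketch of what such an alert could look like as a Prometheus rule. The metric name nvidia_smi_utilization_gpu is an assumption for illustration; substitute whichever series your nvidia_smi plugin actually emits:

```yaml
groups:
  - name: gpu
    rules:
      - alert: GpuMetricsAbsent
        # absent() returns 1 when no matching series exists, which covers
        # the "plugin stopped reporting entirely" case described above.
        # Metric name is hypothetical; replace with your real series.
        expr: absent(nvidia_smi_utilization_gpu)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "nvidia_smi metrics stopped (possible GPU fall-off or hung nvidia-smi)"
```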
