nvidia_smi plugin reports an error #1056
Comments
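What about enabling nvidia_timeout?
Do you mean query_timeout = "5s" in the config? That is already enabled.
Then that shouldn't happen. Once the timeout expires, the kill command is invoked.
Killing it doesn't help. The next query hangs again, gets killed again, and so on in a loop. Since the GPU failed, no monitoring data has been reported at all.
Then you should fix the fault. The source itself is down; do you expect the collector to repair it for you?
My point is that monitoring cannot collect any information about the fault, so I cannot configure a corresponding alert.
For no data, you can use a function such as absent.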
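To make the exchange above concrete, here is a minimal Python sketch of the timeout-then-kill pattern, extended so that a failed query still produces a data point that an alert could match. This is not categraf's actual implementation. The function names and the metric name gpu_query_ok are hypothetical, and the hung query is simulated with a sleeping child process rather than a real GPU command.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (ok, detail).

    ok is False when the command times out, cannot be started,
    or exits with a non-zero return code.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return False, "timeout"
    except OSError as exc:
        # For example, the binary is not installed.
        return False, str(exc)
    if proc.returncode != 0:
        return False, proc.stderr.strip() or proc.stdout.strip()
    return True, proc.stdout


def query_status_sample(cmd, timeout_s):
    """Build a hypothetical 0/1 sample so that a failed query is still data."""
    ok, detail = run_with_timeout(cmd, timeout_s)
    return {
        "name": "gpu_query_ok",  # hypothetical metric name
        "value": 1 if ok else 0,
        "detail": "" if ok else detail,
    }


if __name__ == "__main__":
    # Stand-in for a hung query: a child process that sleeps past the timeout.
    hung = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(query_status_sample(hung, timeout_s=0.5))
```

With a sample like this, an alert can fire on the value 0 instead of relying only on absent-style checks for missing series. One caveat: a process stuck in an uninterruptible kernel wait may not exit when killed, in which case the wait after the kill can itself block.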
Relevant config.toml
Logs from categraf
System info
Ubuntu 22.04
Docker
No response
Steps to reproduce
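1. Enable the nvidia_smi plugin.
2. Monitoring data is reported normally.
3. A GPU fails and drops off. This typically shows up as the nvidia-smi command hanging without returning, or failing with "Unable to determine the device handle for GPU0000:CF:00.0: Unknown Error" and a non-zero return code. Running nvidia-smi -L shows both the healthy cards and the faulty card.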
...
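As a rough illustration of step 3, the sketch below separates healthy cards from error lines in nvidia-smi -L style output. The sample text is made up for illustration (only the error line comes from this report), the healthy-line format "GPU N: name (UUID: ...)" is an assumption about the listing layout, and split_gpu_listing is a hypothetical helper, not part of categraf.

```python
import re

# Assumed layout of a healthy entry: "GPU 0: <name> (UUID: <uuid>)".
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: ([^)]+)\)$")

# Made-up sample listing; the UUIDs and model names are placeholders.
SAMPLE = """\
GPU 0: NVIDIA A100 (UUID: GPU-aaaaaaaa-0000-0000-0000-000000000000)
Unable to determine the device handle for GPU0000:CF:00.0: Unknown Error
GPU 2: NVIDIA A100 (UUID: GPU-bbbbbbbb-0000-0000-0000-000000000000)
"""


def split_gpu_listing(text):
    """Return (healthy, errors) parsed from listing text.

    healthy is a list of dicts with index, name, and uuid;
    errors is a list of every non-empty line that is not a GPU entry.
    """
    healthy, errors = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = GPU_LINE.match(line)
        if match:
            healthy.append(
                {
                    "index": int(match.group(1)),
                    "name": match.group(2),
                    "uuid": match.group(3),
                }
            )
        else:
            errors.append(line)
    return healthy, errors


if __name__ == "__main__":
    healthy, errors = split_gpu_listing(SAMPLE)
    print("healthy indexes:", [gpu["index"] for gpu in healthy])
    print("error lines:", errors)
```

A collector could use a split like this to keep querying the healthy cards one by one while reporting the error lines as a fault signal.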
Expected behavior
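When a GPU drops off, monitoring data reflecting the failure should be available, and monitoring data for the remaining healthy cards should continue to be reported normally.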
Actual behavior
GPU-related monitoring data is no longer reported.
Additional info
No response