A node crashed.
It was running Solana version 1.16.21-jito with the --cuda flag.
The operating system is Ubuntu 22.04.3 LTS.
The CPU is a Threadripper PRO 5995WX (64 cores).
There are two GPUs, both NVIDIA GeForce RTX 4090.
The RAM in the machine is ECC.
The node ran very well on this machine with the --cuda flag for a long time (about 6 months). The machine was upgraded to 1.16.21 two days ago and worked after that upgrade until last night, when it crashed.
From my monitoring it looks like the crash happened directly on the epoch boundary.
Note: once it crashed, the process could not be killed and the server required a reboot, probably because the crash was in GPU code.
I know GPUs have been out of favour for some time, but I believe they are still supported. Let me know if that is not the case.
A similar machine with two GeForce RTX 3090 did not crash.
Here is the end of the log. If you need the full log, let me know and I will get it.
Dec 10 00:30:52 nero validator.sh[732039]: [2023-12-10T00:30:52.010135821Z INFO solana_metrics::metrics] datapoint: Gossip streamer-send-sample_duration_ms=199i streamer-send-host_count=1117i streamer-send-bytes_total=7033190i streamer-send-pkt_count_total=8215i streamer-send-host_bytes_min=132i streamer-send-host_bytes_max=472122i streamer-send-host_bytes_mean=6300i streamer-send-host_bytes_90pct=4199i streamer-send-host_bytes_50pct=1060i streamer-send-host_bytes_10pct=1060i
Dec 10 00:30:52 nero validator.sh[732039]: [2023-12-10T00:30:52.013647551Z INFO solana_metrics::metrics] datapoint: optimistic_slot_elapsed average_elapsed_ms=458i
Dec 10 00:30:52 nero validator.sh[732039]: [2023-12-10T00:30:52.013657311Z INFO solana_metrics::metrics] datapoint: optimistic_slot slot=235001827i
Dec 10 00:30:52 nero validator.sh[732039]: [2023-12-10T00:30:52.138855991Z INFO solana_metrics::metrics] datapoint: serialize_account_storage_ms duration=31i num_entries=394131i
Dec 10 00:30:52 nero validator.sh[732039]: [2023-12-10T00:30:52.210555274Z INFO solana_metrics::metrics] datapoint: Gossip streamer-send-sample_duration_ms=200i streamer-send-host_count=1088i streamer-send-bytes_total=9734992i streamer-send-pkt_count_total=11247i streamer-send-host_bytes_min=132i streamer-send-host_bytes_max=664273i streamer-send-host_bytes_mean=8951i streamer-send-host_bytes_90pct=5395i streamer-send-host_bytes_50pct=1060i streamer-send-host_bytes_10pct=1050i
Dec 10 00:30:52 nero validator.sh[732039]: [2023-12-10T00:30:52.210558464Z INFO solana_streamer::streamer] streamer send Gossip hosts: count:1088 [(208.91.107.10, SendStats { bytes: 554682, count: 640 }), (195.3.221.27, SendStats { bytes: 484534, count: 551 }), (204.16.242.201, SendStats { bytes: 471786, count: 539 }), (104.218.49.90, SendStats { bytes: 603774, count: 689 }), (15.235.83.98, SendStats { bytes: 664235, count: 751 })]
Dec 10 00:30:52 nero validator.sh[732039]: [2023-12-10T00:30:52.263789059Z INFO solana_runtime::snapshot_utils] took 3.9s for slot 235001772 at /mnt/ramdisk-snapshots/snapshot/235001772/235001772.pre
Dec 10 00:30:52 nero validator.sh[732039]: [2023-12-10T00:30:52.263807540Z INFO solana_metrics::metrics] datapoint: snapshot-bank-file slot=235001772i bank_size=726398789i status_cache_size=23830492i bank_serialize_ms=3914i add_snapshot_ms=5487i status_cache_serialize_ms=42i
Dec 10 00:30:52 nero validator.sh[732039]: ERR: unknown error cuda-ecc-ed25519/verify.cu 334
Dec 10 00:30:52 nero validator.sh[732039]: solana-validator: common/gpu_common.h:22: void cuda_assert(cudaError_t, const char*, int): Assertion `0' failed.
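To help check the epoch-boundary observation against the log above, here is a minimal sketch that pulls the last `optimistic_slot` datapoint out of log text and converts it to a position within the epoch. It assumes 432,000 slots per epoch with no warmup epochs (the mainnet-beta default, as far as I know); the function names are illustrative and not part of any Solana tooling.

```python
import re

# Assumption: mainnet-beta default epoch length, no warmup epochs.
SLOTS_PER_EPOCH = 432_000


def epoch_position(slot, slots_per_epoch=SLOTS_PER_EPOCH):
    """Return (epoch, slot_index, slots_until_next_epoch) for an absolute slot."""
    epoch, slot_index = divmod(slot, slots_per_epoch)
    return epoch, slot_index, slots_per_epoch - slot_index


def last_optimistic_slot(log_text):
    """Return the slot of the last `optimistic_slot` datapoint, or None if absent."""
    matches = re.findall(r"datapoint: optimistic_slot slot=(\d+)i", log_text)
    return int(matches[-1]) if matches else None


if __name__ == "__main__":
    # Line copied from the log excerpt above.
    sample = (
        "[2023-12-10T00:30:52.013657311Z INFO solana_metrics::metrics] "
        "datapoint: optimistic_slot slot=235001827i"
    )
    slot = last_optimistic_slot(sample)
    epoch, index, remaining = epoch_position(slot)
    print(f"slot {slot}: epoch {epoch}, index {index}, "
          f"{remaining} slots until the next epoch")
```

Running the same functions over the full log would show how close the last confirmed slot was to the boundary at the moment of the assertion failure.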
Hi @mcf-rocks - Yes, the GPU path should still be supported, but you're correct that not many folks are using it today. What version were you successfully running before things went wrong with the GPU; presumably another v1.16 release? As far as I know, nobody has been touching the GPU code lately, and if anything had changed, those changes should not have made it back to the v1.16 branch. As such, I'm currently wondering if something else might have gone wrong or changed on your system; can you still communicate with the GPU as normal?
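One quick way to answer the "can you still communicate with the GPU" question is to query the driver through `nvidia-smi`. The sketch below assumes `nvidia-smi` is installed and on `PATH`; the helper names are hypothetical, and a `None` result only means the query failed or timed out, not what the cause is.

```python
import shutil
import subprocess


def parse_gpu_rows(csv_text):
    """Parse `index, name, driver_version` CSV rows into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == 3:
            rows.append({
                "index": fields[0],
                "name": fields[1],
                "driver_version": fields[2],
            })
    return rows


def query_gpus(timeout_s=10):
    """Return the GPUs reported by nvidia-smi, or None if the query fails."""
    if shutil.which("nvidia-smi") is None:
        return None  # tool not installed or not on PATH
    try:
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=index,name,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None  # a hung driver can make nvidia-smi block
    if result.returncode != 0:
        return None
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    if gpus is None:
        print("could not query GPUs via nvidia-smi")
    else:
        for gpu in gpus:
            print(gpu)
```

If both cards show up here after the reboot, the driver is at least responding; it does not by itself rule out an intermittent hardware or driver fault.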