1.16.21 Node crashed - running with cuda flag and two RTX 4090 GPUs #34390

Open
mcf-rocks opened this issue Dec 10, 2023 · 1 comment
Labels
community Community contribution stale [bot only] Added to stale content; results in auto-close after a week.

Comments

mcf-rocks commented Dec 10, 2023

Problem

A node crashed.
It was running Solana version 1.16.21-jito with the --cuda flag.
Operating system: Ubuntu 22.04.3 LTS
CPU: Threadripper PRO 5995WX (64 cores)
GPUs: 2 x NVIDIA GeForce RTX 4090
RAM: ECC

The node ran well on this machine with the --cuda flag for about 6 months. The machine was upgraded to 1.16.21 two days ago and ran without problems after that upgrade until last night, when it crashed.

From my monitoring, it looks like the crash happened exactly at the epoch boundary.

Note: once crashed, the process could not be killed and the server required a reboot, probably because the crash was in GPU code.

I know GPUs have been out of favour for some time, but I believe they are still supported; let me know if that is not the case.

A similar machine with two GeForce RTX 3090 did not crash.

Here is the end of the log. If you need the full log, let me know and I will get it.

Dec 10 00:30:52 nero validator.sh[732039]: [2023-12-10T00:30:52.010135821Z INFO  solana_metrics::metrics] datapoint: Gossip streamer-send-sample_duration_ms=199i streamer-send-host_count=1117i streamer-send-bytes_total=7033190i streamer-send-pkt_count_total=8215i streamer-send-host_bytes_min=132i streamer-send-host_bytes_max=472122i streamer-send-host_bytes_mean=6300i streamer-send-host_bytes_90pct=4199i streamer-send-host_bytes_50pct=1060i streamer-send-host_bytes_10pct=1060i
Dec 10 00:30:52 nero validator.sh[732039]: [2023-12-10T00:30:52.013647551Z INFO  solana_metrics::metrics] datapoint: optimistic_slot_elapsed average_elapsed_ms=458i
Dec 10 00:30:52 nero validator.sh[732039]: [2023-12-10T00:30:52.013657311Z INFO  solana_metrics::metrics] datapoint: optimistic_slot slot=235001827i
Dec 10 00:30:52 nero validator.sh[732039]: [2023-12-10T00:30:52.138855991Z INFO  solana_metrics::metrics] datapoint: serialize_account_storage_ms duration=31i num_entries=394131i
Dec 10 00:30:52 nero validator.sh[732039]: [2023-12-10T00:30:52.210555274Z INFO  solana_metrics::metrics] datapoint: Gossip streamer-send-sample_duration_ms=200i streamer-send-host_count=1088i streamer-send-bytes_total=9734992i streamer-send-pkt_count_total=11247i streamer-send-host_bytes_min=132i streamer-send-host_bytes_max=664273i streamer-send-host_bytes_mean=8951i streamer-send-host_bytes_90pct=5395i streamer-send-host_bytes_50pct=1060i streamer-send-host_bytes_10pct=1050i
Dec 10 00:30:52 nero validator.sh[732039]: [2023-12-10T00:30:52.210558464Z INFO  solana_streamer::streamer] streamer send Gossip hosts: count:1088 [(208.91.107.10, SendStats { bytes: 554682, count: 640 }), (195.3.221.27, SendStats { bytes: 484534, count: 551 }), (204.16.242.201, SendStats { bytes: 471786, count: 539 }), (104.218.49.90, SendStats { bytes: 603774, count: 689 }), (15.235.83.98, SendStats { bytes: 664235, count: 751 })]
Dec 10 00:30:52 nero validator.sh[732039]: [2023-12-10T00:30:52.263789059Z INFO  solana_runtime::snapshot_utils]  took 3.9s for slot 235001772 at /mnt/ramdisk-snapshots/snapshot/235001772/235001772.pre
Dec 10 00:30:52 nero validator.sh[732039]: [2023-12-10T00:30:52.263807540Z INFO  solana_metrics::metrics] datapoint: snapshot-bank-file slot=235001772i bank_size=726398789i status_cache_size=23830492i bank_serialize_ms=3914i add_snapshot_ms=5487i status_cache_serialize_ms=42i
Dec 10 00:30:52 nero validator.sh[732039]: ERR: unknown error cuda-ecc-ed25519/verify.cu 334
Dec 10 00:30:52 nero validator.sh[732039]: solana-validator: common/gpu_common.h:22: void cuda_assert(cudaError_t, const char*, int): Assertion `0' failed.
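
For anyone searching a full log for the same failure, the two lines above can be picked out with a small filter. This is only a sketch: the patterns are modelled on the two lines in this excerpt, other versions may word them differently, and the function name `find_gpu_failures` is made up for illustration.

```python
import re

# Patterns modelled on the two failing lines in this log excerpt (assumption:
# other validator or CUDA library versions may format these differently).
CUDA_ERR = re.compile(r"ERR: (?P<msg>.+?) (?P<file>\S+\.cu) (?P<line>\d+)$")
ASSERT_FAIL = re.compile(r"Assertion `.*' failed\.$")


def find_gpu_failures(lines):
    """Return (line_number, kind, text) for each line that looks like a GPU failure."""
    hits = []
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        match = CUDA_ERR.search(text)
        if match:
            # Report the CUDA source file and line the error was raised from.
            where = f"{match['file']}:{match['line']}"
            hits.append((number, "cuda_error", f"{match['msg']} at {where}"))
        elif ASSERT_FAIL.search(text):
            hits.append((number, "assert", text))
    return hits
```

It accepts any iterable of lines, so an open log file can be passed in directly and the returned line numbers used to look at the surrounding context.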
@mcf-rocks mcf-rocks added the community Community contribution label Dec 10, 2023

steviez (Contributor) commented Dec 12, 2023

Hi @mcf-rocks - Yes, the GPU path should still be supported, but you're correct that not many folks are using it today. What version were you successfully running before things went wrong with the GPU; presumably another v1.16 release? As far as I know, nobody has touched the GPU code lately, and if it had changed, those changes should not have been backported to the v1.16 branch. As such, I'm wondering if something else might have gone wrong or changed on your system; can you still communicate with the GPU as normal?
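
One rough way to check whether the GPUs still respond is to query them through `nvidia-smi` with a timeout. The following is a sketch, not something from the validator tooling: it assumes the NVIDIA driver utilities are installed, and the helper names `parse_gpu_rows` and `query_gpus` are hypothetical.

```python
import subprocess

# Assumption: nvidia-smi is on PATH and supports these query fields.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text):
    """Parse 'index, name, temperature, utilization' CSV rows into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 4:
            continue  # skip anything that is not a plain four-column row
        index, name, temperature, utilization = fields
        rows.append({
            "index": int(index),
            "name": name,
            # Kept as strings: the tool may print "[N/A]" for some fields.
            "temperature_c": temperature,
            "utilization_pct": utilization,
        })
    return rows


def query_gpus(timeout_s=10):
    """Return (rows, error). A timeout or non-zero exit hints at a driver problem."""
    try:
        result = subprocess.run(
            QUERY, capture_output=True, text=True, timeout=timeout_s, check=True
        )
    except (FileNotFoundError, subprocess.TimeoutExpired,
            subprocess.CalledProcessError) as exc:
        return None, repr(exc)
    return parse_gpu_rows(result.stdout), None
```

If `query_gpus()` returns both cards, the driver is at least answering basic queries. If the driver is wedged badly enough that processes cannot be killed, this call itself may not return, which is also a useful signal.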

@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Dec 12, 2024