Dorado 0.8.2 crashes when calling methylation #1098

Closed
Kirk3gaard opened this issue Oct 23, 2024 · 7 comments
Labels: bug (Something isn't working), mods (For issues related to modified base calling)

Comments

Kirk3gaard commented Oct 23, 2024

Issue Report

Please describe the issue:

Dorado 0.8.2 (also) crashes when basecalling with modifications; the error message is below (see #1069).


Steps to reproduce the issue:


Run environment:

  • Dorado version: 0.8.2
  • Dorado command: dorado basecaller --device cuda:all sup current_file/ --modified-bases 4mC_5mC 6mA
  • Operating system: Ubuntu 24.04.1 LTS - CUDA 12.4
  • Hardware (CPUs, Memory, GPUs): 24 CPUs, 64 GB RAM, 2x RTX 4090
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
  • Source data location (on device or networked drive - NFS, etc.): on device
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): FLO-PRO114M, LSK114, N50 ~8 kbp, a 10 min output file of ~32 GB
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

[2024-10-23 10:18:30.024] [info] Running: "basecaller" "--device" "cuda:all" "sup" "current_file/" "--modified-bases" "4mC_5mC" "6mA"
[2024-10-23 10:18:30.064] [info]  - downloading [email protected] with httplib
[2024-10-23 10:18:32.264] [info]  - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-10-23 10:18:32.872] [info]  - downloading [email protected]_6mA@v2 with httplib
[2024-10-23 10:18:33.467] [info] > Creating basecall pipeline
[2024-10-23 10:18:34.858] [info] Calculating optimized batch size for GPU "NVIDIA GeForce RTX 4090" and model /data/zymo_fecal/.temp_dorado_model-a319234539ca708/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-10-23 10:18:34.877] [info] Calculating optimized batch size for GPU "NVIDIA GeForce RTX 4090" and model /data/zymo_fecal/.temp_dorado_model-a319234539ca708/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-10-23 10:18:39.665] [info] cuda:0 using chunk size 12288, batch size 128
[2024-10-23 10:18:39.665] [info] cuda:1 using chunk size 12288, batch size 128
[2024-10-23 10:18:40.030] [info] cuda:0 using chunk size 6144, batch size 128
[2024-10-23 10:18:40.042] [info] cuda:1 using chunk size 6144, batch size 128
[2024-10-23 10:38:38.580] [warning] Caught Torch error 'CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
', clearing CUDA cache and retrying.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7c12568389b7 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7c124fdbd115 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7c1256802958 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0x897b516 (0x7c125477b516 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x7c1256815de2 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: /data/software/dorado-0.8.2-linux-x64/bin/dorado() [0xabc9be]
frame #6: <unknown function> + 0x1196e380 (0x7c125d76e380 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0x9ca94 (0x7c124a09ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x129c3c (0x7c124a129c3c in /lib/x86_64-linux-gnu/libc.so.6)

Kirk3gaard commented Oct 24, 2024

Crashing with 1 GPU and SUP without modifications, despite a reduced batch size, when run on all files from a run:

[2024-10-23 14:58:04.161] [info] Running: "basecaller" "--batchsize" "96" "--device" "cuda:1" "sup" "/data/zymo_fecal/pod5"
[2024-10-23 14:58:04.405] [info]  - downloading [email protected] with httplib
[2024-10-23 14:58:06.863] [info] > Creating basecall pipeline
[2024-10-23 14:58:07.581] [info] cuda:1 using chunk size 12288, batch size 96
[2024-10-23 14:58:07.976] [info] cuda:1 using chunk size 6144, batch size 96
[2024-10-23 15:12:32.610] [warning] Caught Torch error 'CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
', clearing CUDA cache and retrying.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x79a4ff8389b7 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x79a4f8dbd115 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x79a4ff802958 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0xa9e9def (0x79a4ff7e9def in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0xa9f3ee7 (0x79a4ff7f3ee7 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: <unknown function> + 0xa9f4387 (0x79a4ff7f4387 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: /data/software/dorado-0.8.2-linux-x64/bin/dorado() [0x46fe60]
frame #7: <unknown function> + 0x1196e380 (0x79a50676e380 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: <unknown function> + 0x9ca94 (0x79a4f309ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c3c (0x79a4f3129c3c in /lib/x86_64-linux-gnu/libc.so.6)


Kirk3gaard commented Oct 24, 2024

Crashing with 1 GPU and 1 pod5 file with modifications:

[2024-10-24 09:24:56.802] [info] Running: "basecaller" "--batchsize" "96" "--device" "cuda:0" "sup" "current_file/" "--modified-bases" "4mC_5mC" "6mA"
[2024-10-24 09:24:56.833] [info]  - downloading [email protected] with httplib
[2024-10-24 09:24:59.084] [info]  - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-10-24 09:24:59.331] [info]  - downloading [email protected]_6mA@v2 with httplib
[2024-10-24 09:24:59.572] [info] > Creating basecall pipeline
[2024-10-24 09:25:00.454] [info] cuda:0 using chunk size 12288, batch size 96
[2024-10-24 09:25:00.763] [info] cuda:0 using chunk size 6144, batch size 96
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x72b36cc389b7 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x72b3661bd115 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x72b36cc02958 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0x905073f (0x72b36b25073f in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0x49183a5 (0x72b366b183a5 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x62 (0x72b366b18dc2 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: at::_ops::copy_::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool) + 0x7b (0x72b3676ca64b in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0x7f98445 (0x72b36a198445 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) + 0x15f (0x72b36772794f in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1b6b (0x72b366e0428b in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: <unknown function> + 0x588e1fb (0x72b367a8e1fb in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x72b3672755e5 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #12: <unknown function> + 0x56c75e3 (0x72b3678c75e3 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #13: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1f9 (0x72b3672fb9a9 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #14: at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x11b (0x72b366dfae3b in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #15: <unknown function> + 0x5a5b111 (0x72b367c5b111 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #16: at::_ops::to_dtype_layout::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x114 (0x72b36740bb54 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #17: <unknown function> + 0x56c771e (0x72b3678c771e in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #18: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x20e (0x72b36747a2ee in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #19: /data/software/dorado-0.8.2-linux-x64/bin/dorado() [0xabb917]
frame #20: /data/software/dorado-0.8.2-linux-x64/bin/dorado() [0xabd0ef]
frame #21: /data/software/dorado-0.8.2-linux-x64/bin/dorado() [0xac8845]
frame #22: /data/software/dorado-0.8.2-linux-x64/bin/dorado() [0x93ae04]
frame #23: /data/software/dorado-0.8.2-linux-x64/bin/dorado() [0x93b91c]
frame #24: <unknown function> + 0x1196e380 (0x72b373b6e380 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #25: <unknown function> + 0x9ca94 (0x72b36049ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: <unknown function> + 0x129c3c (0x72b360529c3c in /lib/x86_64-linux-gnu/libc.so.6)

blawrence-ont (Collaborator) commented

Hi,

From looking at the timings of the exceptions that were caught, these seem to happen ~20-30 mins after startup. Does this reliably happen after ~20 mins? You can time dorado ... to check. Can you also try running with the environment variable CUDA_LAUNCH_BLOCKING=1, as suggested in the crash output?
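
A minimal sketch of such a timed run, reusing the command and paths from the earlier logs (adjust the batch size and input directory to your setup):

# Timed run with CUDA launch blocking enabled; command/paths taken from the log above
time CUDA_LAUNCH_BLOCKING=1 dorado basecaller --batchsize 96 --device "cuda:all" sup pod5/ --modified-bases 4mC_5mC 6mA > modcalls.bam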

Some of the crashes in #1070 seem very similar to pytorch/pytorch#74235, so can you try some of the steps provided there to confirm that it's not a system issue? Notably, check that your BIOS is up to date.
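
For the system checks, something like the following should capture the relevant details (a sketch assuming standard Linux/NVIDIA tooling; nothing here is dorado-specific):

# Record BIOS version and GPU/driver details to compare against the vendor's latest releases
sudo dmidecode -s bios-version
nvidia-smi --query-gpu=name,driver_version,temperature.gpu,memory.total --format=csv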

Thanks,
Ben

iiSeymour added the bug and mods labels on Oct 29, 2024
Kirk3gaard (Author) commented

Hi Ben
Thanks for your reply.

It does not appear to be tied to ~20 min intervals. I tried with two different datasets and those finished fine without issues.
It could be something data-specific or a temporary issue. I will try to basecall the first dataset again and see whether it completes now; if not, I will try the suggestions.

Rasmus


Kirk3gaard commented Oct 31, 2024

Failed with CUDA_LAUNCH_BLOCKING=1 as well on the original dataset.

[2024-10-31 10:11:39.460] [info] Running: "basecaller" "--batchsize" "96" "--device" "cuda:all" "sup" "pod5/" "--modified-bases" "4mC_5mC" "6mA"
[2024-10-31 10:11:39.501] [info]  - downloading [email protected] with httplib
[2024-10-31 10:11:42.195] [info]  - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-10-31 10:11:42.442] [info]  - downloading [email protected]_6mA@v2 with httplib
[2024-10-31 10:11:42.684] [info] > Creating basecall pipeline
[2024-10-31 10:11:43.930] [info] cuda:0 using chunk size 12288, batch size 96
[2024-10-31 10:11:43.930] [info] cuda:1 using chunk size 12288, batch size 96
[2024-10-31 10:11:44.302] [info] cuda:0 using chunk size 6144, batch size 96
[2024-10-31 10:11:44.323] [info] cuda:1 using chunk size 6144, batch size 96
terminate called after throwing an instance of 'c10::CuDNNError'
  what():  cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Exception raised from _cudnn_rnn at /pytorch/pyold/aten/src/ATen/native/cudnn/RNN.cpp:1090 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7b845a4389b7 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: <unknown function> + 0x3f2256a (0x7b845392256a in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: <unknown function> + 0xa6443db (0x7b845a0443db in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0xa66717f (0x7b845a06717f in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0x513b959 (0x7b8454b3b959 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: at::_ops::_cudnn_rnn::call(at::Tensor const&, c10::ArrayRef<at::Tensor>, long, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, c10::SymInt, c10::SymInt, long, bool, double, bool, bool, c10::ArrayRef<c10::SymInt>, c10::optional<at::Tensor> const&) + 0x3ba (0x7b8454aa1cda in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: <unknown function> + 0x89f13b6 (0x7b84583f13b6 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0x89e9b7d (0x7b84583e9b7d in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: <unknown function> + 0x89ea334 (0x7b84583ea334 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: at::native::lstm(at::Tensor const&, c10::ArrayRef<at::Tensor>, c10::ArrayRef<at::Tensor>, bool, long, double, bool, bool, bool) + 0x2b2 (0x7b84544dbd32 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: <unknown function> + 0x5a5b39d (0x7b845545b39d in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: at::_ops::lstm_input::call(at::Tensor const&, c10::ArrayRef<at::Tensor>, c10::ArrayRef<at::Tensor>, bool, long, double, bool, bool, bool) + 0x265 (0x7b8454c7c295 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #12: torch::nn::LSTMImpl::forward_helper(at::Tensor const&, at::Tensor const&, at::Tensor const&, long, c10::optional<std::tuple<at::Tensor, at::Tensor> >) + 0x6dc (0x7b8457aaec9c in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #13: torch::nn::LSTMImpl::forward(at::Tensor const&, c10::optional<std::tuple<at::Tensor, at::Tensor> >) + 0xbc (0x7b8457aaee4c in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #14: dorado() [0xacfd00]
frame #15: dorado() [0xad4398]
frame #16: dorado() [0xac0370]
frame #17: dorado() [0xac04d8]
frame #18: dorado() [0xabc963]
frame #19: <unknown function> + 0x1196e380 (0x7b846136e380 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #20: <unknown function> + 0x9ca94 (0x7b844dc9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: <unknown function> + 0x129c3c (0x7b844dd29c3c in /lib/x86_64-linux-gnu/libc.so.6)

dorado-WS5_mods.sh: line 5:  4852 Aborted                 (core dumped) CUDA_LAUNCH_BLOCKING=1 dorado basecaller --batchsize 96 --device "cuda:all" sup pod5/ --modified-bases 4mC_5mC 6mA > modcalls.bam


Kirk3gaard commented Nov 7, 2024

Hmm, this issue is weird. Could it be that something within dorado handles temporary issues at startup poorly? I ran the same sbatch script with the same data on our 2x A10 node twice.

The first time it crashed and the second time it completed perfectly fine. I have managed to run the same script with different datasets (barcodes from the same run) a few times; occasionally I get an error, but it often runs smoothly when rerun.

[2024-11-05 13:24:00.927] [info] Running: "basecaller" "--device" "cuda:all" "sup" "/projects/MicroBench/data/pod5/PAW78174_barcode01/" "--modified-bases" "4mC_5mC" "6mA"
[2024-11-05 13:24:02.024] [info]  - downloading [email protected] with httplib
[2024-11-05 13:24:07.190] [info]  - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-11-05 13:24:07.991] [info]  - downloading [email protected]_6mA@v2 with httplib
[2024-11-05 13:24:08.830] [info] > Creating basecall pipeline
[2024-11-05 13:24:15.823] [info] Calculating optimized batch size for GPU "NVIDIA A10" and model /projects/MicroBench/data/.temp_dorado_model-7649e24751e89ca3/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-11-05 13:24:16.049] [info] Calculating optimized batch size for GPU "NVIDIA A10" and model /projects/MicroBench/data/.temp_dorado_model-7649e24751e89ca3/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-11-05 13:24:26.582] [info] cuda:1 using chunk size 12288, batch size 160
[2024-11-05 13:24:27.496] [info] cuda:0 using chunk size 12288, batch size 224
[2024-11-05 13:24:27.673] [info] cuda:1 using chunk size 6144, batch size 256
[2024-11-05 13:24:28.976] [info] cuda:0 using chunk size 6144, batch size 288
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
Exception raised from createCublasHandle at /pytorch/pyold/aten/src/ATen/cuda/CublasHandlePool.cpp:18 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9d0e63a9b7 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f9d07bbf115 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: <unknown function> + 0xa90879b (0x7f9d0e50a79b in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: at::cuda::getCurrentCUDABlasHandle() + 0x881 (0x7f9d0e50bfd1 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0xa903be4 (0x7f9d0e505be4 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: <unknown function> + 0xa90dbf8 (0x7f9d0e50fbf8 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: <unknown function> + 0xa915102 (0x7f9d0e517102 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0xa617dd4 (0x7f9d0e219dd4 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: <unknown function> + 0xa617e6d (0x7f9d0e219e6d in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: at::_ops::addmm::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0x1a1 (0x7f9d08cf0951 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: torch::nn::LinearImpl::forward(at::Tensor const&) + 0xa3 (0x7f9d0bc68f33 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/dorado() [0xad007a]
frame #12: /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/dorado() [0xad4398]
frame #13: /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/dorado() [0xac0370]
frame #14: /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/dorado() [0xac04d8]
frame #15: /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/dorado() [0xabc963]
frame #16: <unknown function> + 0x1196e380 (0x7f9d15570380 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #17: <unknown function> + 0x94ac3 (0x7f9d02494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #18: <unknown function> + 0x126850 (0x7f9d02526850 in /lib/x86_64-linux-gnu/libc.so.6)

/var/spool/slurm/d/job1141737/slurm_script: line 35: 674202 Aborted                 (core dumped) $DORADO basecaller --device "cuda:all" $BASECALLINGMODEL $INPUTDIR --modified-bases $MODS > $OUTPUTFILE.mod.bam
[2024-11-05 13:31:02.841] [info] Running: "basecaller" "--device" "cuda:all" "sup" "/projects/MicroBench/data/pod5/PAW78174_barcode01/" "--modified-bases" "4mC_5mC" "6mA"
[2024-11-05 13:31:03.224] [info]  - downloading [email protected] with httplib
[2024-11-05 13:31:09.031] [info]  - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-11-05 13:31:09.707] [info]  - downloading [email protected]_6mA@v2 with httplib
[2024-11-05 13:31:10.074] [info] > Creating basecall pipeline
[2024-11-05 13:31:13.366] [info] Calculating optimized batch size for GPU "NVIDIA A10" and model /projects/MicroBench/data/.temp_dorado_model-9b35920d625d03d9/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-11-05 13:31:13.407] [info] Calculating optimized batch size for GPU "NVIDIA A10" and model /projects/MicroBench/data/.temp_dorado_model-9b35920d625d03d9/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-11-05 13:31:21.970] [info] cuda:0 using chunk size 12288, batch size 224
[2024-11-05 13:31:21.970] [info] cuda:1 using chunk size 12288, batch size 224
[2024-11-05 13:31:23.429] [info] cuda:0 using chunk size 6144, batch size 352
[2024-11-05 13:31:23.469] [info] cuda:1 using chunk size 6144, batch size 448
[2024-11-05 15:05:20.751] [info] > Simplex reads basecalled: 62890
[2024-11-05 15:05:20.798] [info] > Basecalled @ Samples/s: 1.744681e+06
[2024-11-05 15:05:21.088] [info] > Finished
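
Given that a plain rerun of the same script succeeded, a hypothetical stop-gap (not an official workaround) would be to retry the basecalling command a few times from the wrapper script:

# Hypothetical retry loop: rerun dorado up to 3 times, since the crash appears intermittent
# and a plain rerun often succeeds
for attempt in 1 2 3; do
    dorado basecaller --device "cuda:all" sup pod5/ --modified-bases 4mC_5mC 6mA > modcalls.bam && break
    echo "dorado attempt ${attempt} failed, retrying..." >&2
done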

caspargross commented

@Kirk3gaard Did you find a solution to this problem, seeing that you closed the issue?
