Dorado 0.8.2 crashes when calling methylation #1098

Closed
Kirk3gaard opened this issue Oct 23, 2024 · 7 comments
Labels: bug (Something isn't working), mods (For issues related to modified base calling)

Comments

Kirk3gaard commented Oct 23, 2024

Issue Report

Please describe the issue:

Dorado 0.8.2 (also) crashes when basecalling with modifications; the error message is below (see #1069).


Steps to reproduce the issue:


Run environment:

  • Dorado version: 0.8.2
  • Dorado command: dorado basecaller --device cuda:all sup current_file/ --modified-bases 4mC_5mC 6mA
  • Operating system: Ubuntu 24.04.1 LTS - CUDA 12.4
  • Hardware (CPUs, Memory, GPUs): 24 CPUs, 64 GB RAM, 2x RTX 4090
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
  • Source data location (on device or networked drive - NFS, etc.): on device
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): FLO-PRO114M, LSK114, N50 ~8 kbp, a 10 min output file of ~32 GB
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

[2024-10-23 10:18:30.024] [info] Running: "basecaller" "--device" "cuda:all" "sup" "current_file/" "--modified-bases" "4mC_5mC" "6mA"
[2024-10-23 10:18:30.064] [info]  - downloading [email protected] with httplib
[2024-10-23 10:18:32.264] [info]  - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-10-23 10:18:32.872] [info]  - downloading [email protected]_6mA@v2 with httplib
[2024-10-23 10:18:33.467] [info] > Creating basecall pipeline
[2024-10-23 10:18:34.858] [info] Calculating optimized batch size for GPU "NVIDIA GeForce RTX 4090" and model /data/zymo_fecal/.temp_dorado_model-a319234539ca708/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-10-23 10:18:34.877] [info] Calculating optimized batch size for GPU "NVIDIA GeForce RTX 4090" and model /data/zymo_fecal/.temp_dorado_model-a319234539ca708/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-10-23 10:18:39.665] [info] cuda:0 using chunk size 12288, batch size 128
[2024-10-23 10:18:39.665] [info] cuda:1 using chunk size 12288, batch size 128
[2024-10-23 10:18:40.030] [info] cuda:0 using chunk size 6144, batch size 128
[2024-10-23 10:18:40.042] [info] cuda:1 using chunk size 6144, batch size 128
[2024-10-23 10:38:38.580] [warning] Caught Torch error 'CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
', clearing CUDA cache and retrying.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7c12568389b7 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7c124fdbd115 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7c1256802958 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0x897b516 (0x7c125477b516 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x7c1256815de2 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: /data/software/dorado-0.8.2-linux-x64/bin/dorado() [0xabc9be]
frame #6: <unknown function> + 0x1196e380 (0x7c125d76e380 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0x9ca94 (0x7c124a09ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x129c3c (0x7c124a129c3c in /lib/x86_64-linux-gnu/libc.so.6)

Kirk3gaard commented Oct 24, 2024

Crashing with 1 GPU and SUP without modifications, despite a reduced batch size, when run on all files from a run:

[2024-10-23 14:58:04.161] [info] Running: "basecaller" "--batchsize" "96" "--device" "cuda:1" "sup" "/data/zymo_fecal/pod5"
[2024-10-23 14:58:04.405] [info]  - downloading [email protected] with httplib
[2024-10-23 14:58:06.863] [info] > Creating basecall pipeline
[2024-10-23 14:58:07.581] [info] cuda:1 using chunk size 12288, batch size 96
[2024-10-23 14:58:07.976] [info] cuda:1 using chunk size 6144, batch size 96
[2024-10-23 15:12:32.610] [warning] Caught Torch error 'CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
', clearing CUDA cache and retrying.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x79a4ff8389b7 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x79a4f8dbd115 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x79a4ff802958 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0xa9e9def (0x79a4ff7e9def in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0xa9f3ee7 (0x79a4ff7f3ee7 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: <unknown function> + 0xa9f4387 (0x79a4ff7f4387 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: /data/software/dorado-0.8.2-linux-x64/bin/dorado() [0x46fe60]
frame #7: <unknown function> + 0x1196e380 (0x79a50676e380 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: <unknown function> + 0x9ca94 (0x79a4f309ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c3c (0x79a4f3129c3c in /lib/x86_64-linux-gnu/libc.so.6)


Kirk3gaard commented Oct 24, 2024

Crashing with 1 GPU and 1 pod5 file with modifications:

[2024-10-24 09:24:56.802] [info] Running: "basecaller" "--batchsize" "96" "--device" "cuda:0" "sup" "current_file/" "--modified-bases" "4mC_5mC" "6mA"
[2024-10-24 09:24:56.833] [info]  - downloading [email protected] with httplib
[2024-10-24 09:24:59.084] [info]  - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-10-24 09:24:59.331] [info]  - downloading [email protected]_6mA@v2 with httplib
[2024-10-24 09:24:59.572] [info] > Creating basecall pipeline
[2024-10-24 09:25:00.454] [info] cuda:0 using chunk size 12288, batch size 96
[2024-10-24 09:25:00.763] [info] cuda:0 using chunk size 6144, batch size 96
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x72b36cc389b7 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x72b3661bd115 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x72b36cc02958 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0x905073f (0x72b36b25073f in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0x49183a5 (0x72b366b183a5 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x62 (0x72b366b18dc2 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: at::_ops::copy_::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool) + 0x7b (0x72b3676ca64b in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0x7f98445 (0x72b36a198445 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) + 0x15f (0x72b36772794f in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1b6b (0x72b366e0428b in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: <unknown function> + 0x588e1fb (0x72b367a8e1fb in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x72b3672755e5 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #12: <unknown function> + 0x56c75e3 (0x72b3678c75e3 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #13: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1f9 (0x72b3672fb9a9 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #14: at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x11b (0x72b366dfae3b in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #15: <unknown function> + 0x5a5b111 (0x72b367c5b111 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #16: at::_ops::to_dtype_layout::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x114 (0x72b36740bb54 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #17: <unknown function> + 0x56c771e (0x72b3678c771e in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #18: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x20e (0x72b36747a2ee in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #19: /data/software/dorado-0.8.2-linux-x64/bin/dorado() [0xabb917]
frame #20: /data/software/dorado-0.8.2-linux-x64/bin/dorado() [0xabd0ef]
frame #21: /data/software/dorado-0.8.2-linux-x64/bin/dorado() [0xac8845]
frame #22: /data/software/dorado-0.8.2-linux-x64/bin/dorado() [0x93ae04]
frame #23: /data/software/dorado-0.8.2-linux-x64/bin/dorado() [0x93b91c]
frame #24: <unknown function> + 0x1196e380 (0x72b373b6e380 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #25: <unknown function> + 0x9ca94 (0x72b36049ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: <unknown function> + 0x129c3c (0x72b360529c3c in /lib/x86_64-linux-gnu/libc.so.6)

blawrence-ont (Collaborator) commented

Hi,

From looking at the timings of the exceptions that were caught, these seem to happen ~20-30 mins after startup. Does this reliably happen after ~20 mins? You can time dorado ... to check. Can you also try running with the environment variable CUDA_LAUNCH_BLOCKING=1, as suggested in the crash output?
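
A minimal sketch of such a timed run, reusing the command and paths from the earlier logs (adjust the batch size and input directory to your setup):

# Timed run with CUDA launch blocking enabled; command/paths taken from the log above
time CUDA_LAUNCH_BLOCKING=1 dorado basecaller --batchsize 96 --device "cuda:all" sup pod5/ --modified-bases 4mC_5mC 6mA > modcalls.bam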

Some of the crashes in #1070 seem very similar to pytorch/pytorch#74235, so can you try some of the steps provided there to confirm that it's not a system issue? Notably, check that your BIOS is up to date.
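
For the system checks, something like the following should capture the relevant details (a sketch assuming standard Linux/NVIDIA tooling; nothing here is dorado-specific):

# Record BIOS version and GPU/driver details to compare against the vendor's latest releases
sudo dmidecode -s bios-version
nvidia-smi --query-gpu=name,driver_version,temperature.gpu,memory.total --format=csv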

Thanks,
Ben

iiSeymour added the bug and mods labels on Oct 29, 2024
Kirk3gaard (Author) commented

Hi Ben
Thanks for your reply.

It does not appear to be tied to ~20 min intervals. I tried with two different datasets and those finished fine without issues.
It could be something data-specific or a temporary issue. I will try to basecall the first dataset again and see whether it completes now; if not, I will try the suggestions.

Rasmus


Kirk3gaard commented Oct 31, 2024

Failed with CUDA_LAUNCH_BLOCKING=1 as well on the original dataset.

[2024-10-31 10:11:39.460] [info] Running: "basecaller" "--batchsize" "96" "--device" "cuda:all" "sup" "pod5/" "--modified-bases" "4mC_5mC" "6mA"
[2024-10-31 10:11:39.501] [info]  - downloading [email protected] with httplib
[2024-10-31 10:11:42.195] [info]  - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-10-31 10:11:42.442] [info]  - downloading [email protected]_6mA@v2 with httplib
[2024-10-31 10:11:42.684] [info] > Creating basecall pipeline
[2024-10-31 10:11:43.930] [info] cuda:0 using chunk size 12288, batch size 96
[2024-10-31 10:11:43.930] [info] cuda:1 using chunk size 12288, batch size 96
[2024-10-31 10:11:44.302] [info] cuda:0 using chunk size 6144, batch size 96
[2024-10-31 10:11:44.323] [info] cuda:1 using chunk size 6144, batch size 96
terminate called after throwing an instance of 'c10::CuDNNError'
  what():  cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Exception raised from _cudnn_rnn at /pytorch/pyold/aten/src/ATen/native/cudnn/RNN.cpp:1090 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7b845a4389b7 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: <unknown function> + 0x3f2256a (0x7b845392256a in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: <unknown function> + 0xa6443db (0x7b845a0443db in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0xa66717f (0x7b845a06717f in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0x513b959 (0x7b8454b3b959 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: at::_ops::_cudnn_rnn::call(at::Tensor const&, c10::ArrayRef<at::Tensor>, long, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, c10::SymInt, c10::SymInt, long, bool, double, bool, bool, c10::ArrayRef<c10::SymInt>, c10::optional<at::Tensor> const&) + 0x3ba (0x7b8454aa1cda in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: <unknown function> + 0x89f13b6 (0x7b84583f13b6 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0x89e9b7d (0x7b84583e9b7d in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: <unknown function> + 0x89ea334 (0x7b84583ea334 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: at::native::lstm(at::Tensor const&, c10::ArrayRef<at::Tensor>, c10::ArrayRef<at::Tensor>, bool, long, double, bool, bool, bool) + 0x2b2 (0x7b84544dbd32 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: <unknown function> + 0x5a5b39d (0x7b845545b39d in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: at::_ops::lstm_input::call(at::Tensor const&, c10::ArrayRef<at::Tensor>, c10::ArrayRef<at::Tensor>, bool, long, double, bool, bool, bool) + 0x265 (0x7b8454c7c295 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #12: torch::nn::LSTMImpl::forward_helper(at::Tensor const&, at::Tensor const&, at::Tensor const&, long, c10::optional<std::tuple<at::Tensor, at::Tensor> >) + 0x6dc (0x7b8457aaec9c in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #13: torch::nn::LSTMImpl::forward(at::Tensor const&, c10::optional<std::tuple<at::Tensor, at::Tensor> >) + 0xbc (0x7b8457aaee4c in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #14: dorado() [0xacfd00]
frame #15: dorado() [0xad4398]
frame #16: dorado() [0xac0370]
frame #17: dorado() [0xac04d8]
frame #18: dorado() [0xabc963]
frame #19: <unknown function> + 0x1196e380 (0x7b846136e380 in /data/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #20: <unknown function> + 0x9ca94 (0x7b844dc9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: <unknown function> + 0x129c3c (0x7b844dd29c3c in /lib/x86_64-linux-gnu/libc.so.6)

dorado-WS5_mods.sh: line 5:  4852 Aborted                 (core dumped) CUDA_LAUNCH_BLOCKING=1 dorado basecaller --batchsize 96 --device "cuda:all" sup pod5/ --modified-bases 4mC_5mC 6mA > modcalls.bam


Kirk3gaard commented Nov 7, 2024

Hmm, this issue is weird. Could it be that something within dorado handles temporary issues at startup poorly? I ran the same sbatch script with the same data on our 2x A10 node twice.

The first time it crashed and the second time it completed perfectly fine. I have managed to run the same script with different datasets (barcodes from the same run) a few times; occasionally I get an error, but it often runs smoothly when rerun.

[2024-11-05 13:24:00.927] [info] Running: "basecaller" "--device" "cuda:all" "sup" "/projects/MicroBench/data/pod5/PAW78174_barcode01/" "--modified-bases" "4mC_5mC" "6mA"
[2024-11-05 13:24:02.024] [info]  - downloading [email protected] with httplib
[2024-11-05 13:24:07.190] [info]  - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-11-05 13:24:07.991] [info]  - downloading [email protected]_6mA@v2 with httplib
[2024-11-05 13:24:08.830] [info] > Creating basecall pipeline
[2024-11-05 13:24:15.823] [info] Calculating optimized batch size for GPU "NVIDIA A10" and model /projects/MicroBench/data/.temp_dorado_model-7649e24751e89ca3/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-11-05 13:24:16.049] [info] Calculating optimized batch size for GPU "NVIDIA A10" and model /projects/MicroBench/data/.temp_dorado_model-7649e24751e89ca3/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-11-05 13:24:26.582] [info] cuda:1 using chunk size 12288, batch size 160
[2024-11-05 13:24:27.496] [info] cuda:0 using chunk size 12288, batch size 224
[2024-11-05 13:24:27.673] [info] cuda:1 using chunk size 6144, batch size 256
[2024-11-05 13:24:28.976] [info] cuda:0 using chunk size 6144, batch size 288
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
Exception raised from createCublasHandle at /pytorch/pyold/aten/src/ATen/cuda/CublasHandlePool.cpp:18 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9d0e63a9b7 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f9d07bbf115 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: <unknown function> + 0xa90879b (0x7f9d0e50a79b in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: at::cuda::getCurrentCUDABlasHandle() + 0x881 (0x7f9d0e50bfd1 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0xa903be4 (0x7f9d0e505be4 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: <unknown function> + 0xa90dbf8 (0x7f9d0e50fbf8 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: <unknown function> + 0xa915102 (0x7f9d0e517102 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0xa617dd4 (0x7f9d0e219dd4 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: <unknown function> + 0xa617e6d (0x7f9d0e219e6d in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: at::_ops::addmm::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0x1a1 (0x7f9d08cf0951 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: torch::nn::LinearImpl::forward(at::Tensor const&) + 0xa3 (0x7f9d0bc68f33 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/dorado() [0xad007a]
frame #12: /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/dorado() [0xad4398]
frame #13: /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/dorado() [0xac0370]
frame #14: /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/dorado() [0xac04d8]
frame #15: /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/dorado() [0xabc963]
frame #16: <unknown function> + 0x1196e380 (0x7f9d15570380 in /home/bio.aau.dk/ur36rv/software/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #17: <unknown function> + 0x94ac3 (0x7f9d02494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #18: <unknown function> + 0x126850 (0x7f9d02526850 in /lib/x86_64-linux-gnu/libc.so.6)

/var/spool/slurm/d/job1141737/slurm_script: line 35: 674202 Aborted                 (core dumped) $DORADO basecaller --device "cuda:all" $BASECALLINGMODEL $INPUTDIR --modified-bases $MODS > $OUTPUTFILE.mod.bam
[2024-11-05 13:31:02.841] [info] Running: "basecaller" "--device" "cuda:all" "sup" "/projects/MicroBench/data/pod5/PAW78174_barcode01/" "--modified-bases" "4mC_5mC" "6mA"
[2024-11-05 13:31:03.224] [info]  - downloading [email protected] with httplib
[2024-11-05 13:31:09.031] [info]  - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-11-05 13:31:09.707] [info]  - downloading [email protected]_6mA@v2 with httplib
[2024-11-05 13:31:10.074] [info] > Creating basecall pipeline
[2024-11-05 13:31:13.366] [info] Calculating optimized batch size for GPU "NVIDIA A10" and model /projects/MicroBench/data/.temp_dorado_model-9b35920d625d03d9/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-11-05 13:31:13.407] [info] Calculating optimized batch size for GPU "NVIDIA A10" and model /projects/MicroBench/data/.temp_dorado_model-9b35920d625d03d9/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-11-05 13:31:21.970] [info] cuda:0 using chunk size 12288, batch size 224
[2024-11-05 13:31:21.970] [info] cuda:1 using chunk size 12288, batch size 224
[2024-11-05 13:31:23.429] [info] cuda:0 using chunk size 6144, batch size 352
[2024-11-05 13:31:23.469] [info] cuda:1 using chunk size 6144, batch size 448
[2024-11-05 15:05:20.751] [info] > Simplex reads basecalled: 62890
[2024-11-05 15:05:20.798] [info] > Basecalled @ Samples/s: 1.744681e+06
[2024-11-05 15:05:21.088] [info] > Finished
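
Given that a plain rerun of the same script succeeded, a hypothetical stop-gap (not an official workaround) would be to retry the basecalling command a few times from the wrapper script:

# Hypothetical retry loop: rerun dorado up to 3 times, since the crash appears intermittent
# and a plain rerun often succeeds
for attempt in 1 2 3; do
    dorado basecaller --device "cuda:all" sup pod5/ --modified-bases 4mC_5mC 6mA > modcalls.bam && break
    echo "dorado attempt ${attempt} failed, retrying..." >&2
done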

caspargross commented

@Kirk3gaard Did you find a solution to this problem, seeing that you closed the issue?
