memory (RAM) usage for dorado correct #1137

Open
stephrom opened this issue Nov 18, 2024 · 2 comments
Labels
read_correction Read error correction

Comments


stephrom commented Nov 18, 2024

Issue Report

Please describe the issue:

I am trying to use dorado correct, but have had no success due to RAM shortage. With the available resources (see below), the process starts and delivers 436 MB of output before being OOM-killed:
289G Nov 14 14:03 CornBorer_reads.fastq
684M Nov 15 04:05 CornBorer_reads.fastq.fai
436M Nov 18 13:49 corrected_CornBorer_reads.fastq
The input (289 GB) is genomic data.

Steps to reproduce the issue:

#SBATCH -c 24
#SBATCH --gres=gpu:2
#SBATCH --time=72:0:0
#SBATCH --mem=420G
$bin/dorado-0.8.3-linux-x64/bin/dorado correct --verbose --threads 24 --index-size 4G --batch-size 16 --device cuda:all $FQ > corrected_${FQ}
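
If RAM remains the bottleneck, recent dorado releases document splitting correction into a mapping pass and a separate inference pass, which may ease memory pressure. A minimal sketch, assuming the --to-paf/--from-paf options are available in 0.8.3:

# Pass 1: all-vs-all mapping only; overlaps are streamed to disk as PAF.
$bin/dorado-0.8.3-linux-x64/bin/dorado correct --to-paf $FQ > overlaps.paf
# Pass 2: inference only, reading the precomputed overlaps.
$bin/dorado-0.8.3-linux-x64/bin/dorado correct --from-paf overlaps.paf $FQ > corrected_${FQ}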

Run environment:

  • Dorado version: 0.8.3
  • Dorado command: dorado correct --verbose --threads 24 --index-size 4G --batch-size 16 --device cuda:all $FQ
  • Operating system: RHEL8
  • Hardware (CPUs, Memory, GPUs): 2x 24-core AMD EPYC 7413 (Milan @ 2.2 GHz); 500 GB RAM; 4x NVIDIA Ampere A100 GPUs (80 GB GPU memory)
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): fastq
  • Source data location (on device or networked drive - NFS, etc.): Network share (HDR InfiniBand)
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

[vsc40014@gligar08 CORN]$ head -n50 dorado_correct_15414825.err
[2024-11-18 09:43:11.218] [info] Running: "correct" "--verbose" "--threads" "24" "--index-size" "4G" "--batch-size" "16" "--device" "cuda:all" "CornBorer_reads.fastq"
[2024-11-18 09:43:11.550] [debug] Aligner threads 24, corrector threads 6, writer threads 1
[2024-11-18 09:43:11.561] [warning] Unknown certs location for current distribution. If you hit download issues, use the envvar `SSL_CERT_FILE` to specify the location manually.
[2024-11-18 09:43:11.564] [info]  - downloading herro-v1 with httplib
[2024-11-18 09:43:11.640] [error] Failed to download herro-v1: SSL server verification failed
[2024-11-18 09:43:11.640] [info]  - downloading herro-v1 with curl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22.3M  100 22.3M    0     0  54.8M      0 --:--:-- --:--:-- --:--:-- 54.8M
[2024-11-18 09:43:12.217] [debug] furthest_skip_header = '', furthest_skip_id = -1
[2024-11-18 09:43:12.348] [info] Using batch size 16 on device cuda:0 in inference thread 0.
[2024-11-18 09:43:12.348] [info] Using batch size 16 on device cuda:0 in inference thread 1.
[2024-11-18 09:43:12.348] [info] Using batch size 16 on device cuda:1 in inference thread 0.
[2024-11-18 09:43:12.348] [info] Using batch size 16 on device cuda:1 in inference thread 1.
[2024-11-18 09:43:12.349] [debug] Starting process thread for cuda:0!
[2024-11-18 09:43:12.349] [debug] Starting process thread for cuda:0!
[2024-11-18 09:43:12.349] [debug] Starting process thread for cuda:1!
[2024-11-18 09:43:12.350] [debug] Starting process thread for cuda:1!
[2024-11-18 09:43:12.350] [debug] Starting decode thread!
[2024-11-18 09:43:12.351] [debug] Starting decode thread!
[2024-11-18 09:43:12.351] [debug] Looking for idx CornBorer_reads.fastq.fai
[2024-11-18 09:43:12.351] [debug] Starting decode thread!
[2024-11-18 09:43:12.351] [debug] Starting decode thread!
[2024-11-18 09:43:12.352] [debug] Initialized index options.
[2024-11-18 09:43:12.352] [debug] Loading index...
[2024-11-18 09:43:12.733] [debug] Loading model on cuda:1...
[2024-11-18 09:43:12.733] [debug] Loading model on cuda:1...
[2024-11-18 09:43:12.744] [debug] Loading model on cuda:0...
[2024-11-18 09:43:12.744] [debug] Loading model on cuda:0...
[2024-11-18 09:43:12.996] [debug] Loaded model on cuda:0!
[2024-11-18 09:43:12.996] [debug] Loaded model on cuda:1!
[2024-11-18 09:43:12.997] [debug] Loaded model on cuda:1!
[2024-11-18 09:43:12.997] [debug] Loaded model on cuda:0!
[2024-11-18 09:43:54.665] [debug] Loaded index with 240571 target seqs
[2024-11-18 09:43:56.613] [debug] Loaded mm2 index.
[2024-11-18 09:43:56.614] [info] Starting
[2024-11-18 09:43:56.614] [debug] Align with index 0
[2024-11-18 09:43:57.829] [debug] Read 10000 reads
[2024-11-18 09:44:01.887] [debug] Alignments processed 10000, total m_corrected_records size 130.63971 MB
[2024-11-18 09:44:05.846] [debug] Read 20000 reads
[2024-11-18 09:44:10.193] [debug] Alignments processed 20000, total m_corrected_records size 353.4832 MB
[2024-11-18 09:44:14.394] [debug] Read 30000 reads
[2024-11-18 09:44:18.944] [debug] Alignments processed 30001, total m_corrected_records size 577.7794 MB
[2024-11-18 09:44:22.934] [debug] Read 40000 reads
[2024-11-18 09:44:27.507] [debug] Alignments processed 40001, total m_corrected_records size 796.8967 MB
[2024-11-18 09:44:31.691] [debug] Read 50000 reads
[2024-11-18 09:44:36.586] [debug] Alignments processed 50000, total m_corrected_records size 1026.6725 MB
[2024-11-18 09:44:41.385] [debug] Read 60000 reads
[2024-11-18 09:44:44.616] [debug] Alignments processed 60007, total m_corrected_records size 1217.6389 MB
[2024-11-18 09:44:47.573] [debug] Read 70000 reads
...
[2024-11-18 13:49:08.992] [debug] Alignments processed 7920000, total m_corrected_records size 165349.5 MB
[2024-11-18 13:49:14.332] [debug] Read 7930000 reads
[2024-11-18 13:49:19.343] [debug] Alignments processed 7930001, total m_corrected_records size 165589.53 MB
[2024-11-18 13:49:25.549] [debug] Read 7940000 reads
[2024-11-18 13:49:33.678] [debug] Alignments processed 7940000, total m_corrected_records size 165829.38 MB
[2024-11-18 13:49:39.586] [debug] Read 7950000 reads
[2024-11-18 13:49:45.336] [debug] Alignments processed 7950002, total m_corrected_records size 166068.16 MB
[2024-11-18 13:49:49.432] [debug] Read 7960000 reads
[2024-11-18 13:49:54.683] [debug] Alignments processed 7960003, total m_corrected_records size 166287.48 MB
[2024-11-18 13:49:59.829] [debug] Read 7970000 reads
[2024-11-18 13:50:05.495] [debug] Alignments processed 7970000, total m_corrected_records size 166535.27 MB
[2024-11-18 13:50:10.317] [debug] Read 7980000 reads
[2024-11-18 13:50:15.363] [debug] Alignments processed 7980000, total m_corrected_records size 166763.31 MB
[2024-11-18 13:50:19.987] [debug] Read 7990000 reads
[2024-11-18 13:50:25.591] [debug] Alignments processed 7990020, total m_corrected_records size 166999.25 MB
[2024-11-18 13:50:30.512] [debug] Read 8000000 reads
[2024-11-18 13:50:35.650] [debug] Alignments processed 8000000, total m_corrected_records size 167237.64 MB
[2024-11-18 13:50:40.581] [debug] Read 8010000 reads
/var/spool/slurm/slurmd/job15414825/slurm_script: line 27: 15407 Killed                  $bin/dorado-0.8.3-linux-x64/bin/dorado correct --verbose --threads 24 --index-size 4G --batch-size 16 --device cuda:all $FQ > corrected_${FQ}
slurmstepd: error: Detected 1 oom_kill event in StepId=15414825.batch. Some of the step tasks have been OOM Killed.
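
Slurm accounting can confirm how close the step came to the 420 GB limit before the kill; a quick check, assuming job accounting is enabled on the cluster:

sacct -j 15414825 --format=JobID,State,ReqMem,MaxRSS,Elapsed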
HalfPhoton added the read_correction label on Nov 19, 2024
HalfPhoton (Collaborator) commented Nov 19, 2024

Hi @stephrom,
Is your input dataset of a very high depth?

Best regards,
Rich
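
A rough depth figure is cheap to get here: the existing faidx index lists every read length in column 2, so total input bases divided by genome size gives an estimate. A sketch, with the genome size as a placeholder to substitute:

# Sum read lengths from the .fai (column 2), divide by an assumed genome size.
total=$(awk '{ sum += $2 } END { print sum }' CornBorer_reads.fastq.fai)
genome=500000000  # placeholder haploid genome size in bp; substitute the species' estimate
echo "approx. coverage: $(( total / genome ))x"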

stephrom (Author) commented Nov 19, 2024 via email
