memory (RAM) usage for dorado correct #1137

Open
stephrom opened this issue Nov 18, 2024 · 2 comments
Labels
read_correction Read error correction

Comments


stephrom commented Nov 18, 2024

Issue Report

Please describe the issue:

I am trying to use dorado correct, but have had no success due to RAM shortage. With the available resources (see below), the process starts and delivers 436 MB of output before being OOM-killed:
289G Nov 14 14:03 CornBorer_reads.fastq
684M Nov 15 04:05 CornBorer_reads.fastq.fai
436M Nov 18 13:49 corrected_CornBorer_reads.fastq
The input (289 GB) is genomic data.

Steps to reproduce the issue:

#SBATCH -c 24
#SBATCH --gres=gpu:2
#SBATCH --time=72:0:0
#SBATCH --mem=420G
$bin/dorado-0.8.3-linux-x64/bin/dorado correct --verbose --threads 24 --index-size 4G --batch-size 16 --device cuda:all $FQ > corrected_${FQ}
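
If RAM remains the bottleneck, recent dorado releases document splitting correction into a mapping pass and a separate inference pass, which may ease memory pressure. A minimal sketch, assuming the --to-paf/--from-paf options are available in 0.8.3:

# Pass 1: all-vs-all mapping only; overlaps are streamed to disk as PAF.
$bin/dorado-0.8.3-linux-x64/bin/dorado correct --to-paf $FQ > overlaps.paf
# Pass 2: inference only, reading the precomputed overlaps.
$bin/dorado-0.8.3-linux-x64/bin/dorado correct --from-paf overlaps.paf $FQ > corrected_${FQ}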

Run environment:

  • Dorado version: 0.8.3
  • Dorado command: dorado correct --verbose --threads 24 --index-size 4G --batch-size 16 --device cuda:all $FQ
  • Operating system: RHEL8
  • Hardware (CPUs, Memory, GPUs): 2x 24-core AMD EPYC 7413 (Milan @ 2.2 GHz); 500 GB RAM; 4x NVIDIA Ampere A100 GPUs (80 GB GPU memory)
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): fastq
  • Source data location (on device or networked drive - NFS, etc.): Network share (HDR InfiniBand)
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

[vsc40014@gligar08 CORN]$ head -n50 dorado_correct_15414825.err
[2024-11-18 09:43:11.218] [info] Running: "correct" "--verbose" "--threads" "24" "--index-size" "4G" "--batch-size" "16" "--device" "cuda:all" "CornBorer_reads.fastq"
[2024-11-18 09:43:11.550] [debug] Aligner threads 24, corrector threads 6, writer threads 1
[2024-11-18 09:43:11.561] [warning] Unknown certs location for current distribution. If you hit download issues, use the envvar `SSL_CERT_FILE` to specify the location manually.
[2024-11-18 09:43:11.564] [info]  - downloading herro-v1 with httplib
[2024-11-18 09:43:11.640] [error] Failed to download herro-v1: SSL server verification failed
[2024-11-18 09:43:11.640] [info]  - downloading herro-v1 with curl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22.3M  100 22.3M    0     0  54.8M      0 --:--:-- --:--:-- --:--:-- 54.8M
[2024-11-18 09:43:12.217] [debug] furthest_skip_header = '', furthest_skip_id = -1
[2024-11-18 09:43:12.348] [info] Using batch size 16 on device cuda:0 in inference thread 0.
[2024-11-18 09:43:12.348] [info] Using batch size 16 on device cuda:0 in inference thread 1.
[2024-11-18 09:43:12.348] [info] Using batch size 16 on device cuda:1 in inference thread 0.
[2024-11-18 09:43:12.348] [info] Using batch size 16 on device cuda:1 in inference thread 1.
[2024-11-18 09:43:12.349] [debug] Starting process thread for cuda:0!
[2024-11-18 09:43:12.349] [debug] Starting process thread for cuda:0!
[2024-11-18 09:43:12.349] [debug] Starting process thread for cuda:1!
[2024-11-18 09:43:12.350] [debug] Starting process thread for cuda:1!
[2024-11-18 09:43:12.350] [debug] Starting decode thread!
[2024-11-18 09:43:12.351] [debug] Starting decode thread!
[2024-11-18 09:43:12.351] [debug] Looking for idx CornBorer_reads.fastq.fai
[2024-11-18 09:43:12.351] [debug] Starting decode thread!
[2024-11-18 09:43:12.351] [debug] Starting decode thread!
[2024-11-18 09:43:12.352] [debug] Initialized index options.
[2024-11-18 09:43:12.352] [debug] Loading index...
[2024-11-18 09:43:12.733] [debug] Loading model on cuda:1...
[2024-11-18 09:43:12.733] [debug] Loading model on cuda:1...
[2024-11-18 09:43:12.744] [debug] Loading model on cuda:0...
[2024-11-18 09:43:12.744] [debug] Loading model on cuda:0...
[2024-11-18 09:43:12.996] [debug] Loaded model on cuda:0!
[2024-11-18 09:43:12.996] [debug] Loaded model on cuda:1!
[2024-11-18 09:43:12.997] [debug] Loaded model on cuda:1!
[2024-11-18 09:43:12.997] [debug] Loaded model on cuda:0!
[2024-11-18 09:43:54.665] [debug] Loaded index with 240571 target seqs
[2024-11-18 09:43:56.613] [debug] Loaded mm2 index.
[2024-11-18 09:43:56.614] [info] Starting
[2024-11-18 09:43:56.614] [debug] Align with index 0
[2024-11-18 09:43:57.829] [debug] Read 10000 reads
[2024-11-18 09:44:01.887] [debug] Alignments processed 10000, total m_corrected_records size 130.63971 MB
[2024-11-18 09:44:05.846] [debug] Read 20000 reads
[2024-11-18 09:44:10.193] [debug] Alignments processed 20000, total m_corrected_records size 353.4832 MB
[2024-11-18 09:44:14.394] [debug] Read 30000 reads
[2024-11-18 09:44:18.944] [debug] Alignments processed 30001, total m_corrected_records size 577.7794 MB
[2024-11-18 09:44:22.934] [debug] Read 40000 reads
[2024-11-18 09:44:27.507] [debug] Alignments processed 40001, total m_corrected_records size 796.8967 MB
[2024-11-18 09:44:31.691] [debug] Read 50000 reads
[2024-11-18 09:44:36.586] [debug] Alignments processed 50000, total m_corrected_records size 1026.6725 MB
[2024-11-18 09:44:41.385] [debug] Read 60000 reads
[2024-11-18 09:44:44.616] [debug] Alignments processed 60007, total m_corrected_records size 1217.6389 MB
[2024-11-18 09:44:47.573] [debug] Read 70000 reads
...
[2024-11-18 13:49:08.992] [debug] Alignments processed 7920000, total m_corrected_records size 165349.5 MB
[2024-11-18 13:49:14.332] [debug] Read 7930000 reads
[2024-11-18 13:49:19.343] [debug] Alignments processed 7930001, total m_corrected_records size 165589.53 MB
[2024-11-18 13:49:25.549] [debug] Read 7940000 reads
[2024-11-18 13:49:33.678] [debug] Alignments processed 7940000, total m_corrected_records size 165829.38 MB
[2024-11-18 13:49:39.586] [debug] Read 7950000 reads
[2024-11-18 13:49:45.336] [debug] Alignments processed 7950002, total m_corrected_records size 166068.16 MB
[2024-11-18 13:49:49.432] [debug] Read 7960000 reads
[2024-11-18 13:49:54.683] [debug] Alignments processed 7960003, total m_corrected_records size 166287.48 MB
[2024-11-18 13:49:59.829] [debug] Read 7970000 reads
[2024-11-18 13:50:05.495] [debug] Alignments processed 7970000, total m_corrected_records size 166535.27 MB
[2024-11-18 13:50:10.317] [debug] Read 7980000 reads
[2024-11-18 13:50:15.363] [debug] Alignments processed 7980000, total m_corrected_records size 166763.31 MB
[2024-11-18 13:50:19.987] [debug] Read 7990000 reads
[2024-11-18 13:50:25.591] [debug] Alignments processed 7990020, total m_corrected_records size 166999.25 MB
[2024-11-18 13:50:30.512] [debug] Read 8000000 reads
[2024-11-18 13:50:35.650] [debug] Alignments processed 8000000, total m_corrected_records size 167237.64 MB
[2024-11-18 13:50:40.581] [debug] Read 8010000 reads
/var/spool/slurm/slurmd/job15414825/slurm_script: line 27: 15407 Killed                  $bin/dorado-0.8.3-linux-x64/bin/dorado correct --verbose --threads 24 --index-size 4G --batch-size 16 --device cuda:all $FQ > corrected_${FQ}
slurmstepd: error: Detected 1 oom_kill event in StepId=15414825.batch. Some of the step tasks have been OOM Killed.
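
Slurm accounting can confirm how close the step came to the 420 GB limit before the kill; a quick check, assuming job accounting is enabled on the cluster:

sacct -j 15414825 --format=JobID,State,ReqMem,MaxRSS,Elapsed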
HalfPhoton added the read_correction label on Nov 19, 2024
HalfPhoton (Collaborator) commented Nov 19, 2024

Hi @stephrom,
Is your input dataset of a very high depth?

Best regards,
Rich
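
A rough depth figure is cheap to get here: the existing faidx index lists every read length in column 2, so total input bases divided by genome size gives an estimate. A sketch, with the genome size as a placeholder to substitute:

# Sum read lengths from the .fai (column 2), divide by an assumed genome size.
total=$(awk '{ sum += $2 } END { print sum }' CornBorer_reads.fastq.fai)
genome=500000000  # placeholder haploid genome size in bp; substitute the species' estimate
echo "approx. coverage: $(( total / genome ))x"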

stephrom (Author) commented Nov 19, 2024 via email
