Improve run time for fastq.gz files suggestions #392

Open
SergeWielhouwer opened this issue Dec 11, 2024 · 2 comments

Comments

@SergeWielhouwer

SergeWielhouwer commented Dec 11, 2024

Hi,

Thanks for developing NanoPlot.

I am currently running NanoPlot within a custom Snakemake ONT pipeline, but I've noticed that it tends to be one of the slower rules in my workflow when processing the raw input data.
[image: run-time plot]

Below is a rough example of the run times for three samples:
34.6 Gbp = 57 min
24.4 Gbp = 40 min
13.4 Gbp = 22 min
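
As a quick sanity check on these figures, the sketch below converts each run to a throughput. The three values come out nearly identical, which suggests run time scales linearly with yield; that would be consistent with a fixed per-base processing cost rather than a startup overhead. The function name is illustrative only.

```python
# Run times reported above: (yield in Gbp, wall-clock minutes)
runs = [(34.6, 57), (24.4, 40), (13.4, 22)]


def throughput_mbp_per_s(gbp, minutes):
    """Convert a yield in Gbp and a run time in minutes to Mbp per second."""
    return gbp * 1000 / (minutes * 60)


rates = [throughput_mbp_per_s(gbp, minutes) for gbp, minutes in runs]
for (gbp, minutes), rate in zip(runs, rates):
    print(f"{gbp} Gbp in {minutes} min: {rate:.1f} Mbp/s")
```

All three runs land at roughly 10 Mbp/s.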

I’m using NanoPlot 1.42.0 with 4 threads and running the following command on a single (merged) fastq.gz file stored on flash storage (default compression):

NanoPlot --fastq {input} -o nanoplot/raw/{wildcards.sample}/ -t {threads} 2>{log}

While I understand that providing multiple smaller fastq.gz files might help improve speed, I’m curious if NanoPlot benefits from utilising multiple threads on a single fastq.gz file, or if that’s more applicable to BAM files with multiple reference contigs.

As shown in the plot, your Rust tool chopper (with pigz decompression) processes the data in roughly a third of the time that NanoPlot requires.

Do you think NanoPlot could see performance improvements from a transition to Rust or from incorporating libraries such as Intel ISA-L (https://github.com/pycompression/python-isal)? I am just wondering what could be hindering its performance, and whether SeqIO is mainly used right now for gzip handling.
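
To make the ISA-L idea concrete, here is a minimal sketch of a FASTQ reader that uses python-isal's `igzip` module when it is installed and falls back to the standard `gzip` module otherwise. This is not NanoPlot's code: the function name is hypothetical, strict 4-line records are assumed, and the quality value is a plain arithmetic mean of Phred scores, which is a simplification and not necessarily how NanoPlot averages quality.

```python
import gzip

try:
    # python-isal provides a gzip-compatible module (assumes `pip install isal`)
    from isal import igzip as gz
except ImportError:
    gz = gzip  # fall back to the standard library


def read_lengths_and_mean_quals(path):
    """Yield (read length, arithmetic mean Phred) per 4-line FASTQ record."""
    with gz.open(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            qual = handle.readline().rstrip("\n")
            phred = [ord(char) - 33 for char in qual]  # Phred+33 encoding assumed
            mean_q = sum(phred) / len(phred) if phred else 0.0
            yield len(seq), mean_q
```

Because both modules expose the same `open` interface, swapping the decompressor needs no other changes, which makes it easy to time the two against each other on the same file.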

I’d appreciate any insights or suggestions you may have on speeding up the process.

Best regards,

Serge

@wdecoster
Owner

Hi Serge,

Yes, improvements in performance would be great. The multiple threads indeed only make a difference when multiple files are provided as input, and SeqIO performs all sequence parsing. So multiple options exist to improve this, including Rust/PyO3, or other libraries.
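
Since threads only help with multiple input files, one possible workaround is to split the merged file into several chunks and pass them together to `--fastq`. The sketch below is a hypothetical helper, not part of NanoPlot, and it assumes strict 4-line FASTQ records. Note that the split itself costs one full decompression and recompression pass, so whether it pays off would need to be measured.

```python
import gzip
from itertools import islice


def split_fastq_gz(path, n_chunks, prefix):
    """Distribute 4-line FASTQ records round-robin over n_chunks gzip files."""
    out_paths = [f"{prefix}{i}.fastq.gz" for i in range(n_chunks)]
    # Low compression level keeps the extra write pass cheap
    outs = [gzip.open(p, "wt", compresslevel=1) for p in out_paths]
    try:
        with gzip.open(path, "rt") as handle:
            index = 0
            while True:
                record = list(islice(handle, 4))  # next 4 lines = one record
                if not record:
                    break
                outs[index % n_chunks].writelines(record)
                index += 1
    finally:
        for out in outs:
            out.close()
    return out_paths
```

Round-robin assignment keeps the chunks close to equal in size, so each worker would get a similar share of reads.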

That said, I don't have much time to invest in this, but I definitely welcome any contribution. Now, it would be interesting to check the log files of your process to figure out in which steps the most time is spent, i.e., parsing input vs. plotting (or something else), to make sure effort is directed to the right bit of code to optimize.
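
Alongside the log files, a rough way to bound the input side is to time raw decompression separately from line-by-line reading of the same file. The sketch below does not call NanoPlot or SeqIO; it is a standalone measurement with a hypothetical function name, and its numbers are only a lower bound to compare against the total rule run time.

```python
import gzip
import time


def time_stages(path, chunk_size=1 << 20):
    """Return (decompress-only seconds, decompress-plus-line-reading seconds, record count)."""
    # Stage 1: decompress in binary chunks and discard the data
    start = time.perf_counter()
    with gzip.open(path, "rb") as handle:
        while handle.read(chunk_size):
            pass
    decompress_s = time.perf_counter() - start

    # Stage 2: decompress, decode to text, and iterate line by line
    start = time.perf_counter()
    n_lines = 0
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            n_lines += 1
    read_lines_s = time.perf_counter() - start

    # 4-line FASTQ records assumed
    return decompress_s, read_lines_s, n_lines // 4
```

If the first number already accounts for most of the run time, a faster decompressor is the place to look; if not, parsing or plotting is the more likely bottleneck.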

As an alternative, you could consider running cramino at the (u)bam/cram stage. Cramino also has a --arrow argument for saving read lengths and quality scores to an arrow file, which is compatible with NanoPlot and NanoComp for plots...

Best,
Wouter

@SergeWielhouwer
Author

SergeWielhouwer commented Dec 12, 2024

Hi Wouter,

Thanks for your speedy feedback.

I will consider running cramino and explore the code base and logs a bit further. Maybe I can work around the speed issue I am encountering with some minor changes, either in NanoPlot or on the command line itself :).

Best,

Serge
