Very large number of nodes remain in component #24

Open
jnwoodhouse opened this issue Oct 12, 2021 · 7 comments

@jnwoodhouse

I am analysing some freshwater metagenomes, and we really have no idea what to expect when it comes to plasmids (i.e. their size).

I am somewhat stupidly stuck on the "first" sample, which takes a very long time to run. What is obvious is that, as it goes through what I'm guessing is the "peeling off" process, the time taken to consider each component scales with the number of nodes in the graph.

For instance, initially it reports (quite quickly):

252 nodes remain in component
243 nodes remain in component
217 nodes remain in component
201 nodes remain in component
194 nodes remain in component
182 nodes remain in component
182 nodes remain in component

but then, much later (around 48 h in), it's displaying:

2355264 nodes remain in component
2355264 nodes remain in component
2355264 nodes remain in component
2355264 nodes remain in component
2355264 nodes remain in component
2355262 nodes remain in component
2355262 nodes remain in component
2355262 nodes remain in component

A single "peeling" process takes around 5 h and you see the number of nodes remaining often is not reduced, only 2 nodes in total.

I'm wondering: is this something that should be discarded (a super high number of nodes)? Is there a setting I'm missing that would prevent such graph components from being profiled? Or should I be getting super excited and ready to write a Nature paper??

@dpellow
Collaborator

dpellow commented Oct 12, 2021

Yes, you are correct: SCAPP processes the assembly graph one component at a time, and the processing time for each component increases with the number of nodes it contains.

There is no direct way to get around this (although you could increase the number of threads if you have the compute resources). However, more than 2 million nodes in a component sounds very large; maybe there is a way to construct a smaller graph that better assembles the sequence into fewer but longer nodes. How many reads are in the sample? What assembler did you use to assemble it? What was the value of k?
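
(For intuition, a minimal sketch of that one-component-at-a-time pattern, using networkx on a random stand-in graph; this is an illustration, not SCAPP's actual code.)

# Illustration only, not SCAPP's code: iterate over connected components
# and do the expensive work per component.
import networkx as nx

G = nx.fast_gnp_random_graph(10_000, 0.0002, seed=1)  # stand-in for an assembly graph

for comp_nodes in nx.connected_components(G):
    comp = G.subgraph(comp_nodes)
    # The per-component work (cycle finding, iterative "peeling" of nodes) is
    # typically super-linear in the component size, so a single 2.3M-node
    # component can dominate the total runtime even when thousands of small
    # components finish in seconds.
    print(f"processing a component with {comp.number_of_nodes()} nodes")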

@jnwoodhouse
Author

The sample has around 100M reads (14.6 billion bases).

We ran SPAdes in meta mode; the max k should be 77. We have over 1000 samples as part of this project and we have already done the assembly and binning for all of them. I was optimistic that we could also look through the samples for plasmids, but obviously if it takes 5+ days per sample we might have a small problem. Currently I am running on a 1.5 TB cluster with the option to use up to 96 threads.

Related to this, I wonder if you can explain how SCAPP handles memory use. I noticed, when running on a smaller system, that each process takes around 20 GB of memory, so if you have more than 10 threads running you start to run out of memory. I worked this backwards on the larger cluster and limited myself to 65 threads (65 × 20 GB = 1.3 TB) so that there would be a bit of leeway.
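
(As a back-of-the-envelope check of that thread budget; the ~20 GB per process is just the observation reported in this thread, not a documented SCAPP figure.)

# Rough thread budget from the observed per-process memory; the numbers are
# the estimates from this thread, not documented values.
total_mem_gb = 1500        # ~1.5 TB cluster node
mem_per_process_gb = 20    # observed per-process footprint
headroom_gb = 200          # slack for the OS and memory spikes

max_threads = (total_mem_gb - headroom_gb) // mem_per_process_gb
print(max_threads)         # -> 65, matching the limit chosen above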

@dpellow
Collaborator

dpellow commented Oct 14, 2021

OK, it sounds like the samples are pretty big and the metagenomes are probably diverse, so the graph will be very large. On large samples we have had success using larger values of maximum k, such as 91 or 127, to construct less fragmented assemblies with longer contigs and fewer nodes in the assembly graph (this also depends on the read length; longer reads work better with larger k). That might be something you're interested in looking into as well.

SCAPP is written in Python, and Python's multiprocessing support is admittedly not great... Indeed, some shared data structures are duplicated in each process, so memory usage will grow with the number of processes (a 200 GB server seems a bit low for working with such large metagenomes).
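
(A generic Python illustration of why memory scales with the worker count: when a large structure is part of each task, every worker ends up holding its own copy. This is not SCAPP's actual internals.)

# Generic illustration, not SCAPP's internals: a large structure handed to
# pool workers as part of each task is pickled and copied into every worker,
# so resident memory grows with the number of processes.
from multiprocessing import Pool

big_table = {i: "x" * 100 for i in range(200_000)}  # stand-in for shared data

def work(task):
    component_id, table = task
    return component_id, len(table)  # each worker holds its own copy of table

if __name__ == "__main__":
    tasks = [(i, big_table) for i in range(8)]
    with Pool(processes=4) as pool:  # 4 workers -> up to 4 extra resident copies
        for cid, size in pool.imap_unordered(work, tasks):
            print(cid, size)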

If using larger k to simplify the fragmented graph isn't enough, then I think the right thing to do here is to modify SCAPP to take a max component size parameter, with any component larger than this broken up into smaller components. You would lose plasmids that cross the boundaries between slices of the original component, but it would allow extremely large components to be processed in a reasonable amount of time, so there is a trade-off.

It would take me a bit of time to get around to implementing, testing, and releasing this sort of modification, so whether it will help you depends on how time-sensitive your plasmid assembly is.

@jnwoodhouse
Author

Hey David,

We are not particularly time-sensitive at this stage. The job is still running and has managed to peel off 14 nodes; I think I will hit my time limit soon and it will end. On that note, while I'm making demands and creating work for you (and completely lacking any knowledge of Python), maybe there could be an option to add checkpoints to the workflow so that it can continue from already generated files, e.g. plasclass.out.

I think our options for going back and redoing the assemblies with a higher k value are limited; we are relying on the charity of others. We are not strictly targeting mobile elements with our study, but we have a nice time series, so we were interested in seeing what we could pull out. Like I said, I'm stuck on the first sample, so maybe it's time to work through the other 1000+ samples and see how often this pops up as a problem. Maybe it is not so common.

@dpellow
Collaborator

dpellow commented Oct 14, 2021

OK, I will try to implement this, but it would probably only get done in ~2-3 weeks...

There are options to use the intermediate files; if you need to run again on the same input sample I would definitely recommend using them to save a lot of time: -b <bam file> instead of the read file options -r1 and -r2, and -pc <plasclass file>. At the stage your current run has reached, both of these files should be in the intermediate_files subdirectory of your output directory.

I'd guess that if the other samples are all of similar read depths from similar environments they will run into the same issue.

(Note that SPAdes also has the --restart-from option, which would let you continue running with an expanded set of k values rather than re-running all the assemblies from the beginning, if that is an option.)

@jnwoodhouse
Author

Classic, of course I missed the -pc flag. Sorry!

Regarding the SPAdes restart, unfortunately we had to remove the intermediate files to save disk space.

@dpellow
Collaborator

dpellow commented Oct 26, 2021

@jnwoodhouse I am starting to work on modifying SCAPP to first break up any extremely large graph components into smaller chunks (using the Louvain algorithm).

I don't have any assembly graphs on the same order of magnitude as the sample you are running SCAPP on. Would you be willing to share the fastg file so I could use it to tune the parameters of the algorithm and get chunks with the right size distribution?
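
(For anyone following along, a rough sketch of the kind of Louvain-based splitting being described; this is a guess at the approach, not the eventual SCAPP code, and it assumes networkx >= 2.8 for louvain_communities. MAX_NODES and the resolution value are illustrative.)

# Sketch of the proposed idea, not the eventual SCAPP implementation:
# partition any oversized component with Louvain and handle the chunks
# separately. MAX_NODES and the resolution value are illustrative guesses.
import networkx as nx
from networkx.algorithms.community import louvain_communities

MAX_NODES = 100_000  # arbitrary cutoff for "extremely large" components

def split_large_component(comp: nx.Graph):
    """Yield pieces of comp no larger than MAX_NODES (best effort)."""
    if comp.number_of_nodes() <= MAX_NODES:
        yield comp
        return
    # A higher resolution tends to give more, smaller communities; tuning this
    # on real assembly graphs is exactly what the fastg file would help with.
    for community in louvain_communities(comp, resolution=2.0, seed=0):
        yield comp.subgraph(community)

Plasmids whose cycles cross the boundary between two chunks would be lost, which is the trade-off mentioned above.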
