revisiting MAGsearch/searching all the SRA, and manifests of manifests #1685
Comments
updated location of mom db and zips to |
looks like ~1 hour to search all the SRA metagenomes (~500,000?) with a single genome, using the solution in #1664 with 16 threads. a few notes -
all in all quite pleased :)
34 minutes, 16 threads.
64 threads => 20 minutes.
probably hitting disk contention at this point!
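A quick back-of-the-envelope check on those two timings, as a minimal sketch: it only compares the 16-thread and 64-thread runs reported above, and the function name `parallel_scaling` is made up for illustration.

```python
def parallel_scaling(base_minutes, base_threads, new_minutes, new_threads):
    """Return (speedup, relative efficiency) of the new run vs. the base run."""
    speedup = base_minutes / new_minutes
    thread_ratio = new_threads / base_threads
    # Efficiency of 1.0 would mean the extra threads were used perfectly.
    efficiency = speedup / thread_ratio
    return speedup, efficiency


# Timings from the comments above: 34 min @ 16 threads, 20 min @ 64 threads.
speedup, efficiency = parallel_scaling(34, 16, 20, 64)
print(f"speedup: {speedup:.2f}x for {64 // 16}x the threads")
print(f"relative efficiency: {efficiency:.1%}")
```

Getting well under half of the ideal speedup from 4x the threads is consistent with the search being limited by something other than CPU, such as disk I/O, though these numbers alone don't prove it.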
luiz comments on gitter:
|
Note, the output of 2021-sourmash-greymake is several thousand CSV files that are the output of
|
most of this functionality is now available natively in sourmash (or will be, with sourmash v4.4.0), which includes |
So, I did this thing, based on the manifests-of-manifests work #1652 and parallelizing SRA search code #1664.
convert wort-sra over to zipfiles by throwing away 90% of the data
First, for simplicity and speed of experimentation, I converted all of the digested SRA datasets to scaled=10k / k=21 using the following Snakefile.
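The sketch below is not the Snakefile; it only illustrates why going to scaled=10k "throws away 90% of the data". It assumes the original sketches were computed at scaled=1000 (an inference from the 90% figure, not stated here) and models a scaled MinHash as the set of 64-bit hashes falling below roughly 2^64 / scaled. Random integers stand in for real k-mer hashes, and the `downsample` helper is hypothetical, not sourmash's API.

```python
import random

HASH_SPACE = 2**64  # size of the 64-bit hash space


def downsample(hashes, scaled):
    """Keep only the hashes below the cutoff implied by `scaled`."""
    cutoff = HASH_SPACE // scaled
    return {h for h in hashes if h < cutoff}


rng = random.Random(42)
# Stand-in for the hashed k-mers of one dataset (random 64-bit values).
all_hashes = (rng.getrandbits(64) for _ in range(1_000_000))

sketch_1k = downsample(all_hashes, scaled=1000)      # assumed starting point
sketch_10k = downsample(sketch_1k, scaled=10_000)    # the downsampled version

print(f"scaled=1000:  {len(sketch_1k)} hashes")
print(f"scaled=10000: {len(sketch_10k)} hashes")
print(f"fraction kept: {len(sketch_10k) / len(sketch_1k):.2f}")
```

Because the scaled=10k cutoff is lower than the scaled=1000 one, the smaller sketch is always a subset of the larger one, so downsampling needs only the existing signatures and not the raw reads.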
Important note: I'd already used `split` to break the list of all 2.4m SRA signature files in `farm:/group/ctbrowngrp/irber/data/wort-data/wort-sra/sigs` into 2412 files containing 1000 signature files each, with names matching `x??` and `x????`.
This snakefile took about two weeks to run and produced 2412 zipfiles named `x*.k21.10k.zip`, each containing 1000 signatures. (These currently reside in `farm:~ctbrown/2021-sourmash-mom/listings.wort-sra/`.)
Not so incidentally, these zipfile collections contain manifests.
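For anyone without `split` handy, the chunking step can be approximated in a few lines of Python. This is a minimal sketch: the `chunked` helper and the file paths are made up, and it does not reproduce `split`'s `x??` naming.

```python
def chunked(items, size=1000):
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical listing of signature paths; the real list had ~2.4m entries.
sig_paths = [f"sigs/SRR{i:07d}.sig" for i in range(2500)]

chunks = list(chunked(sig_paths, size=1000))
print(len(chunks), "chunks; sizes:", [len(c) for c in chunks])
```

Each chunk then becomes the input list for one zipfile, which keeps every job small enough to rerun on its own if it fails.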
indexing files into manifest-of-manifests
I then used the same procedure as in #1652 to build a sqlite3 database containing the aggregated manifests across all the things.
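Roughly, the idea is to pull the manifest out of each zipfile and append its rows to one table, remembering which zipfile each row came from. The sketch below is an assumption-laden illustration, not the procedure from #1652: the table schema and the `build_mom_db` / `read_manifest_rows` helpers are hypothetical, the demo manifests contain only a handful of columns, and the accessions are made up. It does rely on sourmash zip collections storing their manifest as `SOURMASH-MANIFEST.csv` with a leading `#` version line.

```python
import csv
import io
import sqlite3
import tempfile
import zipfile
from pathlib import Path

MANIFEST_NAME = "SOURMASH-MANIFEST.csv"  # manifest entry inside a zip collection


def read_manifest_rows(zip_path):
    """Yield manifest rows (as dicts) from one zipfile collection."""
    with zipfile.ZipFile(zip_path) as zf, zf.open(MANIFEST_NAME) as fp:
        text = io.TextIOWrapper(fp, encoding="utf-8", newline="")
        # Skip the leading '# SOURMASH-MANIFEST-VERSION' comment line.
        rows = (line for line in text if not line.startswith("#"))
        yield from csv.DictReader(rows)


def build_mom_db(zip_paths, db_path=":memory:"):
    """Aggregate per-zipfile manifests into one sqlite3 table (hypothetical schema)."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS manifest ("
        "container TEXT, internal_location TEXT, md5 TEXT, "
        "ksize INTEGER, scaled INTEGER, name TEXT)"
    )
    for zip_path in zip_paths:
        db.executemany(
            "INSERT INTO manifest VALUES (?, ?, ?, ?, ?, ?)",
            [
                (str(zip_path), r["internal_location"], r["md5"],
                 int(r["ksize"]), int(r["scaled"]), r["name"])
                for r in read_manifest_rows(zip_path)
            ],
        )
    db.commit()
    return db


# Demo: two tiny made-up collections with reduced manifests.
with tempfile.TemporaryDirectory() as tmp:
    header = "internal_location,md5,ksize,scaled,name\n"
    zips = []
    for i, names in enumerate([["SRR0000001", "SRR0000002"], ["SRR0000003"]]):
        body = "".join(
            f"signatures/{n}.sig.gz,{n.lower()}-md5,21,10000,{n}\n" for n in names
        )
        path = Path(tmp) / f"demo{i}.k21.10k.zip"
        with zipfile.ZipFile(path, "w") as zf:
            zf.writestr(
                MANIFEST_NAME, "# SOURMASH-MANIFEST-VERSION: 1.0\n" + header + body
            )
        zips.append(path)
    db = build_mom_db(zips)

total = db.execute("SELECT COUNT(*) FROM manifest").fetchone()[0]
print(total, "signatures indexed across", len(zips), "zipfiles")
```

The point of the `container` column is that a query can go from a signature name straight to the zipfile holding it, without opening the other zipfiles.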
last, cross-correlate SRA sigs with SRA source listing
Finally, I used `mom-extract-sigs` with picklists, just as in this comment, to quickly discover which SRA accessions were missing from our collection.
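Stripped of the tooling, the cross-correlation is a set difference between the accessions in the source listing and the names found in the collection. The sketch below is a stand-in for that logic only; it does not call `mom-extract-sigs` or use sourmash picklists, and the function name and accessions are made up.

```python
def missing_accessions(source_listing, collected_names):
    """Return accessions from the listing that are absent from the collection."""
    collected = set(collected_names)  # set lookup keeps this fast for millions of names
    return [acc for acc in source_listing if acc not in collected]


# Hypothetical data: what the SRA listing says vs. what we actually have.
sra_listing = ["SRR0000001", "SRR0000002", "SRR0000003", "SRR0000004"]
in_collection = ["SRR0000003", "SRR0000001"]

missing = missing_accessions(sra_listing, in_collection)
print(missing)
```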