revisiting MAGsearch/searching all the SRA, and manifests of manifests #1685

Closed
ctb opened this issue Jul 24, 2021 · 7 comments
Labels
magsearch MAGsearch - search all the things

Comments


ctb commented Jul 24, 2021

So, I did this thing, based on the manifests-of-manifests work #1652 and parallelizing SRA search code #1664.

convert wort-sra over to zipfiles by throwing away 90% of the data

First, for simplicity and speed of experimentation, I converted all of the digested SRA datasets to scaled=10k / k=21 using the following Snakefile.

Important note: I'd already used split to break the list of all 2.4m SRA signature files in farm:/group/ctbrowngrp/irber/data/wort-data/wort-sra/sigs into 2412 list files naming up to 1000 signature files each, with names matching x?? and x????.

This Snakefile took about two weeks to run and produced 2412 zipfiles named x*.k21.10k.zip, each containing up to 1000 signatures. (These currently reside in farm:~ctbrown/2021-sourmash-mom/listings.wort-sra/.)

Not so incidentally, these zipfile collections contain manifests.

# convert to scaled=10k/k21 zip files
import glob

filelists = glob.glob("x??")
filelists += glob.glob("x????")

print(f"loaded {len(filelists)}")


rule all:
    input:
        expand("{f}.k21.10k.zip", f=filelists)

rule do:
    input: "{f}"
    output: "{f}.k21.10k.zip"
    shell: """
        sourmash sig downsample --from-file {input} -k 21 --scaled=10000 \
            -o {output} -f
    """
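As a rough illustration of the chunking step described above (not the actual `split` invocation, which isn't shown here), the following sketch breaks a list of paths into groups of at most 1000 and checks the resulting counts. The total of 2,411,771 is taken from the row count reported later in this thread; treating it as the number of signature files is an assumption.

```python
import math


def chunked(items, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def count_chunks(n_items, chunk_size):
    """Number of list files needed when each holds at most chunk_size entries."""
    return math.ceil(n_items / chunk_size)


# Assumption: one manifest row per signature file, so 2,411,771 files total.
n_sigs = 2_411_771
print(count_chunks(n_sigs, 1000))  # 2412 list files
print(n_sigs % 1000)               # 771 entries left over for the last file
```

So the last list file (and zipfile) would hold fewer than 1000 entries under that assumption.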

indexing files into manifest-of-manifests

I then used the same procedure as in #1652 to build a sqlite3 database containing the aggregated manifests across all the things.

./mom-create.py -o wort-sra.zips.db listings.wort-sra
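The source of mom-create.py isn't reproduced in this thread, so the following is only a minimal sketch of the general idea: read the manifest out of each zipfile and append its rows to one sqlite3 table. The manifest filename, the column names, and the table layout are all assumptions made for illustration, and `read_manifest_rows` / `build_mom` are hypothetical helper names.

```python
import csv
import io
import sqlite3
import zipfile

# Assumption: each zip collection stores its manifest under this name.
MANIFEST_NAME = "SOURMASH-MANIFEST.csv"


def read_manifest_rows(zip_path):
    """Yield manifest rows (as dicts) from one zipfile collection."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(MANIFEST_NAME) as fp:
            text = io.TextIOWrapper(fp, encoding="utf-8", newline="")
            # Skip leading comment lines such as a version header.
            lines = (line for line in text if not line.startswith("#"))
            yield from csv.DictReader(lines)


def build_mom(db_path, zip_paths):
    """Aggregate per-zip manifest rows into a single sqlite table."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS manifest "
        "(container TEXT, name TEXT, ksize INTEGER, scaled INTEGER)"
    )
    for zip_path in zip_paths:
        rows = [
            (str(zip_path), r["name"], int(r["ksize"]), int(r["scaled"]))
            for r in read_manifest_rows(zip_path)
        ]
        db.executemany("INSERT INTO manifest VALUES (?, ?, ?, ?)", rows)
    db.commit()
    return db
```

Keeping the container path in every row is what lets a later query say which zipfile to open for a given signature without touching the other zipfiles.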

last, cross-correlate SRA sigs with SRA source listing

Finally, I used mom-extract-sigs with picklists, just as in this comment, to quickly discover which SRA accessions were missing from our collection.

./mom-extract-sigs.py --picklist /group/ctbrowngrp/irber/sra_search/inputs/metagenomes_source-20210416.csv:Run:ident wort-sra.zips.db
NOTE: no ksize/moltype selector given. Are you sure?
picking column 'Run' of type 'ident' from '/group/ctbrowngrp/irber/sra_search/inputs/metagenomes_source-20210416.csv'
loaded 657942 distinct values into picklist.
Loading MoM sqlite database wort-sra.zips.db...
wort-sra.zips.db contains 2411771 rows total. Running select......
...545160 matches remaining for 'wort-sra.zips.db' (20.5s)
---
loaded 545160 rows total from 1 databases.
for given picklist, found 545160 matches to 657942 distinct values
WARNING: 112782 missing picklist values.
There are 545160 distinct rows across all MoMs.
No output options; exiting.
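The picklist comparison in the log above boils down to a set difference. Here is a small sketch of that logic, assuming that an 'ident' match compares the picklist value against the first space-separated token of each signature name. The helper names, the `ScientificName` column, and the accession `SRR0000001` are made up for the demo.

```python
import csv
import io


def load_picklist(fp, column):
    """Collect the distinct non-empty values of one CSV column."""
    return {row[column] for row in csv.DictReader(fp) if row[column]}


def ident_of(name):
    """Assumed 'ident' rule: the first space-separated token of the name."""
    return name.split(" ")[0]


def find_missing(picklist_values, signature_names):
    """Return picklist values with no matching signature identifier."""
    found = {ident_of(name) for name in signature_names}
    return picklist_values - found


# Hypothetical three-row source listing; SRR0000001 is invented.
demo_csv = io.StringIO(
    "Run,ScientificName\n"
    "SRR1159037,metagenome\n"
    "ERR3593201,metagenome\n"
    "SRR0000001,metagenome\n"
)
wanted = load_picklist(demo_csv, "Run")
missing = find_missing(wanted, ["SRR1159037", "ERR3593201 some description"])
print(sorted(missing))  # ['SRR0000001']
```

The numbers in the log are consistent with this reading: 657942 distinct picklist values minus 545160 matches leaves 112782 missing.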

ctb commented Jul 24, 2021

updated location: the mom db and zips now live in /home/ctbrown/2021-sourmash-mom/wort-sra.zips, as index.db and *.zip.


ctb commented Jul 24, 2021

looks like ~1 hour to search all the SRA metagenomes (~500,000?) with a single genome, using the solution in #1664 with 16 threads.

a few notes -

  • this is a single genome, at k=21, with scaled=10000
  • probably a lot of the speedup comes from using scaled=10000, which is much larger (5x? 10x?) than the original scaled value used by MAGsearch, so we're losing a lot of resolution.
  • another big speedup probably comes from loading only the signatures we want: because we're using k=21 on zipfile collections, the k=31 and k=51 signatures don't get loaded, unlike with the .sig files.
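To make the scaled tradeoff in the notes above concrete, here is a toy sketch of scaled downsampling: keep only hashes below a cutoff of roughly 2^64 / scaled. The exact cutoff rule is an assumption, and scaled=1000 is used purely as an illustrative comparison value, not as a statement of what MAGsearch ran with.

```python
import random

HASH_SPACE = 2**64


def max_hash_for_scaled(scaled):
    """Assumed cutoff: keep hashes in the lowest 1/scaled of the hash space."""
    return HASH_SPACE // scaled


def downsample(hashes, scaled):
    """Return the subset of hashes that survives at the given scaled value."""
    cutoff = max_hash_for_scaled(scaled)
    return {h for h in hashes if h < cutoff}


# Random stand-ins for the k-mer hashes of one dataset.
rng = random.Random(42)
all_hashes = {rng.randrange(HASH_SPACE) for _ in range(1_000_000)}

sketch_1k = downsample(all_hashes, 1000)
sketch_10k = downsample(sketch_1k, 10000)

# The scaled=10000 sketch is a subset of the scaled=1000 one and roughly
# a tenth of its size, so there is less to load and compare per dataset.
print(len(sketch_1k), len(sketch_10k))
```

Fewer hashes per sketch means faster search, but also fewer chances to detect small overlaps, which is the resolution loss mentioned above.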

all in all quite pleased :)


ctb commented Jul 24, 2021

34 minutes, 16 threads.


ctb commented Jul 24, 2021

64 threads => 20 minutes.

        Command being timed: "snakemake -j 64"
        User time (seconds): 29521.32
        System time (seconds): 1021.04
        Percent of CPU this job got: 2591%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 19:38.62
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1722940
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 450
        Minor (reclaiming a frame) page faults: 403399023
        Voluntary context switches: 3814730
        Involuntary context switches: 4019656
        Swaps: 0
        File system inputs: 598142080
        File system outputs: 80696
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

probably hitting disk contention at this point!
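A quick back-of-the-envelope check on the timing output above, using only the numbers it reports: dividing total CPU time by wall-clock time gives the average number of busy cores, which can then be compared to the 64 jobs requested. `parse_elapsed` is a helper written just for this sketch.

```python
def parse_elapsed(text):
    """Convert 'h:mm:ss' or 'm:ss' (fractional seconds allowed) to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Values copied from the timing output above.
user_s = 29521.32
system_s = 1021.04
elapsed_s = parse_elapsed("19:38.62")
threads = 64

busy_cores = (user_s + system_s) / elapsed_s
efficiency = busy_cores / threads
print(f"{busy_cores:.1f} cores busy on average, {efficiency:.0%} of {threads} threads")
# prints "25.9 cores busy on average, 40% of 64 threads"
```

That matches the reported 2591% CPU, and the gap between about 26 busy cores and 64 jobs is consistent with the jobs waiting on something other than CPU, such as disk.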


ctb commented Jul 24, 2021

luiz comments on gitter:

timing-wise I guess it mostly comes from scaled=10000, it's a similar ballpark to the Rust one if you multiply the times by 10


ctb commented Jul 25, 2021

Note: 2021-sourmash-greymake produces several thousand CSV files, each the output of sourmash search. The similarity column below is actually containment in this case, because I ran with sourmash search --containment.

% csvtk cut -f similarity,name,query_name xzcpt.k21.10k.search.csv | csvtk pretty
similarity            name         query_name
-------------------   ----------   -------------------------------------------------------------------
0.14257143368776093   SRR1159037   GCF_000742135.1 Klebsiella pneumoniae strain=ATCC 13883, ASM74213v1
0.11332351118730499   ERR3593201   GCF_000742135.1 Klebsiella pneumoniae strain=ATCC 13883, ASM74213v1
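For working with many such CSV files without csvtk, a small Python sketch along these lines could pull out the matches above a cutoff. It relies only on the three column names shown above; the 0.12 threshold and the `top_matches` helper are arbitrary choices for the demo.

```python
import csv
import io


def top_matches(fp, threshold):
    """Return (containment, name, query_name) rows at or above threshold."""
    rows = []
    for row in csv.DictReader(fp):
        # With --containment, the 'similarity' column holds containment.
        containment = float(row["similarity"])
        if containment >= threshold:
            rows.append((containment, row["name"], row["query_name"]))
    return sorted(rows, reverse=True)


# The two rows shown above, written back out as CSV text.
demo = io.StringIO(
    "similarity,name,query_name\n"
    '0.14257143368776093,SRR1159037,"GCF_000742135.1 Klebsiella pneumoniae strain=ATCC 13883, ASM74213v1"\n'
    '0.11332351118730499,ERR3593201,"GCF_000742135.1 Klebsiella pneumoniae strain=ATCC 13883, ASM74213v1"\n'
)
for containment, name, query in top_matches(demo, threshold=0.12):
    print(f"{containment:.3f}  {name}")
# prints "0.143  SRR1159037"
```

To cover all the per-chunk outputs, the same function could be called once per file and the results concatenated before sorting.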


ctb commented May 1, 2022

most of this functionality is now available natively in sourmash (or will be, with sourmash v4.4.0), which includes StandaloneManifestIndex, `sig check`, and SQLite manifests #1808.

ctb closed this as completed May 1, 2022