make a sourmash sketch fromfile to support large scale sketching. #1671
this may actually be a real world use case for having the full genomes be available as "signatures" per #1647 |
a much simpler version is to just have the 'nomatch' file output by mom-extract-sigs be a full manifest... |
One way this could work if we enabled it in sourmash -
With picklist in |
I like Adding Backing up a bit, the functionality we need looks something like this: (1) given some input specification, one or more of -
(2) then inspect a manifest and figure out what remains to be calculated,
(3) and then go do it.
for (1), I think we can probably hack it together fairly easily.
for (2), there's some interesting interaction to work out, but it should be easy to say "you want X? X isn't in the manifest."
for (3), we need to have some way to connect identifiers to files, and to make things worse, we need to connect the same identifier to at least two different files - DNA and protein.
so for now it seems like there are several steps to work out 😄 |
Returning to this because I really would like a way to name sigs when building many at once, e.g. from a directory or a file list. I guess we could add something like Perhaps what we want is a different standard picklist format, e.g.
We could unite this with While at it, I would imagine we would also want to add a I guess something I'm implicitly stating here is that I care far less about being able to select things from a use case for ref: Building large databases from many signatures makes for slow and/or intractable snakemake DAGs. I can use file lists for sketching and zipping, but I can't think of a way to (batch) name the sigs properly via sourmash cli. Am I forgetting something obvious!? edit: #1315 could be a way to make this work |
I haven't thought about this for a while, so apologies if this is obviously wrong - but, over in #1647, I made the comment that
I wonder if we could do something where we unify this idea with selection/sketching, so that when FASTA sequences are stored in signatures or directories or zipfiles, we can calculate the MinHash sketches and output them somewhere. At first glance it seems outside the current sourmash architecture but ... separately, I worry about integrating large-batch sketch calculation into sourmash's Python implementation because it would not be parallelized (due to Python limitations). |
We could make a base class To support the use case you're thinking of, we could then have two classes that inherit from the base class -- the fasta sketchtype, and the FracMinHash sketchtype (our current For the use case I've been thinking of, the base class could be used to store info from
two thoughts:
|
This is not a response to the above yet, but it reminds me of an idle thought I had yesterday - what about a new command |
this is the exact behavior I'd want, and I would happily use it in this format. It would be an added bonus to avoid needing to completely separate |
excellent ;). |
so, I poked around a little bit with this today, and came up with the following Python syntax:

sketch_spec1 = MinHashSpec(ksizes=[31], output_type="DNA")
sketch_spec2 = MinHashSpec(ksizes=[10], output_type="protein")
source1 = SketchFromFile(name="sketch name goes here",
                         protein_filename="data/some_prot_file.faa",
                         dna_filename="data/GCF_000005845.2_ASM584v2_genomic.fna.gz")
source2 = SketchFromRecords(map_record_to_name=fn_or_mapping,
                            protein_filename="some_big_file_of_proteins.faa")
missing = check_specs_against_manifests([sketch_spec1, sketch_spec2],
                                        [source1, source2], manifests)

The idea is that you build spec objects that detail the various sketch types you're interested in building, and then cross-product them with the data sources, to create a list of the actual sketches you want built. This can then be checked against manifests of existing signatures. Questions:
Lots of additional thoughts too -
|
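A minimal sketch of the "cross-product specs with sources, then check against a manifest" idea described above. The class and function names here are hypothetical stand-ins, not the sourmash API, and the manifest is assumed to be a list of dicts with `name`, `moltype`, and `ksize` keys (with `ksize` already converted to an integer). Translating DNA into protein sketches is deliberately left out.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class MinHashSpec:
    """Hypothetical spec: which sketch types to build."""
    ksizes: tuple
    output_type: str  # "DNA" or "protein"


@dataclass(frozen=True)
class SketchSource:
    """Hypothetical source: a name plus whichever input files exist."""
    name: str
    dna_filename: Optional[str] = None
    protein_filename: Optional[str] = None

    def can_build(self, output_type):
        # DNA sketches need a DNA file; protein sketches need a protein file.
        if output_type == "DNA":
            return self.dna_filename is not None
        return self.protein_filename is not None


def wanted_sketches(specs, sources):
    """Cross-product specs with sources into (name, moltype, ksize) tuples."""
    wanted = []
    for spec, source in product(specs, sources):
        if not source.can_build(spec.output_type):
            continue
        for ksize in spec.ksizes:
            wanted.append((source.name, spec.output_type, ksize))
    return wanted


def check_specs_against_manifest(specs, sources, manifest_rows):
    """Return the wanted sketches that the manifest does not already list."""
    already_built = {(r["name"], r["moltype"], r["ksize"]) for r in manifest_rows}
    return [w for w in wanted_sketches(specs, sources) if w not in already_built]
```

Whatever comes back from `check_specs_against_manifest` is the to-do list; an empty list means nothing needs to be built.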
I like it. This is definitely the separation we need!
I think it would be ok to only enable Do we want to enable something separate like
starting with python code seems good! Not terribly hard to add file consumption after the structure is in place, right? Ultimately, for For Here's an idea for format:
...with The only thing I don't like is that if we use different formats ( |
may I suggest TOML instead of YAML? It would look something like this:

[[MinHashSpec.dna]]
ksize = [ 21, 31, 51 ]
scaled = 1_000
[[MinHashSpec.protein]]
input = [ "dna", "protein" ]
ksize = 10
scaled = [ 100, 200 ]
[[MinHashSpec.dayhoff]]
ksize = 16

Many places (including Python with pyproject.toml) are moving away from YAML because it has some confusing parsing issues (due to being underspecified). |
this issue has come back to the forefront of my brain because of dib-lab/genome-grist#130, where the construction of a private database is ...annoying because it's hard to properly name signatures in bulk. One thing that I did in dib-lab/genome-grist#130 that I kinda liked was to have the |
A specific proposal - I'm leaning towards implementing something like SEPARATELY, we would also add a So that way we'd have params cross-product sources. Finally, a separate piece of functionality would then be to enable |
ugh, running into this YET AGAIN in some work on sketching PFAM. Need/could use a standard way to check which files have not been sketched yet. 🎶 motivation 🎶 |
getting started with the construction side of the input CSV file here: https://github.com/ctb/2022-sourmash-sketchfrom |
This works and is not terribly hacky:
and it produces
and (crucially) has correctly associated the protein faa.gz with the nucleotide fna.gz, based on the accession in the filename. |
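The pairing step could look roughly like the following. This is a guess at the approach, not the code in the linked repository: the accession regular expression, the `.fna`/`.faa` test, and the `genome_filename`/`protein_filename` keys are all assumptions made for the example.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed accession pattern: GCA_/GCF_ followed by digits and a version.
ACCESSION = re.compile(r"(GC[AF]_\d+\.\d+)")


def pair_by_accession(filenames):
    """Group genome (.fna) and protein (.faa) files by shared accession."""
    paired = defaultdict(dict)
    for filename in filenames:
        basename = Path(filename).name
        match = ACCESSION.search(basename)
        if match is None:
            continue  # no accession in the filename; skip it
        ident = match.group(1)
        if ".fna" in basename:
            paired[ident]["genome_filename"] = filename
        elif ".faa" in basename:
            paired[ident]["protein_filename"] = filename
    return dict(paired)
```

An identifier whose entry has only one of the two keys is a genome without a matching proteome (or the reverse), which is worth reporting before sketching.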
as a side note - you can already do this with |
Side thought: if we have a list of FASTA DNA and protein sequences together with names, we might need to have two such files if we build different names for GTDB and NCBI taxonomies. I think a better solution will be to provide a separate mass-renaming function, perhaps via |
renaming idea in: #1883 |
progress!! NOTE: uses https://github.com/ctb/2022-sourmash-sketchfrom, requires code in #1884

round 0: build a CSV file with source genome/protein information

Starting from a directory
we construct a CSV file that automagically builds GTDB names:
The resulting file contains ident, name, and source files for sourmash sketch to use:
round 1: sketch some stuff

Now build the sketches:
results 🎉
round 2: try sketching the same stuff
no soup for you!

round 3: try sketching the same stuff and some new stuff

Add
and voila, only the new stuff is there:
|
round 4 - construct names from GTDB taxonomy:
construct names from GenBank taxonomy:
and these files can be used to build signatures with different names:
GTDB:
GenBank:
|
notes to self, collection/summary of the above comments and other thoughts
|
ok, so this is all lovely, but I think that I'm missing one of the key use cases we had in mind in the beginning: what about situations where you have a list of accessions (e.g. a listing of GenBank, or a GTDB release), and you want to build a comprehensive database, or check that you have the right set of genomes to build signatures for those accessions? |
do we want to do anything about |
latest update - bulk bulk bulk!
and indeed there is one duplicate identifier, tsk tsk 😆
|
Funny story! The original download of this file was empty - likely some download issue. I think the sketch may actually be empty in my tl;dr - this file is here because we ignore empty files when sketching, and we should definitely do something differently ;) ...and I shall go remove it now 💨 |
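Both problems mentioned here (a duplicate identifier and a zero-size download) are cheap to check for before sketching. A small sketch, assuming the input CSV has already been read into a list of dicts with an `ident` column and `genome_filename`/`protein_filename` columns; those column names and both helper functions are assumptions made for the example.

```python
import os
from collections import Counter


def find_duplicate_idents(rows):
    """Return identifiers that appear in more than one row."""
    counts = Counter(row["ident"] for row in rows)
    return sorted(ident for ident, n in counts.items() if n > 1)


def find_empty_files(rows, columns=("genome_filename", "protein_filename")):
    """Return paths that exist but have zero bytes."""
    empty = []
    for row in rows:
        for column in columns:
            path = row.get(column)
            if path and os.path.exists(path) and os.path.getsize(path) == 0:
                empty.append(path)
    return empty
```

Reporting empty files explicitly, rather than silently skipping them, makes a failed download visible instead of leaving a sketch quietly missing.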
I would like if we could support
Other |
Better output - I fixed some reporting (but forgot to fix the duplicate entry, will do that now).
|
Most of these are entries that had no *faa.gz protein files (so I ran prodigal on the genomes instead). Forgot to give you the location for the prodigal proteomes: Probably more useful, I have a file with |
my mighty script should be able to figure it all out, given only the path (and an appropriately named set of files!) I will let it loose! 🦁 |
same naming convention, just a different location. Let er rip! 🦖 |
we're down to the following errors - after eliminating zero size files, we have:
the script is getting pretty good at this :). I haven't layered on the official list of identifiers yet, I'll try that. |
key new message: for given picklist, found 258406 matches to 258406 distinct values so except for the 11 entries with only genome (and no protein) files, we could now go forth and build 🎉 |
daaaaaaaamn
took about 3 seconds to run. (It just checked to see how many to build.) |
interesting ~new challenge to consider: I'm looking at this might be an opportunity to revisit the manifest-of-manifests thing, #1685. The basic idea:
|
Confronted the horror of too many interconnected things, went for a nice hike, realized that maybe if we just could load manifest CSV files directly on the command line, we could build tooling around creating/maintaining such manifests for large wort collections without adding a whole lot of complexity to sourmash. See #1891 for code. |
wort-genbank, all sigs, in a single manifest, using #1891:
Per
|
ok, to recap:
there's some support code in https://github.com/ctb/2022-sourmash-sketchfrom that we don't need to integrate into sourmash (or may never need) -
Things we could maybe use at this point, beyond making those experimental PRs mergeable:
Things we are punting on, for now; we'll need to make issues for these:
|
finally realizing that @bluegenes knew what she was talking about all along in #1365 and now #1902, I implemented Here we are checking ~100 identifiers against a manifest for 4.7 million wort genbank sigs and pulling out the matching manifest entries:
...which can then be directly summarized, searched, loaded, etc:
(If there had been picklist values that weren't matched, they would have ended up in |
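Conceptually, the check is a set lookup of picklist identifiers against manifest rows, keeping track of which picklist values never matched. The following is an illustrative sketch only, not the sourmash implementation: `split_by_picklist` is a hypothetical helper, and it assumes the identifier is the first space-separated word of the manifest `name` column.

```python
def split_by_picklist(manifest_rows, pick_values, column="name"):
    """Match picklist identifiers against the first word of a manifest column.

    Returns (matching manifest rows, picklist values with no match).
    """
    wanted = set(pick_values)
    matched_rows = []
    seen = set()
    for row in manifest_rows:
        # Assumption: the identifier is the first word of the name.
        ident = row[column].split(" ", 1)[0]
        if ident in wanted:
            matched_rows.append(row)
            seen.add(ident)
    return matched_rows, sorted(wanted - seen)
```

The manifest rows could come from `csv.DictReader`; since the manifest is streamed once and the picklist is held in a set, the cost grows with manifest size rather than with manifest size times picklist size.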
building genbank stuff from assembly reports for archaea

Using code from #1885, download assembly_summary
construct identifier picklist
run
|
building gtdb genomic for rs207

Looks like there might be a new GTDB release coming? See rs207 directory.

download files with identifiers
see how many signatures we can retrieve from wort

then run
🎉 no missing signatures! wort rulez!

examine output manifest for ksizes etc.

Then examine the output manifest
construct new GTDB release

for k=31:
|
|
some documentation and examples for the |
renamed to sourmash-bio/database-examples! And I'm closing this issue now that #1885 has been merged. 🎉 |
ref #1652, you want to be able to sketch certain ksizes/moltypes for certain identifiers only, but on a large scale. how can we best do this?