can we turn picklists + collections into manifests? #3048

Closed
ctb opened this issue Feb 27, 2024 · 7 comments · Fixed by #3027
Comments

@ctb
Contributor

ctb commented Feb 27, 2024

along the theme of cool ways to subset collections with manifests, @AnneliektH is converting fastgather results into manifests here. This is presumably because mgmanysearch doesn't support picklists (which fastgather output could otherwise be used for), but does support standalone manifests.

so the question du jour is: do we have a standard way to go from picklists + collections => standalone manifest?

I think sourmash sig check might do it: docs. I will check and then recommend it to annie if so :).

it looks like sourmash sig collect does not, however: docs. That was my first guess.

related:

@ctb
Contributor Author

ctb commented Feb 27, 2024

sig check seems to work for this!

Build a bunch of individual .sig.zip files:

for i in *.fa
do
   sourmash sketch dna "$i" -o "$i.sig.zip" --name-from-first
done

create a picklist for some of them in ident.list

ident
CP001472.1
CP001941.1
CP001071.1
AE000782.1
NC_003272.1

then run sig check --picklist ... -m mf.csv to create a standalone manifest of just the sketches that match the picklist:

sourmash sig check *.sig.zip --picklist ident.list:ident:ident -m mf.csv

and

sourmash sig summarize mf.csv

shows just five sketches in the standalone manifest mf.csv, which looks like this:

# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
0.fa.sig.zip,324074c7287ed934af4fd0a6a459aa30,324074c7,31,DNA,0,1000,4168,False,"CP001472.1 Acidobacterium capsulatum ATCC 51196, complete genome",0.fa
1.fa.sig.zip,c11126d0591db94cd3d1c8568499375f,c11126d0,31,DNA,0,1000,1478,False,"CP001941.1 Aciduliprofundum boonei T469, complete genome",1.fa
2.fa.sig.zip,f3a90d4e5528864a5bcc8434b0d0c3b1,f3a90d4e,31,DNA,0,1000,2701,False,"CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome",2.fa
3.fa.sig.zip,cee0a3fb7b00990a22be02b2b0a78418,cee0a3fb,31,DNA,0,1000,2143,False,"AE000782.1 Archaeoglobus fulgidus DSM 4304, complete genome",3.fa
35.fa.sig.zip,264cfdad44548ad96c4a24b6a514a877,264cfdad,31,DNA,0,1000,7295,False,"NC_003272.1 Nostoc sp. PCC 7120 DNA, complete genome",35.fa
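
For checking a manifest like this outside of sourmash, here is a minimal sketch that reads it with Python's standard `csv` module. It relies only on the layout shown above (one `# SOURMASH-MANIFEST-VERSION` line followed by ordinary CSV); the function name `read_standalone_manifest` is made up for illustration and is not part of sourmash.

```python
import csv
import io


def read_standalone_manifest(text):
    """Parse standalone-manifest text; return (version, list of row dicts).

    Hypothetical helper for illustration only, not a sourmash API.
    """
    lines = text.splitlines()
    # The first line is a comment-style version header, not CSV data.
    if not lines or not lines[0].startswith("# SOURMASH-MANIFEST-VERSION"):
        raise ValueError("missing manifest version header")
    version = lines[0].split(":", 1)[1].strip()
    # Everything after the header is regular CSV with a column-name row.
    reader = csv.DictReader(io.StringIO("\n".join(lines[1:])))
    return version, list(reader)


# Two rows copied from the manifest shown above.
sample = """# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
0.fa.sig.zip,324074c7287ed934af4fd0a6a459aa30,324074c7,31,DNA,0,1000,4168,False,"CP001472.1 Acidobacterium capsulatum ATCC 51196, complete genome",0.fa
35.fa.sig.zip,264cfdad44548ad96c4a24b6a514a877,264cfdad,31,DNA,0,1000,7295,False,"NC_003272.1 Nostoc sp. PCC 7120 DNA, complete genome",35.fa
"""

version, rows = read_standalone_manifest(sample)
# Treat the first whitespace-separated token of each name as its identifier.
idents = [row["name"].split(" ", 1)[0] for row in rows]
print(version, idents)
```

Comparing `idents` against the contents of ident.list is a quick way to confirm that the picklist selected what was intended.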

@ctb
Contributor Author

ctb commented Feb 27, 2024

@AnneliektH maybe worth giving it a try ;)

@AnneliektH

Ok, so this would work perfectly, but I think I did something wrong when creating all the signatures initially.
I have a zip file that contains all signatures in sourmash/sig_files/signatures_concat/allMAGs.zip, where all zips come from sourmash/sig_files/MAGs.
For most of them, the match_name in the fastgather output is only the MAG name, which works (e.g. "SRR8960976_MAG07"), but for some, the match_name is a whole file path to a MAG (e.g. "../atlas/MAGs/genomes/all_fasta/fastafiles/AtH2023_SRR8960918_MAG18.fasta"), and for those it does not work.
I don't know what I did wrong when building the signature files such that I get this for some and not for others.

Should I just rebuild all signatures?

@ctb
Contributor Author

ctb commented Feb 27, 2024

that's up to you 😆 - depends on what's easiest. we have sig rename to rename sketches, or you can go back and rebuild them with sourmash sketch (and provide --name this time), or use manysketch from the branchwater plugin to quickly rebuild them with nicer names, or whatever!

you can also ignore all of that and use the md5sum in the picklist, instead of the name/ident, which should also work.
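
A rough sketch of the md5 route, in Python with only the standard library: pull the unique match md5s out of a fastgather results file and write them as a one-column picklist. The column name `match_md5` is an assumption about the fastgather output and should be checked against the real header; the sample rows and the helper name are invented for illustration.

```python
import csv
import io


def md5_picklist_from_fastgather(fastgather_csv, md5_column="match_md5"):
    """Return picklist CSV text with one 'md5' column of unique values.

    Hypothetical helper; 'match_md5' is an assumed column name.
    """
    reader = csv.DictReader(io.StringIO(fastgather_csv))
    seen = []
    for row in reader:
        md5 = row[md5_column]
        if md5 not in seen:  # keep first occurrence, preserve order
            seen.append(md5)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["md5"])
    for md5 in seen:
        writer.writerow([md5])
    return out.getvalue()


# Made-up rows: placeholder md5s, one clean name and one path-style name.
sample_fastgather = (
    "query_name,match_name,match_md5\n"
    "sampleA,SRR8960976_MAG07," + "a" * 32 + "\n"
    "sampleA,../some/path/to/MAG18.fasta," + "b" * 32 + "\n"
    "sampleB,SRR8960976_MAG07," + "a" * 32 + "\n"
)

picklist_text = md5_picklist_from_fastgather(sample_fastgather)
print(picklist_text)
```

If that text is saved as, say, md5.list, the picklist argument would presumably follow the same `file:column:coltype` pattern as the ident example above, i.e. `--picklist md5.list:md5:md5`. Because md5s do not depend on how the sketches were named, the inconsistent match_name values stop mattering.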

@AnneliektH

Ok, so it's almost working.
snakefile here: https://github.com/AnneliektH/2023-swine-sra/blob/main/sourmash/Snakefile_mg

The Snakefile makes the manifests (in /group/ctbrowngrp2/scratch/annie/2023-swine-sra/sourmash/manifests/MAGs), with one manifest for each genome clustering threshold.
I am running all of this from the folder 'sourmash'; the folders described below are subfolders of that folder.

When running mgmanysearch, I get an error, which I think has to do with the file paths of the queries (i.e. the manifests). Within the manifest, the internal location is correctly stated as "sig_files/signatures_concat/MAGs2.zip".
The manifests themselves are saved in the folder manifests/MAGs/.
Now when running mgmanysearch, the error is "ValueError: Error while reading signatures from 'manifests/MAGs/sig_files/signatures_concat/MAGs2.zip"

Seems like it pastes the manifest location in front of the query location, which isn't where the files are.

@ctb
Contributor Author

ctb commented Mar 3, 2024

Seems like it pastes the manifest location in front of the query location, which isn't where the files are

Spent time figuring this all out (see #3053 for a demonstration), and I think that the mgmanysearch behavior here is correct and that sig check is creating manifests incorrectly - see breakdown & discussion here, #3008 (comment). I'm fixing the behavior over in #3054.
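
To make the mismatch concrete, here is a small Python sketch of the two path conventions, using only `posixpath`. It imitates "resolve the internal location relative to the manifest's directory" and is not sourmash's actual loading code; the manifest filename `mf.csv` and both function names are hypothetical.

```python
import posixpath


def resolve_from_manifest(manifest_path, internal_location):
    """Resolve an internal location relative to the manifest's directory."""
    manifest_dir = posixpath.dirname(manifest_path)
    return posixpath.normpath(posixpath.join(manifest_dir, internal_location))


def relpath_for_manifest(manifest_path, sig_path):
    """Compute the location to record so manifest-relative loading works."""
    manifest_dir = posixpath.dirname(manifest_path) or "."
    return posixpath.relpath(sig_path, start=manifest_dir)


# Both paths are relative to the working directory ('sourmash' above).
manifest = "manifests/MAGs/mf.csv"  # hypothetical manifest filename
sig = "sig_files/signatures_concat/MAGs2.zip"

# Recording the cwd-relative path reproduces the path from the error message.
broken = resolve_from_manifest(manifest, sig)
print(broken)

# Recording a manifest-relative path instead resolves back to the real file.
recorded = relpath_for_manifest(manifest, sig)
print(recorded)
print(resolve_from_manifest(manifest, recorded))
```

The second value is roughly what a manifest-relative internal location would need to look like for this layout. When the manifest sits in the working directory the two conventions coincide, which is why the problem only shows up once the manifest is in a subdirectory.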

@ctb
Contributor Author

ctb commented Mar 3, 2024

(once #3054 is merged, solution will be to use sourmash sig check --relpath)

ctb added a commit that referenced this issue Mar 8, 2024
…` and `sig collect` (#3054)

This PR updates `sig collect` and `sig check` so that they can produce
standalone manifests that work properly with default sourmash loading
behavior. The default behavior produces broken manifests in some
situations and is not changed, but will be deprecated in v5.

## Details

Currently, `sig collect` and `sig check` default to producing standalone
manifests with internal path locations relative to the current working
directory. This conflicts with the default `StandaloneManifest` behavior
implemented in `save_load.py` that loads path locations relative to the
manifest location. As a result, whenever the manifest was in a
subdirectory, the standalone manifests output by `sig check` and `sig
collect` were broken. The only way to make good manifests in this
situation was to use `sig collect --abspath`, but `sig check` didn't
support `--abspath`, and using absolute paths is brittle in situations
where you want to distribute manifests.

This PR adds `--relpath` to both `sig check` and `sig collect`, and adds
`--abspath` to `sig check`. It also demonstrates the bad behavior in
tests and annotates the tests appropriately.

See
#3008 (comment)
for more detailed discussion of why I think `--relpath` is the right
behavior for the future.

- [x] adds `--abspath` and `--relpath` to `sig check`, to properly
support relative paths;
- [x] adds `--relpath` to `sig collect`, to properly support relative
paths;
- [x] documents this behavior properly for creating standalone
manifests;
- [ ] create issue to change default `sig check` and `sig collect`
behavior for v4, and disable cwd behavior.

Techie TODO:
- [x] explicitly test `relpath` and `abspath` behavior in `sig check`;
- [x] explicitly test `relpath` behavior in `sig collect`
- [x] write some tests for `sig check` and `sig collect` to explore the
relative path loading issue, with all three combinations of relpath: mf
in cwd, sigs in subdir; mf in subdir, sigs in cwd; mf in subdir, sigs in
subdir.

Related issues:
* Addresses #3008
* Addresses issues in
#3048 by updating `sig
check` to support `--relpath`;
* Fixes #3053 -
`--relpath` again

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@ctb ctb closed this as completed in #3027 Mar 20, 2024
@ctb ctb closed this as completed in cfe6a96 Mar 20, 2024