RELEASE: update docs for new zipfile support! (#110)
* update docs for zip stuff

* bump version

* Update doc/README.md

Co-authored-by: Tessa Pierce Ward <[email protected]>

* update main README, cleanup doc/README

---------

Co-authored-by: Tessa Pierce Ward <[email protected]>
ctb and bluegenes authored Sep 14, 2023
1 parent 257015b commit 6e784fa
Showing 5 changed files with 44 additions and 35 deletions.
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "pyo3-branchwater"
-version = "0.7.1"
+version = "0.8.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
31 changes: 10 additions & 21 deletions README.md
@@ -46,42 +46,31 @@ pip install -e .

### 3. Download sketches.

-The following commands will download sourmash sketches for the podar genomes and
-unpack them into the directory `podar-ref/`:
+The following commands will download sourmash sketches for the podar genomes into the file `podar-ref.zip`:

```
-mkdir -p podar-ref
-curl -JLO https://osf.io/4t6cq/download
-unzip -u podar-reference-genomes-updated-sigs-2017.06.10.zip
-```
-
-### 4. Create lists of query and subject files.
-
-`multisearch` takes in lists of signatures to search, so we need to
-create those files:
-
-```
-ls -1 podar-ref/{2,47,63}.* > query-list.txt
-ls -1 podar-ref/* > podar-ref-list.txt
+curl -L https://osf.io/4t6cq/download -o podar-ref.zip
```

### 5. Execute!

-Now run `multisearch`:
+Now run `multisearch` to search all the sketches against each other:
```
-sourmash scripts multisearch query-list.txt podar-ref-list.txt -o results.csv --cores 4
+sourmash scripts multisearch podar-ref.zip podar-ref.zip -o results.csv --cores 4
```

You will (hopefully ;)) see a set of results in `results.csv`. These are comparisons of each query against all matching genomes.
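You can spot-check `results.csv` with standard CSV tooling. Here's a sketch against a hypothetical two-row stand-in file (the names and values are made up for illustration; real `multisearch` output has more columns than shown):

```shell
# Miniature stand-in for results.csv (hypothetical data).
printf 'query,match,containment\n2.fa,47.fa,0.021\n2.fa,63.fa,0.018\n' > results.csv

# Peek at the header and rows.
head -3 results.csv

# Count comparisons: all rows minus the header line (here: 2).
tail -n +2 results.csv | wc -l
```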

## Debugging help

-If your file lists are not working properly, try running:
+If your collections aren't loading properly, try running `sourmash sig summarize` on them,
+like so:

```
-sourmash sig summarize query-list.txt
-sourmash sig summarize podar-ref-list.txt
+sourmash sig summarize podar-ref.zip
```
-to make sure everything can be loaded.
+
+This will make sure everything can be loaded properly.

## Future thoughts

40 changes: 29 additions & 11 deletions doc/README.md
@@ -4,11 +4,29 @@ This repository implements five sourmash plugins, `manysketch`, `fastgather`, `f

The main *drawback* to these plugin commands is that their inputs and outputs are not as rich as the native sourmash commands. In particular, this means that input databases need to be prepared differently, and the output may be most useful as a prefilter in conjunction with regular sourmash commands.

-## Preparing the database
+## Preparing the search and query databases.

-`manysketch` requires a `fromfile` csv with columns `name,genome_filename,protein_filename`. If you don't have protein_filenames, be sure to include the trailing comma so the csv reader can process the file correctly. All four search commands use _text files containing lists of signature files_, or "fromfiles" for the search database. `multisearch`, `manysearch` and `fastmultigather` also use "fromfiles" for queries, too.
+`manysketch` requires a `fromfile` csv with columns `name,genome_filename,protein_filename`. If you don't have `protein_filename` entries, be sure to include the trailing comma so the csv reader can process the file correctly.
 
-(Yes, this plugin will eventually be upgraded to support zip files; keep an eye on [sourmash#2230](https://github.com/sourmash-bio/sourmash/pull/2230).)
+All four search/gather commands use either zip files or _text files containing lists of signature files_ ("fromfiles") for the search database. `multisearch`, `manysearch` and `fastmultigather` also use either zips or "fromfiles" for queries.
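The trailing-comma rule can be sketched with a tiny hypothetical fromfile (names and paths here are made up for illustration):

```shell
# Hypothetical manysketch fromfile: no protein_filename entries, but the
# trailing comma keeps every row at three columns for the csv reader.
cat > fromfile.csv <<EOF
name,genome_filename,protein_filename
genomeA,genomeA.fa,
genomeB,genomeB.fa,
EOF

# Verify each row parses as exactly three fields; prints "3" for each row.
awk -F, '{ print NF }' fromfile.csv
```

Without the trailing comma, the DNA-only rows would have only two fields and no longer match the three-column header.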

+### Using zip files
+
+Zip files are used in two ways, depending on how the command works.
+
+If the command loads a collection of sketches into memory at the start, then the sketches from the zip file are simply loaded into memory. So,
+* `multisearch` loads both query and database into memory;
+* `manysearch` loads the queries into memory;
+* `fastmultigather` loads the search database into memory.
+
+If the command loads a collection of sketches throughout execution, then the zip file is _unpacked_ to a temporary directory and the sketches are loaded from there. (This can consume a lot of extra disk space!) So,
+* `manysearch` loads the sketches being searched this way;
+* `fastgather` loads the database sketches this way;
+* `fastmultigather` loads the query sketches this way.
+
+Note that the temporary directory is created under the path given by the `TMPDIR` environment variable if it is set, and under `/tmp` otherwise.
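The `TMPDIR`-or-`/tmp` rule mirrors the usual shell default-expansion; a minimal sketch (the directory name is hypothetical):

```shell
# Where the unpacked sketches will land, mirroring the TMPDIR-or-/tmp rule.
TMPDIR=/scratch/fast-disk
echo "${TMPDIR:-/tmp}"    # /scratch/fast-disk

unset TMPDIR
echo "${TMPDIR:-/tmp}"    # /tmp
```

Pointing `TMPDIR` at a filesystem with plenty of free space before running these commands avoids filling up `/tmp`.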

+### Using "fromfiles"

To prepare a **signature** fromfile from a database, first you need to split the database into individual files:
```
@@ -27,7 +45,7 @@ find gtdb-reps-rs214-k21/ -name "*.sig.gz" -type f > list.gtdb-reps-rs214-k21.txt

### Running `manysketch`

-The `manysketch` command sketches one or more fastas into a zipped sourmash signature collection (`zip`).
+The `manysketch` command sketches one or more FASTA files into a zipped sourmash signature collection (`zip`).

To run `manysketch`, you need to build a text file list of fasta files, with one on each line (`fa.csv`, below). You can then run:

@@ -51,12 +69,12 @@ sourmash scripts manysketch fa.csv -o fa.zip -p k=21,k=31,k=51,scaled=1000,abund

The `multisearch` command compares one or more query genomes, and one or more subject genomes. It differs from `manysearch` by loading all genomes into memory.

-`multisearch` takes two file lists as input, and outputs a CSV:
+`multisearch` takes two input collections (zip or "fromfiles"), and outputs a CSV:
```
sourmash scripts multisearch query-list.txt podar-ref-list.txt -o results.csv
```

-To run it, you need to provide two "fromfiles" containing lists of paths to signature files (`.sig` or `.sig.gz`). If you create a fromfile as above with GTDB reps, you can generate a query fromfile like so:
+To run it, you need to provide two collections of signature files. If you create a fromfile as above with GTDB reps, you can generate a query fromfile like so:

```
head -10 list.gtdb-reps-rs214-k21.txt > list.query.txt
@@ -73,7 +91,7 @@ The results file here, `query.x.gtdb-reps.csv`, will have 8 columns: `query` and

The `fastgather` command is a much faster version of `sourmash gather`.

-`fastgather` takes a query metagenome and a file list as the database, and outputs a CSV:
+`fastgather` takes a query metagenome and an input collection (zip or "fromfile") as the database, and outputs a CSV:
```
sourmash scripts fastgather query.sig.gz podar-ref-list.txt -o results.csv --cores 4
```
@@ -105,7 +123,7 @@ A complete example Snakefile implementing the above workflow is available [in th

### Running `fastmultigather`

-`fastmultigather` takes a file list of query metagenomes and a file list for the database, and outputs many CSVs:
+`fastmultigather` takes a collection of query metagenomes and a collection of sketches as a database, and outputs many CSVs:
```
sourmash scripts fastmultigather query-list.txt podar-ref-lists.txt --cores 4
```
@@ -116,15 +134,15 @@ The main advantage that `fastmultigather` has over running `fastgather` on multi

`fastmultigather` will output two CSV files for each query, a `prefetch` file containing all overlapping matches between that query and the database, and a `gather` file containing the minimum metagenome cover for that query in the database.

-The prefetch CSV will be named `{basename}.prefetch.csv`, and the gather CSV will be named `{basename}.gather.csv`. Here, `{basename}` is the filename, stripped of its path.
+The prefetch CSV will be named `{basename}.prefetch.csv`, and the gather CSV will be named `{basename}.gather.csv`. Here, `{basename}` is the query filename, stripped of its path. If zip files are used, `{basename}` will be the query sketch's md5sum instead.
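The path-stripping described above is plain `basename` semantics; a quick sketch with a hypothetical query path (whether file extensions are also stripped isn't specified here):

```shell
# For a query sketch at sigs/SRR606249.sig.gz, the output CSVs are
# named after the file with its leading path removed.
q=sigs/SRR606249.sig.gz
b=$(basename "$q")
echo "${b}.prefetch.csv"   # SRR606249.sig.gz.prefetch.csv
echo "${b}.gather.csv"     # SRR606249.sig.gz.gather.csv
```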

**Warning:** At the moment, if two different queries have the same `{basename}`, the CSVs for one of the queries will be overwritten by the other query. The behavior here is undefined in practice, because of multithreading: we don't know what queries will be executed when or files will be written first.

### Running `manysearch`

-The `manysearch` command compares one or more query sketches, and one or more subject sketches. It is the core command we use for searching petabase-scale databases of metagenomes for contained genomes.
+The `manysearch` command compares one or more collections of query sketches, and one or more collections of subject sketches. It is the core command we use for searching petabase-scale databases of metagenomes for contained genomes.

-`manysearch` takes two file lists as input, and outputs a CSV:
+`manysearch` takes two collections as input, and outputs a CSV:
```
sourmash scripts manysearch query-list.txt podar-ref-list.txt -o results.csv
```
4 changes: 3 additions & 1 deletion src/README.md
@@ -1,3 +1,5 @@
-The Rust source is in `lib.rs`.
+# Source code guide
+
+The pyo3 Rust/Python interface code is in `lib.rs`. The top-level Rust functions called from pyo3 live in individual files named for each function, and common utility code is in `utils.rs`.

The Python source code is under `python/`, and tests under `python/tests/`.
