RELEASE: update docs for new zipfile support! (#110)
* update docs for zip stuff

* bump version

* Update doc/README.md

Co-authored-by: Tessa Pierce Ward <[email protected]>

* update main README, cleanup doc/README

---------

Co-authored-by: Tessa Pierce Ward <[email protected]>
ctb and bluegenes authored Sep 14, 2023
1 parent 257015b commit 6e784fa
Showing 5 changed files with 44 additions and 35 deletions.
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "pyo3-branchwater"
-version = "0.7.1"
+version = "0.8.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
31 changes: 10 additions & 21 deletions README.md
@@ -46,42 +46,31 @@ pip install -e .

### 3. Download sketches.

-The following commands will download sourmash sketches for the podar genomes and
-unpack them into the directory `podar-ref/`:
+The following commands will download sourmash sketches for the podar genomes into the file `podar-ref.zip`:

```
-mkdir -p podar-ref
-curl -JLO https://osf.io/4t6cq/download
-unzip -u podar-reference-genomes-updated-sigs-2017.06.10.zip
-```
-
-### 4. Create lists of query and subject files.
-
-`multisearch` takes in lists of signatures to search, so we need to
-create those files:
-
-```
-ls -1 podar-ref/{2,47,63}.* > query-list.txt
-ls -1 podar-ref/* > podar-ref-list.txt
+curl -L https://osf.io/4t6cq/download -o podar-ref.zip
```

### 5. Execute!

-Now run `multisearch`:
+Now run `multisearch` to search all the sketches against each other:
```
-sourmash scripts multisearch query-list.txt podar-ref-list.txt -o results.csv --cores 4
+sourmash scripts multisearch podar-ref.zip podar-ref.zip -o results.csv --cores 4
```

You will (hopefully ;)) see a set of results in `results.csv`. These are comparisons of each query against all matching genomes.
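You can spot-check `results.csv` with standard CSV tooling. Here's a sketch against a hypothetical two-row stand-in file (the names and values are made up for illustration; real `multisearch` output has more columns than shown):

```shell
# Miniature stand-in for results.csv (hypothetical data).
printf 'query,match,containment\n2.fa,47.fa,0.021\n2.fa,63.fa,0.018\n' > results.csv

# Peek at the header and rows.
head -3 results.csv

# Count comparisons: all rows minus the header line (here: 2).
tail -n +2 results.csv | wc -l
```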

## Debugging help

-If your file lists are not working properly, try running:
+If your collections aren't loading properly, try running `sourmash sig summarize` on them,
+like so:

```
-sourmash sig summarize query-list.txt
-sourmash sig summarize podar-ref-list.txt
+sourmash sig summarize podar-ref.zip
```
-to make sure everything can be loaded.
+
+This will make sure everything can be loaded properly.

## Future thoughts

40 changes: 29 additions & 11 deletions doc/README.md
@@ -4,11 +4,29 @@ This repository implements five sourmash plugins, `manysketch`, `fastgather`, `f

The main *drawback* to these plugin commands is that their inputs and outputs are not as rich as the native sourmash commands. In particular, this means that input databases need to be prepared differently, and the output may be most useful as a prefilter in conjunction with regular sourmash commands.

-## Preparing the database
+## Preparing the search and query databases.

-`manysketch` requires a `fromfile` csv with columns `name,genome_filename,protein_filename`. If you don't have protein_filenames, be sure to include the trailing comma so the csv reader can process the file correctly. All four search commands use _text files containing lists of signature files_, or "fromfiles" for the search database. `multisearch`, `manysearch` and `fastmultigather` also use "fromfiles" for queries, too.
+`manysketch` requires a `fromfile` csv with columns `name,genome_filename,protein_filename`. If you don't have `protein_filename` entries, be sure to include the trailing comma so the csv reader can process the file correctly.
 
-(Yes, this plugin will eventually be upgraded to support zip files; keep an eye on [sourmash#2230](https://github.com/sourmash-bio/sourmash/pull/2230).)
+All four search/gather commands use either zip files or _text files containing lists of signature files_ ("fromfiles") for the search database. `multisearch`, `manysearch` and `fastmultigather` also use either zips or "fromfiles" for queries.
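The trailing-comma rule can be sketched with a tiny hypothetical fromfile (names and paths here are made up for illustration):

```shell
# Hypothetical manysketch fromfile: no protein_filename entries, but the
# trailing comma keeps every row at three columns for the csv reader.
cat > fromfile.csv <<EOF
name,genome_filename,protein_filename
genomeA,genomeA.fa,
genomeB,genomeB.fa,
EOF

# Verify each row parses as exactly three fields; prints "3" for each row.
awk -F, '{ print NF }' fromfile.csv
```

Without the trailing comma, the DNA-only rows would have only two fields and no longer match the three-column header.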

+### Using zip files
+
+Zip files are used in two ways, depending on how the command works.
+
+If the command loads a collection of sketches into memory at the start, then the sketches from the zip file are simply loaded into memory. So,
+* `multisearch` loads both query and database into memory;
+* `manysearch` loads the queries into memory;
+* `fastmultigather` loads the search database into memory.
+
+If the command loads a collection of sketches throughout execution, then the zip file is _unpacked_ to a temporary directory and the sketches are loaded from there. (This can consume a lot of extra disk space!) So,
+* `manysearch` loads the sketches being searched this way;
+* `fastgather` loads the database sketches this way;
+* `fastmultigather` loads the query sketches this way.
+
+Note that the temporary directory is created under the path given by the `TMPDIR` environment variable if it is set, and under `/tmp` otherwise.
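The `TMPDIR`-or-`/tmp` rule mirrors the usual shell default-expansion; a minimal sketch (the directory name is hypothetical):

```shell
# Where the unpacked sketches will land, mirroring the TMPDIR-or-/tmp rule.
TMPDIR=/scratch/fast-disk
echo "${TMPDIR:-/tmp}"    # /scratch/fast-disk

unset TMPDIR
echo "${TMPDIR:-/tmp}"    # /tmp
```

Pointing `TMPDIR` at a filesystem with plenty of free space before running these commands avoids filling up `/tmp`.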

+### Using "fromfiles"

To prepare a **signature** fromfile from a database, first you need to split the database into individual files:
```
@@ -27,7 +45,7 @@ find gtdb-reps-rs214-k21/ -name "*.sig.gz" -type f > list.gtdb-reps-rs214-k21.txt

### Running `manysketch`

-The `manysketch` command sketches one or more fastas into a zipped sourmash signature collection (`zip`).
+The `manysketch` command sketches one or more FASTA files into a zipped sourmash signature collection (`zip`).

To run `manysketch`, you need to build a text file list of fasta files, with one on each line (`fa.csv`, below). You can then run:

@@ -51,12 +69,12 @@ sourmash scripts manysketch fa.csv -o fa.zip -p k=21,k=31,k=51,scaled=1000,abund

The `multisearch` command compares one or more query genomes, and one or more subject genomes. It differs from `manysearch` by loading all genomes into memory.

-`multisearch` takes two file lists as input, and outputs a CSV:
+`multisearch` takes two input collections (zip or "fromfiles"), and outputs a CSV:
```
sourmash scripts multisearch query-list.txt podar-ref-list.txt -o results.csv
```

-To run it, you need to provide two "fromfiles" containing lists of paths to signature files (`.sig` or `.sig.gz`). If you create a fromfile as above with GTDB reps, you can generate a query fromfile like so:
+To run it, you need to provide two collections of signature files. If you create a fromfile as above with GTDB reps, you can generate a query fromfile like so:

```
head -10 list.gtdb-reps-rs214-k21.txt > list.query.txt
@@ -73,7 +91,7 @@ The results file here, `query.x.gtdb-reps.csv`, will have 8 columns: `query` and

The `fastgather` command is a much faster version of `sourmash gather`.

-`fastgather` takes a query metagenome and a file list as the database, and outputs a CSV:
+`fastgather` takes a query metagenome and an input collection (zip or "fromfile") as the database, and outputs a CSV:
```
sourmash scripts fastgather query.sig.gz podar-ref-list.txt -o results.csv --cores 4
```
@@ -105,7 +123,7 @@ A complete example Snakefile implementing the above workflow is available [in th

### Running `fastmultigather`

-`fastmultigather` takes a file list of query metagenomes and a file list for the database, and outputs many CSVs:
+`fastmultigather` takes a collection of query metagenomes and a collection of sketches as a database, and outputs many CSVs:
```
sourmash scripts fastmultigather query-list.txt podar-ref-lists.txt --cores 4
```
@@ -116,15 +134,15 @@ The main advantage that `fastmultigather` has over running `fastgather` on multi

`fastmultigather` will output two CSV files for each query, a `prefetch` file containing all overlapping matches between that query and the database, and a `gather` file containing the minimum metagenome cover for that query in the database.

-The prefetch CSV will be named `{basename}.prefetch.csv`, and the gather CSV will be named `{basename}.gather.csv`. Here, `{basename}` is the filename, stripped of its path.
+The prefetch CSV will be named `{basename}.prefetch.csv`, and the gather CSV will be named `{basename}.gather.csv`. Here, `{basename}` is the query filename, stripped of its path. If zip files are used, `{basename}` will be the query sketch's md5sum instead.
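The path-stripping described above is plain `basename` semantics; a quick sketch with a hypothetical query path (whether file extensions are also stripped isn't specified here):

```shell
# For a query sketch at sigs/SRR606249.sig.gz, the output CSVs are
# named after the file with its leading path removed.
q=sigs/SRR606249.sig.gz
b=$(basename "$q")
echo "${b}.prefetch.csv"   # SRR606249.sig.gz.prefetch.csv
echo "${b}.gather.csv"     # SRR606249.sig.gz.gather.csv
```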

**Warning:** At the moment, if two different queries have the same `{basename}`, the CSVs for one of the queries will be overwritten by the other query. The behavior here is undefined in practice, because of multithreading: we don't know what queries will be executed when or files will be written first.

### Running `manysearch`

-The `manysearch` command compares one or more query sketches, and one or more subject sketches. It is the core command we use for searching petabase-scale databases of metagenomes for contained genomes.
+The `manysearch` command compares one or more collections of query sketches, and one or more collections of subject sketches. It is the core command we use for searching petabase-scale databases of metagenomes for contained genomes.

-`manysearch` takes two file lists as input, and outputs a CSV:
+`manysearch` takes two collections as input, and outputs a CSV:
```
sourmash scripts manysearch query-list.txt podar-ref-list.txt -o results.csv
```
4 changes: 3 additions & 1 deletion src/README.md
@@ -1,3 +1,5 @@
-The Rust source is in `lib.rs`.
+# Source code guide
+
+The pyo3 Rust/Python interface code is in `lib.rs`. The top-level Rust functions called from pyo3 live in individual files named for each function, and common utility code is in `utils.rs`.

The Python source code is under `python/`, and tests under `python/tests/`.
