diff --git a/doc/command-line.md b/doc/command-line.md
index b93a6c7c10..99eb7d0d4a 100644
--- a/doc/command-line.md
+++ b/doc/command-line.md
@@ -5,7 +5,7 @@
 ```
 
 From the command line, sourmash can be used to create
-[MinHash sketches][0] from DNA and protein sequences, compare them to
+[FracMinHash sketches][0] from DNA and protein sequences, compare them to
 each other, and plot the results; these sketches are saved into
 "signature files". These signatures allow you to estimate sequence
 similarity and containment quickly and accurately in large
@@ -45,7 +45,10 @@ Next, compare all the signatures to each other:
 sourmash compare *.sig -o cmp.dist
 ```
 
-Finally, plot a dendrogram: ``` sourmash plot cmp.dist --labels ```
+Finally, plot a dendrogram:
+```
+sourmash plot cmp.dist --labels
+```
 This will output three files, `cmp.dist.dendro.png`,
 `cmp.dist.matrix.png`, and `cmp.dist.hist.png`, containing a
 clustering & dendrogram of the sequences, a similarity matrix and
@@ -562,15 +565,16 @@ As with all reference-based analysis, results can be affected by the
 and redundancy of reference databases.
 
 For more details on how `gather` works and can be used to classify
- signatures, see [classifying-signatures](classifying-signatures.md).
+signatures, see
 
 ### `sourmash tax metagenome` - summarize metagenome content from `gather` results
 
 `sourmash tax metagenome` summarizes gather results for each query metagenome by
- taxonomic lineage.
+taxonomic lineage.
 
-example command to summarize a single `gather csv`, where the query was gathered
- against `gtdb-rs202` representative species database:
+Here is an example command to summarize a single `gather csv`, where
+ the query was gathered against `gtdb-rs202` representative species
+database:
 
 ```
 sourmash tax metagenome
@@ -589,10 +593,10 @@ The possible output formats are:
 #### `csv_summary` output format
 
 `csv_summary` is the default output format. This outputs a `csv` with lineage
- summarization for each taxonomic rank. This output currently consists of six
- columns, `query_name,rank,fraction,lineage,query_md5,query_filename`, where
- `fraction` is the fraction of the query matched to the reported rank and
- lineage.
+summarization for each taxonomic rank. This output currently consists of six
+columns, `query_name,rank,fraction,lineage,query_md5,query_filename`, where
+`fraction` is the fraction of the query matched to the reported rank and
+lineage.
 
 example `csv_summary` output from the command above:
 
@@ -611,7 +615,7 @@ o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri
 HSMA33MX,species,0.016,d__Bacteria;p__Bacteroidota;c__Bacteroidia;
 o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus
 ```
-> The `query_md5` and `query_filename` columns are omitted here for brevity.
+The `query_md5` and `query_filename` columns are omitted here for brevity.
 
 #### `krona` output format
 
@@ -1005,14 +1009,14 @@ commands.
 
 All `sourmash tax` commands must be given one or more taxonomy files as
 parameters to the `--taxonomy` argument. These files can be either CSV
-files or (as of sourmash 4.2.1) sqlite3 databases. sqlite3 databases
+files or (as of sourmash 4.2.1) SQLite databases. SQLite databases
 are much faster for large taxonomies, while CSV files are easier to view
 and modify using spreadsheet software.
 
 `sourmash tax prepare` is a utility function that can ingest and validate
-multiple CSV files or sqlite3 databases, and output a CSV file or a sqlite3
+multiple CSV files or SQLite databases, and output a CSV file or a SQLite
 database. It can be used to combine multiple taxonomies into a single file,
-as well as change formats between CSV and sqlite3.
+as well as change formats between CSV and SQLite.
 
 The following command will take in two taxonomy files and combine them into
 a single taxonomy SQLite database.
@@ -1077,7 +1081,7 @@ Note: `tax grep` only searches taxonomic ranks, not identifier strings.
 Use `sig grep` to search for identifiers in sketch collections.
 
 Currently only CSV output (optionally gzipped) is supported; use `sourmash tax prepare` to
-convert CSV output from `tax grep` into a sqlite3 taxonomy database.
+convert CSV output from `tax grep` into a SQLite taxonomy database.
 
 ### `sourmash tax summarize` - print summary information for lineage spreadsheets or taxonomy databases
 
@@ -1188,7 +1192,7 @@ a status of `disagree` with the genus-level assignment of *Shewanella*;
 species level assignments would not be reported. Here, the assigned
 rank is the rank immediately *above* where there is a taxonomic
 disagreement, and the taxid & lineage refer to the name at
-that rank (the least-common-ancestor at which an assignment can be
+that rank (the lowest common ancestor at which an assignment can be
 made).
 
 For another example, if you saw this line in the CSV file:
@@ -1364,8 +1368,8 @@ agreement/disagreement. Please see the blog post
 [Why are taxonomic assignments so different for Tara bins?](http://ivory.idyll.org/blog/2017-taxonomic-disagreements-in-tara-mags.html)
 for an example use case.
 
-[0]:https://en.wikipedia.org/wiki/MinHash
-[1]:http://mash.readthedocs.io/en/latest/__
+[0]:https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2
+[1]:http://mash.readthedocs.io/en/latest/
 [2]:http://biorxiv.org/content/early/2015/10/26/029827
 [3]:https://en.wikipedia.org/wiki/Jaccard_index
 
@@ -1403,6 +1407,12 @@ sourmash signature cat file1.sig file2.sig -o all.zip
 will combine all signatures in `file1.sig` and `file2.sig` and put them
 in the file `all.zip`.
 
+#### Using picklists with `sourmash sig cat`
+
+As of sourmash 4.2.0, `cat` also supports picklists, a feature by
+which you can select signatures based on values in a CSV file. See
+[Using picklists to subset large collections of signatures](#using-picklists-to-subset-large-collections-of-signatures), below.
+
 ### `sourmash signature describe` - display detailed information about signatures
 
 Display signature details.
@@ -1767,9 +1777,36 @@ and/or containment might be very close to zero.
 
 For example,
 ```
-sourmash signature overlap file1.sig file2.sig
+sourmash signature overlap tests/test-data/63.fa.sig \
+ tests/test-data/47.fa.sig
+```
+will display the detailed comparison of the two files like so:
+```text
+loaded one signature each from tests/test-data/63.fa.sig and tests/test-data/47.fa.sig
+first signature:
+ signature filename: tests/test-data/63.fa.sig
+ signature: NC_011663.1 Shewanella baltica OS223, complete genome
+ md5: 38729c6374925585db28916b82a6f513
+ k=31 molecule=DNA num=0 scaled=1000
+
+second signature:
+ signature filename: tests/test-data/47.fa.sig
+ signature: NC_009665.1 Shewanella baltica OS185, complete genome
+ md5: 09a08691ce52952152f0e866a59f6261
+ k=31 molecule=DNA num=0 scaled=1000
+
+similarity: 0.32069
+first contained in second: 0.48282
+second contained in first: 0.48851
+
+number of hashes in first: 5238
+number of hashes in second: 5177
+
+number of hashes in common: 2529
+only in first: 2709
+only in second: 2648
+total (union): 7886
 ```
-will display the detailed comparison of `file1.sig` and `file2.sig`.
 
 `sig overlap` can only work with compatible sketches - if there are multiple
 k-mer sizes or molecule types present in any of the signature files,
@@ -1970,7 +2007,7 @@ CSV file (based on the headers in the first line of the CSV file),
 and `:include` or `:exclude`, can be added as a fourth parameter; if
 omitted, the default is `:include`.
 
-The following `coltype`s are currently supported by `sourmash sig extract`:
+The following `coltype`s are currently supported for picklists:
 
 * `name` - exact match to signature's name
 * `md5` - exact match to signature's md5sum
@@ -2032,8 +2069,8 @@ slow, especially for many (100s or 1000s) of signatures.
 
 All of the `sourmash` commands support loading collections of
 signatures from zip files. You can create a compressed collection of
-signatures using `zip -r collection.zip *.sig` and then specify
-`collections.zip` on the command line.
+signatures using `sourmash sig cat *.sig -o collections.zip` and then
+specifying `collections.zip` on the command line in place of `*.sig`.
 
 ### Choosing signature output formats
 
diff --git a/doc/index.md b/doc/index.md
index 2adbf317e9..5a09ea7052 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -11,11 +11,11 @@ viral genomes.
 
 You might try sourmash if you want to -
 
-* identify which reference genomes to map your metagenomic reads to
-* search all Genbank microbial genomes with a sequence query
-* cluster many genomes by similarity
-* taxonomically classify genomes or metagenomes against NCBI and/or GTDB
-* search thousands of metagenomes with a query genome or sequence
+* identify which reference genomes to use for metagenomic read mapping;
+* search all Genbank microbial genomes with a sequence query;
+* cluster hundreds or thousands of genomes by similarity;
+* taxonomically classify genomes or metagenomes against NCBI and/or GTDB;
+* search thousands of metagenomes with a query genome or sequence;
 
 Our **vision**: sourmash strives to support biologists in analyzing
 modern sequencing data at high resolution and with full context,
diff --git a/doc/publications.md b/doc/publications.md
index 44e8e9c4f9..bcef21a47f 100644
--- a/doc/publications.md
+++ b/doc/publications.md
@@ -13,7 +13,7 @@ Pierce et al., 2019. This is the original sourmash use case paper.
 
 [Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05103-0),
 Portik et al., 2022. This paper shows that sourmash is extremely
-sensitive and very specific for taxonomic classification.
+sensitive and very specific for taxonomic profiling.
 
 ## Petabase-scale search