diff --git a/doc/index.md b/doc/index.md
index 4e7a9927e3..14d3a2e0ec 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -1,191 +1,114 @@
 # Welcome to sourmash!

-sourmash is a command-line tool and Python library for computing
-[hash sketches](https://en.wikipedia.org/wiki/MinHash) from DNA
-sequences, comparing them to each other, and plotting the results.
-This allows you to estimate sequence similarity between even very
-large data sets quickly and accurately.
-
-sourmash can be used to quickly search large databases of genomes
-for matches to query genomes and metagenomes; see [our list of
-available databases](databases.md).
-
-sourmash also includes k-mer based taxonomic exploration and
-classification routines for genome and metagenome analysis. These
-routines can use the NCBI and GTDB taxonomies but do not depend on them
-specifically.
-
-We have [several tutorials](tutorials.md) available! Start with
-[Making signatures, comparing, and searching](tutorial-basic.md).
-
-The paper [Large-scale sequence comparisons with sourmash (Pierce et al., 2019)](https://f1000research.com/articles/8-1006)
-gives an overview of how sourmash works and what its major use cases are.
-Please also see the `mash` [software](http://mash.readthedocs.io/en/latest/) and
-[paper (Ondov et al., 2016)](http://dx.doi.org/10.1186/s13059-016-0997-x) for
-background information on how and why MinHash works.
-
-**Questions? Thoughts?** Ask us on the [sourmash issue tracker](https://github.com/sourmash-bio/sourmash/issues/)!
-
-**Want to migrate to sourmash v4?** sourmash v4 is now available, and
-has a number of incompatibilites with v2 and v3. Please see
-[our migration guide](support.md#migrating-from-sourmash-v3x-to-sourmash-v4x)!
-
------
-
-To use sourmash, you must be comfortable with the UNIX command line;
-programmers may find the [Python library and API](api.md) useful as well.
-
-If you use sourmash, please cite us!
-
-> Brown and Irber (2016),
-> **sourmash: a library for MinHash sketching of DNA**.
-> Journal of Open Source Software, 1(5), 27, [doi:10.21105/joss.00027](https://joss.theoj.org/papers/3d793c6e7db683bee7c03377a4a7f3c9)
-
-## sourmash in brief
-
-sourmash uses MinHash-style sketching to create "signatures", compressed
-representations of DNA/RNA sequence. These signatures can then
-be stored, searched, explored, and taxonomically annotated.
-
-* `sourmash` provides command line utilities for creating, comparing,
-  and searching signatures, as well as plotting and clustering
-  signatures by similarity (see [the command-line docs](command-line.md)).
-
-* `sourmash` can **search very large collections of signatures** to find matches
-  to a query.
-
-* `sourmash` can also **identify parts of metagenomes that match known genomes**,
-  and can **taxonomically classify genomes and metagenomes** against databases
-  of known species.
+```{contents} Contents
+:depth: 3
+```

-* `sourmash` can be used to **search databases of public sequences**
-  (e.g. all of GenBank) and can also be used to create and search databases
-  of **private sequencing data**.
+sourmash is a command-line tool and Python/Rust library for
+**metagenome analysis** and **genome comparison** with k-mers. It
+supports the compositional analysis of metagenomes, rapid search of
+large sequence databases, and flexible taxonomic analysis with both
+NCBI and GTDB taxonomies. sourmash works well with sequences 30kb or
+larger, including bacterial and viral genomes.

-* `sourmash` supports saving, loading, and communication of signatures
-  via [JSON](http://www.json.org/), a ~human-readable and editable format.
+You might try sourmash if you want to -

-* `sourmash` also has a simple Python API for interacting with signatures,
-  including support for online updating and querying of signatures
-  (see [the API docs](api.md)).
+* identify which reference genomes to map your metagenomic reads to
+* search all Genbank microbial genomes with a sequence query
+* cluster many genomes by similarity
+* taxonomically classify genomes or metagenomes against NCBI and/or GTDB;
+* search thousands of metagenomes with a query genome or sequence

-* `sourmash` relies on an underlying Rust core for performance.
+Our **vision**: sourmash strives to support biologists in analyzing
+modern sequencing data at high resolution and with full context,
+including all public reference genomes and metagenomes.

-* `sourmash` is developed [on GitHub](https://github.com/sourmash-bio/sourmash)
-  and is **freely and openly available** under the BSD 3-clause license.
-  Please see [the README](https://github.com/sourmash-bio/sourmash/blob/latest/README.md)
-  for more information on development, support, and contributing.
+## How does sourmash work?

-You can take a look at sourmash analyses on real data
-[in a saved Jupyter notebook](https://github.com/sourmash-bio/sourmash/blob/latest/doc/sourmash-examples.ipynb),
-and experiment with it yourself
-[interactively in a Jupyter Notebook](https://mybinder.org/v2/gh/sourmash-bio/sourmash/latest?labpath=doc%2Fsourmash-examples.ipynb)
-at [mybinder.org](http://mybinder.org).
+Underneath, sourmash uses [FracMinHash sketches](https://www.biorxiv.org/content/10.1101/2022.01.11.475838) for fast and
+lightweight sequence comparison; FracMinHash builds on
+[MinHash sketching](https://en.wikipedia.org/wiki/MinHash) to support both Jaccard similarity
+_and_ containment analyses with k-mers. This significantly expands
+the range of operations that can be done quickly and in low
+memory. sourmash also implements a number of new and powerful analysis
+techniques, including minimum metagenome covers and alignment-free ANI
+estimation.

-## Installing sourmash
+sourmash is inspired by [mash](https://mash.readthedocs.io), and
+supports most mash analyses. sourmash also implements an expanded set
+of functionality for metagenome and taxonomic analysis.

-You can use pip:
-```bash
-$ pip install sourmash
-```
+sourmash development was initiated with a grant from the Moore
+Foundation under the Data Driven Discovery program, and has been
+supported by further funding from the NIH and NSF. Please see
+[funding acknowledgements](funding.md) for details!

-or conda:
-```bash
-$ conda install -c conda-forge -c bioconda sourmash
-```
+## Mission statement

-Please see [the README file in github.com/sourmash-bio/sourmash](https://github.com/sourmash-bio/sourmash/blob/latest/README.md)
-for more information.
+The project mission is to provide practical tools and approaches for
+analyzing extremely large sequencing data sets, with an emphasis on
+high resolution results. We design around the following principles:

-## Memory and speed
+* genomic and metagenomic analyses should be able to make use of all
+  available reference genomes.
+* metagenomic analyses should support assembly independent approaches,
+  to avoid biases stemming from low coverage or high strain
+  variability.
+* private and public databases should be equally well supported.
+* a variety of data structures and algorithms are necessary to support
+  a wide set of use cases, including efficient command-line analysis,
+  real-time queries, and massive-scale batch analyses.
+* our tools should be well behaved members of the bioinformatics
+  analysis tool ecosystem, and use common installation approaches,
+  standard formats, and semantic versioning.
+* our tools should be robustly tested, well documented, and supported.
+* we discuss scientific and computational tradeoffs and make specific
+  recommendations where possible, admitting uncertainty as needed.

-sourmash has relatively small disk and memory requirements compared to
-many other software programs used for genome search and taxonomic
-classification.
+## Using sourmash

-`sourmash search` and `sourmash gather` can be used to search 100k
-genbank microbial genomes ([using our prepared databases](databases.md))
-with about 20 GB of disk and in under 1 GB of RAM.
-Typically a search for a single genome takes about 30 seconds on a laptop.
+### Tutorials and examples

-`sourmash lca` can be used to search/classify against all genbank
-microbial genomes with about 200 MB of disk space and about 10 GB of
-RAM. Typically a metagenome classification takes about 1 minute on a
-laptop.
+These tutorials are command line tutorials that should work on Mac OS
+X and Linux. They require about 5 GB of disk space and 5 GB of RAM.

-## sourmash versioning
+* [The first sourmash tutorial - making signatures, comparing, and searching](tutorial-basic.md)

-We support the use of sourmash in pipelines and applications
-by communicating clearly about bug fixes, feature additions, and feature
-changes. We use version numbers as follows:
+* [Using sourmash LCA to do taxonomic classification](tutorials-lca.md)

-* Major releases, like v4.0.0, may break backwards compatibility at
-  the command line as well as top-level Python/Rust APIs.
-* Minor releases, like v4.1.0, will remain backwards compatible but
-  may introduce significant new features.
-* Patch releases, like v4.1.1, are for minor bug fixes; full backwards
-  compatibility is retained.
+* [Analyzing the genomic and taxonomic composition of an environmental genome using GTDB and sample-specific MAGs with sourmash](tutorial-lemonade.md)

-If you are relying on sourmash in a pipeline or application, we
-suggest specifying your version requirements at the major release,
-e.g. in conda you would specify `sourmash>=3,<4`.
+* [Some sourmash command line examples!](sourmash-examples.ipynb)

-See [the Versioning docs](support.md) for more information on what our
-versioning policy means in detail, and how to migrate between major
-versions!
+### How-To Guides

-## Limitations
+* Installing sourmash

-**sourmash cannot find matches across large evolutionary distances.**
+* [Classifying genome sketches](classifying-signatures.md)

-sourmash seems to work well to search and compare data sets for
-nucleotide matches at the species and genus level, but does not have much
-sensitivity beyond that. (It seems to be particularly good at
-strain-level analysis.) You should use protein-based analyses
-to do searches across larger evolutionary distances.
+* [Working with private collections of genome sketches.](sourmash-collections.ipynb)

-**sourmash signatures can be very large.**
+* [Using the `LCA_Database` API.](using-LCA-database-API.ipynb)

-We use a modification of the MinHash sketch approach that allows us
-to search the contents of metagenomes and large genomes with no loss
-of sensitivity, but there is a tradeoff: there is no guaranteed limit
-to signature size when using 'scaled' signatures.
+* [Building plots from `sourmash compare` output](plotting-compare.ipynb).

-## Logo
+* [A short guide to using sourmash output with R](other-languages.md).

-The sourmash logo was designed by Stéfanie Fares Sabbag,
-with feedback from Clara Barcelos,
-Taylor Reiter and Luiz Irber.
+### How sourmash works under the hood

-
+* [An introduction to k-mers for genome comparison and analysis](kmers-and-minhash.ipynb)
+* [Support, versioning, and migration between versions](support.md)

-The logo
-is licensed under a Creative Commons
-Attribution-ShareAlike 4.0 International License.
+### Reference material

-## Contents:
+* [UNIX command-line documentation](command-line.md)
+* [Genbank and GTDB databases and taxonomy files](databases.md)
+* [Python examples using the API](api-example.md)
+* [Publications about sourmash](publications.md)
+* [A guide to the internals of sourmash](sourmash-internals.md)
+* [Funding acknowledgements](funding.md)

-```{toctree}
----
-maxdepth: 2
----
-
-command-line
-tutorials
-using-sourmash-a-guide
-classifying-signatures
-databases
-api
-more-info
-support
-developer
-```
+## Developing and extending sourmash

-# Indices and tables
+* [Releasing a new version of sourmash](release.md)

-* {ref}`genindex`
-* {ref}`modindex`
-* {ref}`search`
diff --git a/doc/new.md b/doc/new.md
deleted file mode 100644
index 14d3a2e0ec..0000000000
--- a/doc/new.md
+++ /dev/null
@@ -1,114 +0,0 @@
-# Welcome to sourmash!
-
-```{contents} Contents
-:depth: 3
-```
-
-sourmash is a command-line tool and Python/Rust library for
-**metagenome analysis** and **genome comparison** with k-mers. It
-supports the compositional analysis of metagenomes, rapid search of
-large sequence databases, and flexible taxonomic analysis with both
-NCBI and GTDB taxonomies. sourmash works well with sequences 30kb or
-larger, including bacterial and viral genomes.
-
-You might try sourmash if you want to -
-
-* identify which reference genomes to map your metagenomic reads to
-* search all Genbank microbial genomes with a sequence query
-* cluster many genomes by similarity
-* taxonomically classify genomes or metagenomes against NCBI and/or GTDB;
-* search thousands of metagenomes with a query genome or sequence
-
-Our **vision**: sourmash strives to support biologists in analyzing
-modern sequencing data at high resolution and with full context,
-including all public reference genomes and metagenomes.
-
-## How does sourmash work?
-
-Underneath, sourmash uses [FracMinHash sketches](https://www.biorxiv.org/content/10.1101/2022.01.11.475838) for fast and
-lightweight sequence comparison; FracMinHash builds on
-[MinHash sketching](https://en.wikipedia.org/wiki/MinHash) to support both Jaccard similarity
-_and_ containment analyses with k-mers. This significantly expands
-the range of operations that can be done quickly and in low
-memory. sourmash also implements a number of new and powerful analysis
-techniques, including minimum metagenome covers and alignment-free ANI
-estimation.
-
-sourmash is inspired by [mash](https://mash.readthedocs.io), and
-supports most mash analyses. sourmash also implements an expanded set
-of functionality for metagenome and taxonomic analysis.
-
-sourmash development was initiated with a grant from the Moore
-Foundation under the Data Driven Discovery program, and has been
-supported by further funding from the NIH and NSF. Please see
-[funding acknowledgements](funding.md) for details!
-
-## Mission statement
-
-The project mission is to provide practical tools and approaches for
-analyzing extremely large sequencing data sets, with an emphasis on
-high resolution results. We design around the following principles:
-
-* genomic and metagenomic analyses should be able to make use of all
-  available reference genomes.
-* metagenomic analyses should support assembly independent approaches,
-  to avoid biases stemming from low coverage or high strain
-  variability.
-* private and public databases should be equally well supported.
-* a variety of data structures and algorithms are necessary to support
-  a wide set of use cases, including efficient command-line analysis,
-  real-time queries, and massive-scale batch analyses.
-* our tools should be well behaved members of the bioinformatics
-  analysis tool ecosystem, and use common installation approaches,
-  standard formats, and semantic versioning.
-* our tools should be robustly tested, well documented, and supported.
-* we discuss scientific and computational tradeoffs and make specific
-  recommendations where possible, admitting uncertainty as needed.
-
-## Using sourmash
-
-### Tutorials and examples
-
-These tutorials are command line tutorials that should work on Mac OS
-X and Linux. They require about 5 GB of disk space and 5 GB of RAM.
-
-* [The first sourmash tutorial - making signatures, comparing, and searching](tutorial-basic.md)
-
-* [Using sourmash LCA to do taxonomic classification](tutorials-lca.md)
-
-* [Analyzing the genomic and taxonomic composition of an environmental genome using GTDB and sample-specific MAGs with sourmash](tutorial-lemonade.md)
-
-* [Some sourmash command line examples!](sourmash-examples.ipynb)
-
-### How-To Guides
-
-* Installing sourmash
-
-* [Classifying genome sketches](classifying-signatures.md)
-
-* [Working with private collections of genome sketches.](sourmash-collections.ipynb)
-
-* [Using the `LCA_Database` API.](using-LCA-database-API.ipynb)
-
-* [Building plots from `sourmash compare` output](plotting-compare.ipynb).
-
-* [A short guide to using sourmash output with R](other-languages.md).
-
-### How sourmash works under the hood
-
-* [An introduction to k-mers for genome comparison and analysis](kmers-and-minhash.ipynb)
-* [Support, versioning, and migration between versions](support.md)
-
-### Reference material
-
-* [UNIX command-line documentation](command-line.md)
-* [Genbank and GTDB databases and taxonomy files](databases.md)
-* [Python examples using the API](api-example.md)
-* [Publications about sourmash](publications.md)
-* [A guide to the internals of sourmash](sourmash-internals.md)
-* [Funding acknowledgements](funding.md)
-
-## Developing and extending sourmash
-
-* [Releasing a new version of sourmash](release.md)
-