From 872351de3994625dcfeac34d2e0376e698ed991b Mon Sep 17 00:00:00 2001
From: Tessa Pierce Ward
Date: Mon, 16 Dec 2024 15:18:53 -0800
Subject: [PATCH] MRG: add genbank plant db to docs (#3429)

created with directsketch; see https://github.com/bluegenes/2024-ds-plant for details

ref https://github.com/sourmash-bio/sourmash/issues/3172

---------

Co-authored-by: C. Titus Brown
---
 doc/databases.md | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/doc/databases.md b/doc/databases.md
index 3d607bdb9f..efdd7a55d7 100644
--- a/doc/databases.md
+++ b/doc/databases.md
@@ -37,7 +37,7 @@ genomes. Among other uses, they can be used to detect host
 contamination in microbial metagenomes.
 
 Each file includes sketches at k=21, k=31, and k=51, at a scaled of
-1000, and is about 110 MB.
+1000, and is under 50 MB.
 
 * Human (hg38) - [hg38.sig.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/host/hg38.sig.zip)
 * Cow (bosTau9) - [bosTau9.sig.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/host/bosTau9.sig.zip)
@@ -49,6 +49,18 @@ Each file includes sketches at k=21, k=31, and k=51, at a scaled of
 * Goat (oviAri4) - [oviAri4.sig.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/host/oviAri4.sig.zip)
 * Pig (susCr11) - [susScr11.sig.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/host/susScr11.sig.zip)
 
+## Sketches for plant genomes
+
+These sketches are for the plant genomes available in GenBank as of 2024-07.
+
+| K-mer size | Zipfile collection |
+| -------- | -------- |
+| k21 | [download (7G)](https://farm.cse.ucdavis.edu/\~ctbrown/sourmash-db/genbank-plant-2024-07/genbank-plants-2024-07.k21.zip) |
+| k31 | [download (8.8G)](https://farm.cse.ucdavis.edu/\~ctbrown/sourmash-db/genbank-plant-2024-07/genbank-plants-2024-07.k31.zip) |
+| k51 | [download (11G)](https://farm.cse.ucdavis.edu/\~ctbrown/sourmash-db/genbank-plant-2024-07/genbank-plants-2024-07.k51.zip) |
+
+Lineage spreadsheet for sourmash `tax` commands: [download](https://farm.cse.ucdavis.edu/\~ctbrown/sourmash-db/genbank-plant-2024-07/genbank-plants-2024-07.lineages.csv.gz)
+
 ## GTDB R08-RS214 - DNA databases
 
 [GTDB R08-RS214](https://forum.gtdb.ecogenomic.org/t/announcing-gtdb-r08-rs214/456) consists of 402,709 genomes organized into 85,205 species clusters.