MRG: add full column descriptions for `gather` and `prefetch` output #2954

ctb · 2024-01-29T18:00:51Z

This PR adds full column descriptions for `gather` and `prefetch` to classifying-signatures.md. It also updates some other details in that document, including adding a link to the Hera et al. paper published in 2023.

See rendered docs!

Fixes #2812
Fixes #2367

codecov · 2024-01-29T18:09:26Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (9033d6d) 86.61% compared to head (40aae2a) 86.61%.

Additional details and impacted files

@@           Coverage Diff           @@
##           latest    #2954   +/-   ##
=======================================
  Coverage   86.61%   86.61%           
=======================================
  Files         135      135           
  Lines       15262    15262           
  Branches     2622     2622           
=======================================
  Hits        13219    13219           
  Misses       1743     1743           
  Partials      300      300

| Flag | Coverage Δ |
| --- | --- |
| hypothesis-py | `25.83% <ø> (ø)` |
| python | `92.82% <ø> (ø)` |
| rust | `59.22% <ø> (ø)` |

Flags with carried forward coverage won't be shown.


ctb · 2024-01-29T19:41:49Z

@sourmash-bio/devs ready for review and merge!

ccbaumler · 2024-01-29T21:35:27Z

What about a table instead of a list?

| `Gather` column header | Type | Description |
| --- | --- | --- |
| `unique_intersect_bp` | integer | Size of overlap between match and remaining query, estimated by multiplying the number of overlapping hashes by scaled. Rank/order dependent. Does not double count hashes. |
| `intersect_bp` | integer | Size of overlap between match and query, estimated by multiplying the number of overlapping hashes by scaled. Independent of rank order and will often double-count hashes. |
| `f_orig_query` | float | The fraction of the original query represented by this match. Approximates the fraction of metagenomic reads that will map to this genome. |
| `f_match` | float | The containment of the match in the query. |
| `f_unique_to_query` | float | The fraction of matching hashes (unweighted) that are unique to this query; rank dependent. Will sum to the fraction of total k-mers (unweighted) that were identified. |
| `f_unique_weighted` | float | The fraction of matching hashes (weighted by multiplicity) that are unique to this query. This will sum to the fraction of total weighted k-mers that were identified. Approximates the fraction of metagenomic reads that will map to this genome after all previous matches at lower (earlier) ranks are mapped. |
| `average_abund` | float | Mean abundance of the weighted hashes unique to the intersection. Empty if query does not have abundance. Rank dependent, does not double count. |
| `median_abund` | integer | Median abundance of the weighted hashes unique to the intersection. Empty if query has no abundance. Rank dependent, does not double count. |
| `std_abund` | float | Std deviation of the abundance of the hashes unique to the intersection. Empty if query has no abundance. Rank dependent, does not double count. |
| `filename` | string | Filename/location of the database from which the match was loaded. |
| `name` | string | Full sketch name of the match. |
| `md5` | string | Full md5sum of the match sketch. |
| `f_match_orig` | float | The fraction of the match in the full query. Rank independent. |
| `gather_result_rank` | float | Rank of this match in the results. |
| `remaining_bp` | integer | How many bp remain in the query after subtracting this match, estimated by multiplying remaining hashes by scaled. |
| `query_filename` | string | The filename from which the query was loaded. |
| `query_name` | string | The query sketch name. |
| `query_md5` | string | Truncated md5sum of the query sketch. |
| `query_bp` | integer | Estimated number of bp in the query, estimated by multiplying the sketch size by scaled. |
| `ksize` | integer | K-mer size for the sketches used in the comparison. |
| `moltype` | string | Molecule type of the comparison. |
| `scaled` | integer | Scaled value of the comparison. |
| `query_n_hashes` | integer | Number of hashes in the query sketch. |
| `query_abundance` | boolean | True if the query has abundance information; False otherwise. |
| `query_containment_ani` | float | ANI estimated from the query containment in the match. |
| `match_containment_ani` | float | ANI estimated from the match containment in the query. |
| `average_containment_ani` | float | ANI estimated from the average of the query and match containment. |
| `max_containment_ani` | float | ANI estimated from the max of the query and match containment. |
| `potential_false_negative` | boolean | True if the sketch size(s) were too small to give a reliable ANI estimate. False otherwise. |
| `n_unique_weighted_found` | integer | Sum of (abundance-weighted) hashes found in this rank. |
| `sum_weighted_found` | integer | Sum of the hashes x abundance found thus far, i.e., running total of `n_unique_weighted_found`. The last value divided by `total_weighted_hashes` will equal the total fraction of (weighted) k-mers identified. |
| `total_weighted_hashes` | integer | Sum of hashes x abundance for the entire dataset. Constant value. |
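To make the rank-dependent and rank-independent columns concrete, here is a minimal sketch that mimics a greedy gather over plain Python sets. It is not sourmash's implementation: the function `toy_gather`, the hash values, the abundances, and the match names are all hypothetical, and it assumes the best match at each rank is chosen by unweighted overlap with the remaining query and that ranks start at 0. Only the column names and their definitions come from the table above.

```python
def toy_gather(query_abunds, matches, scaled):
    """Greedy toy gather over plain sets (NOT sourmash's implementation).

    query_abunds: dict mapping each query hash to its abundance.
    matches: dict mapping a match name to its set of hashes.
    scaled: the scaled value used to convert hash counts to bp estimates.
    """
    original = set(query_abunds)
    remaining = set(original)
    total_weighted_hashes = sum(query_abunds.values())
    candidates = {name: set(hashes) for name, hashes in matches.items()}
    sum_weighted_found = 0
    rows = []
    while remaining and candidates:
        # Pick the candidate covering the most remaining query hashes
        # (assumption); sorted() keeps tie-breaking deterministic.
        best = max(sorted(candidates),
                   key=lambda n: len(candidates[n] & remaining))
        match = candidates.pop(best)
        unique = match & remaining
        if not unique:
            break  # nothing left that any candidate can explain
        remaining -= unique
        n_unique_weighted_found = sum(query_abunds[h] for h in unique)
        sum_weighted_found += n_unique_weighted_found
        rows.append({
            "name": best,
            "gather_result_rank": len(rows),
            # Rank independent: overlap with the ORIGINAL query.
            "intersect_bp": len(match & original) * scaled,
            "f_orig_query": len(match & original) / len(original),
            # Rank dependent: overlap with the REMAINING query.
            "unique_intersect_bp": len(unique) * scaled,
            "f_unique_to_query": len(unique) / len(original),
            "f_unique_weighted": n_unique_weighted_found / total_weighted_hashes,
            "average_abund": n_unique_weighted_found / len(unique),
            "remaining_bp": len(remaining) * scaled,
            "n_unique_weighted_found": n_unique_weighted_found,
            "sum_weighted_found": sum_weighted_found,
            "total_weighted_hashes": total_weighted_hashes,
        })
    return rows


# Hypothetical query: hashes 0..9 with made-up abundances.
query_abunds = {h: 2 for h in range(6)}
query_abunds.update({6: 5, 7: 5, 8: 1, 9: 1})

# Hypothetical matches; genomeA and genomeB share hashes 4 and 5.
matches = {
    "genomeA": {0, 1, 2, 3, 4, 5},
    "genomeB": {4, 5, 6, 7},
    "genomeC": {100, 101},
}

rows = toy_gather(query_abunds, matches, scaled=1000)
for row in rows:
    print(row["gather_result_rank"], row["name"],
          row["intersect_bp"], row["unique_intersect_bp"])

# Fraction of weighted k-mers identified, per the sum_weighted_found description.
last = rows[-1]
print(last["sum_weighted_found"] / last["total_weighted_hashes"])
```

In this toy data the second match has an `intersect_bp` of 4000 but a `unique_intersect_bp` of only 2000, because two of its hashes were already assigned to the first match. Summing `intersect_bp` therefore double counts, while `unique_intersect_bp` and `f_unique_to_query` do not.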

| `Prefetch` column header | Type | Description |
| --- | --- | --- |
| `intersect_bp` | integer | Size of overlap between match and original query, estimated by multiplying the number of overlapping hashes by `scaled`. |
| `jaccard` | float | Jaccard similarity of the two sketches. |
| `max_containment` | float | Max of `f_query_match` and `f_match_query`. |
| `f_query_match` | float | The fraction of the query contained by the match. |
| `f_match_query` | float | The fraction of the match contained by the query. |
| `match_filename` | string | Filename the match sketch was loaded from. |
| `match_name` | string | Full name of the match sketch. |
| `match_md5` | string | Truncated md5sum of match sketch (8 char). |
| `match_bp` | integer | Size of match, estimated by multiplying the sketch size by scaled. |
| `query_filename` | string | Filename the query sketch was loaded from. |
| `query_name` | string | Full name of the query sketch. |
| `query_md5` | string | Truncated md5sum of query sketch (8 char). |
| `query_bp` | integer | Size of query, estimated by multiplying the sketch size by scaled. |
| `ksize` | integer | K-mer size for the sketches used in the comparison. |
| `moltype` | string | Molecule type of the sketches. |
| `scaled` | integer | Scaled value at which the comparison was done. |
| `query_n_hashes` | integer | Number of hashes in the query. |
| `query_abundance` | integer | Median hash abundance in the sketch, if available. |
| `query_containment_ani` | float | ANI estimated from the query containment in the match. |
| `match_containment_ani` | float | ANI estimated from the match containment in the query. |
| `average_containment_ani` | float | ANI estimated from the average of the query and match containment. |
| `max_containment_ani` | float | ANI estimated from the max containment between query/match. |
| `potential_false_negative` | boolean | True if the sketch size(s) were too small to give a reliable ANI estimate. False if ANI estimate is reliable. |
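The pairwise prefetch columns can be illustrated the same way. The following is a rough sketch, assuming each sketch is just a set of hash values; `toy_prefetch_row` and the example hashes are hypothetical and are not part of sourmash. It only restates the column definitions above as set arithmetic.

```python
def toy_prefetch_row(query_hashes, match_hashes, scaled):
    """Compute prefetch-style columns from two plain hash sets.

    A hypothetical helper for illustration only, not a sourmash function.
    """
    query, match = set(query_hashes), set(match_hashes)
    overlap = query & match
    # Fraction of the query contained by the match, and vice versa.
    f_query_match = len(overlap) / len(query)
    f_match_query = len(overlap) / len(match)
    return {
        "intersect_bp": len(overlap) * scaled,
        "jaccard": len(overlap) / len(query | match),
        "max_containment": max(f_query_match, f_match_query),
        "f_query_match": f_query_match,
        "f_match_query": f_match_query,
        "query_bp": len(query) * scaled,
        "match_bp": len(match) * scaled,
        "query_n_hashes": len(query),
    }


# Hypothetical sketches: 8 query hashes, 5 match hashes, 4 shared.
row = toy_prefetch_row(range(8), {4, 5, 6, 7, 8}, scaled=1000)
for column, value in row.items():
    print(column, value)
```

Unlike gather, nothing here depends on rank: every value is computed against the original query, so each row can be read on its own.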

doc/classifying-signatures.md

Co-authored-by: Colton Baumler <[email protected]>

ctb · 2024-01-29T22:01:10Z

hmm, I like the table... I'm worried that updating it will be annoying. I'll think on it and maybe try it out!

ctb · 2024-01-30T17:29:17Z

columns work great, thanks @ccbaumler! Updated & will merge sometime after the tests pass, unless someone drops by with a new comment first ;)

ccbaumler

LGTM

release notes: https://hackmd.io/SCoVcWS1RhCH-ndQNWF-1A?view

# sourmash release v4.8.6 - release notes

Minor new features:

* re-establish `tax` gather reading flexibility (#2986)
* update JOSS paper per pyopensci review (#2964)
* Clean up and refactor `KmerMinHash::merge` in core (#2973)
* add label output & input options to `compare` and `plot`, for better customization (#2598)
* add utilities for using ictv taxonomic ranks with `sourmash tax` (#2608)

Bug fixes:

* Fix `tax metagenome` to work on gather output created with `--estimate-ani-ci` (#2952)
* fix gather memory usage issue by not accumulating `GatherResult` (#2962)
* update the CLI docs and help for `search --containment` and `prefetch` (#2971)

Documentation updates:

* update tutorial to remove bioconda & use sourmash-minimal (#2972)
* update readme with maintainers & sourmash comparison info (#2965)
* add branchwater reference; make FAQ more visible (#2984)
* update FAQ answer on k-mer size (#2899)
* update README with repostatus and pyver badges, and Windows support (#2928)
* add full column descriptions for `gather` and `prefetch` output (#2954)
* add scaled FAQ, adjust ksize answer (#2921)
* minor refactoring of gather code, small doc updates (#2953)
* Add threshold-bp and scaled relationship to faqs (#2930)

Developer updates:

* nix updates for pyopensci review (#2975)
* add scaled selection to manifest; add helper functions for collection and sig/sketch usage (#2948)
* Pre-commit updates (#2427)
* fix upload wheel CI (#2974)
* release core; bump rust core version to r0.12.1 (#2988)
* CI: macos deployment target and maturin updates (#2879)
* MRG: bump version to 4.8.6-dev, post-release (#2877)
* fix benchmark code & a few other small issues from pyOpenSci review (#2920)
* fix uploading of wheels after upload-artifact upgrade. (#2887)
* in core, enable downsample within select (#2931)

Dependabot updates:

* Bump pypa/cibuildwheel from 2.16.4 to 2.16.5 (#2981)
* Bump tempfile from 3.9.0 to 3.10.0 (#2979)
* Bump rkyv from 0.7.43 to 0.7.44 (#2978)
* Bump actions/cache from 3 to 4 (#2933)
* Bump actions/download-artifact from 3 to 4 (#2884)
* Bump actions/upload-artifact from 3 to 4 (#2883)
* Bump cachix/cachix-action from 13 to 14 (#2926)
* Bump cachix/install-nix-action from 24 to 25 (#2927)
* Bump chrono from 0.4.31 to 0.4.33 (#2957)
* Bump getrandom from 0.2.11 to 0.2.12 (#2924)
* Bump histogram from 0.8.3 to 0.8.4 (#2923)
* Bump histogram from 0.8.4 to 0.9.0 (#2935)
* Bump jinja2 from 3.1.2 to 3.1.3 (#2922)
* Bump memmap2 from 0.9.0 to 0.9.2 (#2882)
* Bump memmap2 from 0.9.2 to 0.9.3 (#2889)
* Bump memmap2 from 0.9.3 to 0.9.4 (#2958)
* Bump mymindstorm/setup-emsdk from 13 to 14 (#2934)
* Bump ouroboros from 0.18.1 to 0.18.2 (#2894)
* Bump ouroboros from 0.18.2 to 0.18.3 (#2936)
* Bump pypa/cibuildwheel from 2.16.2 to 2.16.4 (#2960)
* Bump rayon from 1.8.0 to 1.8.1 (#2937)
* Bump rkyv from 0.7.42 to 0.7.43 (#2880)
* Bump serde from 1.0.194 to 1.0.195 (#2901)
* Bump serde from 1.0.195 to 1.0.196 (#2956)
* Bump serde_json from 1.0.108 to 1.0.110 (#2896)
* Bump serde_json from 1.0.110 to 1.0.111 (#2902)
* Bump serde_json from 1.0.111 to 1.0.113 (#2955)
* Bump shlex from 1.1.0 to 1.3.0 (#2940)
* Bump supercharge/redis-github-action from 1.7.0 to 1.8.0 (#2885)
* Bump tempfile from 3.8.1 to 3.9.0 (#2893)
* Bump thiserror from 1.0.50 to 1.0.51 (#2881)
* Bump thiserror from 1.0.51 to 1.0.56 (#2897)
* Bump wasm-bindgen from 0.2.89 to 0.2.90 (#2925)
* Bump wasm-bindgen-test from 0.3.39 to 0.3.40 (#2938)
* Bump web-sys from 0.3.66 to 0.3.67 (#2939)
* Update pytest requirement from <7.5.0,>=6.2.4 to >=6.2.4,<8.1.0 (#2959)

add full column descriptions per #2812

f0a45d7

changes to intro and toc

33fbcc7

ctb changed the title ~~WIP: add full column descriptions for gather and prefetch output~~ MRG: add full column descriptions for gather and prefetch output Jan 29, 2024

update text and links

1ee6367

ccbaumler reviewed Jan 29, 2024

doc/classifying-signatures.md (outdated, resolved)

Update doc/classifying-signatures.md

c9f2fc3

Co-authored-by: Colton Baumler <[email protected]>

ctb added 2 commits January 30, 2024 09:27

table

9373112

Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…

b3e8f0f

…o doc/add_gather_prefetch_cols

ctb mentioned this pull request Jan 30, 2024

consider reorganizing gather/prefetch column documentation #2961

Open

Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…

40aae2a

…o doc/add_gather_prefetch_cols

ccbaumler approved these changes Jan 30, 2024

ctb merged commit e2c199f into latest Jan 30, 2024
37 of 38 checks passed

ctb deleted the doc/add_gather_prefetch_cols branch January 30, 2024 19:34

ctb mentioned this pull request Feb 10, 2024

MRG: 4.8.6 release branch #2963

Merged
