Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: revise documentation structure; add internals page. #2184

Merged
merged 66 commits into from
Oct 15, 2023

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Aug 6, 2022

This PR rearranges the docs to according to the https://diataxis.fr/ structure, per #2054.

New pages:

Fixes #2054 (document restructuring)
Fixes #932 (add an FAQ)
Fixes #2760 (tax preferred to lca)
Tackles #1227 (what is gather)
Fixes #971 (funding acks)
Fixes #1289 (p_match and p_query)
Fixes #1531 (document memory tradeoffs in save formats)
Fixes #1532 (order of database load/reporting)
Fixes #1609 (better gather description, f_unique_query, etc.)
Fixes #2170 (use detection)
Fixes #1881 (correlation with read mapping)
Fixes #2566 (retrieving reads)
Fixes #2775 (vision & mission)

@codecov
Copy link

codecov bot commented Aug 6, 2022

Codecov Report

Merging #2184 (b944b69) into latest (9ff523e) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           latest    #2184   +/-   ##
=======================================
  Coverage   86.27%   86.27%           
=======================================
  Files         130      130           
  Lines       14750    14750           
  Branches     2623     2623           
=======================================
  Hits        12725    12725           
  Misses       1724     1724           
  Partials      301      301           
Flag Coverage Δ
hypothesis-py 25.81% <ø> (ø)
python 92.85% <ø> (ø)
rust 50.92% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@ctb ctb changed the title [WIP] trying out a new docs structure [WIP] revise documentation structure; add internals page. Aug 16, 2022
@ctb ctb added the doc documentation content or issues label Aug 16, 2022

## Signatures and sketches

sourmash operates on sketches. Each sketch is a collection of hashes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
sourmash operates on sketches. Each sketch is a collection of hashes.
sourmash operates on sketches. Each sketch is a collection of hashes. Each hash is a whole number product of a k-mer and hash function.

Maybe link the hash function to the ## making sketches section of the document. May wish to include another line about k-mers as well. Such as,
"A k-mer is a short sub-sequence generated from a larger sequence file via a sliding window approach.

I.e. If k=3, a 3-mer of an 8 base pair DNA molecule will have 6 individual 3-mers digested into hashes to populate a sketch that are stored in a signature.

The DNA sequence:
AGTCATCG

The k-mers of the sliding window process:
[AGT].....
.[GTC]....
..[TCA]...
...[CAT]..
....[ATC].
.....[TCG]

Hashes digested from each k-mer:
2118360698
1681319365
65865673
1255238040
364627471
659934874
"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could point at https://sourmash.readthedocs.io/en/latest/kmers-and-minhash.html for a basic intro, rather than rehashing (hah!) k-mers. What do you think?

Here are my revisions to the intro para - thoughts?

sourmash operates on sketches. Each sketch is a collection of hashes,
which are in turn built from k-mers by appling a hash function
(currently always murmurhash) and a filtering function. Each sketch
is contained in a signature wrapper that contains some metadata.

thanks!

Co-authored-by: ccbaumler <[email protected]>
doc/faq.md Outdated Show resolved Hide resolved
## How do k-mer-based analyses compare with read mapping?

tl;dr very well! But it's a bit one sided: if k-mers match, reads will
map, but not necessarily vice versa. So read mapping rates are almost always
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the explicit 'vice versa'? 'if reads map, kmers will match?' What does that mean?
I guess I just don't understand the fundementals of this section, may be worth adding a longer section before the tldr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly: if reads map, k-mers may not match. Will think on how and if to improve text!

@ccbaumler
Copy link
Contributor

I ran out of time to go over internals but I am most excited about that document. What do you think about including sections for an explanation of the outputs of the columns of the csv files for prefetch and gather. Simply, a table:

Gather csv file column explanation:

column headers what the numbers/output mean
potential_false_negative
md5
n_unique_weighted_found

PS. I agree with taylor about the navigation... And this is really nice documentation. Thanks for writing it.

@ctb
Copy link
Contributor Author

ctb commented Oct 14, 2023

  • The navigation bar on the left of the index page is confusing given the contents of the index page. Since they don't mirror each other, it's very busy and I'm not sure where to dive in.

Updated! Curious what you think of the new approach @ccbaumler @taylorreiter.

@ctb
Copy link
Contributor Author

ctb commented Oct 14, 2023

ok, figured it all out - here's the latest index page + sidebar:

https://sourmash--2184.org.readthedocs.build/en/2184/

@ccbaumler
Copy link
Contributor

I like the simpler navigation using the markdown headers with the internal links in the sections.

In addition, there are three universal pages I can think of that people would want to get to quickly from the landing page:

  1. The cli options
  2. The api options
  3. The prepared databases

Could those three be included in the navigation... Maybe in Detailed usage: section in the sidebar?

@ctb
Copy link
Contributor Author

ctb commented Oct 14, 2023

added in 8144c3c @ccbaumler - thx for the nudge! should be up shortly on the RTD pull request build.

@ctb
Copy link
Contributor Author

ctb commented Oct 14, 2023

  • "Guide to the internals" is a sort of confusing name that could use a paragraph intro unpacking what that means. I see that it's not meant to be complete from the paragraph that is there, but is it internal commands? the python API internally? rust? the scientific theory? all of it? or maybe just a better title would be helpful for orienting readers to what is on the page.

I looked at the page and this is the second paragraph:

This document is a brain dump intended for expert users and sourmash
developers who want to understand how, why, and when to use various
sourmash features.

I'm not sure what more to add, or how to change it. Maybe a change to the title of the page?
I tried changing it to "A guide to the internal design and structure of sourmash". More suggestions welcome ;)

@ctb
Copy link
Contributor Author

ctb commented Oct 15, 2023

#2811 tackles updating instructions for developers ref #2707.

@ctb
Copy link
Contributor Author

ctb commented Oct 15, 2023

OK, I'm going to merge this and then deal with remaining comments later - it's gotten big and unwieldy to update at this point, so I don't want to delay it more!

@ctb ctb merged commit 0db31c8 into latest Oct 15, 2023
35 checks passed
@ctb ctb deleted the update_doc_structure branch October 15, 2023 18:24
ctb added a commit that referenced this pull request Oct 16, 2023
Follow-up to #2184 with minor corrections.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment