-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ingest tutorials #195
Add ingest tutorials #195
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable to me
|
||
The produced ``ingest/data/raw_metadata.tsv`` will contain all of the fields available from NCBI Datasets. | ||
Note that the headers in this file use the human readable ``Name`` of the | ||
`NCBI Datasets' available fields <https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields>`_, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When designing the ingest workflow did you consider having an intermediate output which was exactly this (raw_metadata.tsv
, or maybe 2 files if the sequences were split out into a FASTA)? This would remove the need for users to run these 5 steps which in turn would encourage comparing the NCBI data vs the curated data as a normal part of ingest.
The files aren't too large, so I don't think space is a concern, but if it were we could easily mark them with temp()
and then this section would be "run with --notemp
to get the raw NCBI data to compare against the curated data".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I had not...This is definitely left over from working with the SARS-CoV-2 data, where I had focused on only using metadata columns that are needed for the workflow to avoid large files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For huge datasets this is the right call, but we don't need to build every pipeline to handle the data requirements of SC2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the default config in the pathogen-repo-guide to include all of the fields of the raw metadata so that people can remove fields as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated tutorial to use the new Snakemake target from the pathogen-repo-guide in 43b0f77
@@ -0,0 +1,37 @@ | |||
1. Enter an interactive Nextstrain shell to be able to run the NCBI Datasets CLI commands without installing them separately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are many ways to get to the some outcome and docs like these always have to balance covering all the ways to get there vs detailing just one. (Examples are everywhere - we detail nextstrain CLI rather than the snakemake commands it's actually running, we assume non-ambient runtimes.) Sometimes adding little pointers to indicate the links between the different methods would be really helpful. In this example, we are running commands which are largely recreating the steps we just ran in the "running an ingest workflow" section - steps 2+3 are identical to snakemake --cores 1 --notemp fetch_ncbi_dataset_package
.
I'd change the introduction slightly to indicate this:
- If you want to see the uncurated NCBI Datasets data to decide what changes
- you would like to make to the workflow, you can run the following:
+ If you want to see the uncurated NCBI Datasets data to decide what changes
+ you would like to make to the workflow, you can download the raw NCBI data
+ by manually running commands very similar to those the pipeline used earlier
+ when "running an ingest workflow"
I'd be interested in adopting a consistent approach throughout the docs where we add little hints to explain links between what we're detailing and the other ways it can be / is done. Using this as an example (and based off the text at the end of this snippet), I'm imagining a hint section after the commands such as:
These commands are actually run by the ingest pipeline with some minor differences. The ingest pipeline restricts the columns to those defined in
config["ncbi_datasets_fields"]
, modifies the header names usingconfig["curate"]["field_map"]
(which maps the human-readable strings NCBI uses to ones more typical in Nextstrain pipelines), and changes the actual data via a number of curation steps.
I can add similar comments throughout these docs if you are interested in adopting this style, but wanted to discuss here first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs like these always have to balance covering all the ways to get there vs detailing just one.
I have the opposite expectation for a tutorial: it should teach how to get a task done using one straight-forward path. I think including multiple/all ways to achieve the task can be distracting and overwhelming for a new user. We've heard this from people in office hours as "There's so many ways to achieve X. What's the one way Nextstrain recommends doing it?"
I definitely think it's helpful for people who want to know more to have links to other explanation/how-to docs that include details on what's happening in the black box and how to do a task multiple ways. Although this does go back to the issue you raised in discussion of discoverability of these docs...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think there's a range of views out there (including in our group).
I think including multiple/all ways to achieve the task can be distracting and overwhelming for a new user.
I'm not asking for documenting all of the ways, just adding little hints so that people can join the dots between concepts that may appear separate / independent. In this case it's providing clarifying hints that this section is very related to the section they've just run, even though on the surface they seem completely different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added your suggested hints in 6019d50. Feel free to add more!
|
||
.. include:: ../../snippets/uncurated-ncbi-dataset.rst | ||
|
||
We'll walk through an example custom config to include an additional column in the curated output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[minor] - can you add a small subheading here to demarcate this from the snippet contents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added subheading in 6019d50
========== | ||
|
||
* Run the `zika phylogenetic workflow <https://github.com/nextstrain/zika/tree/main/phylogenetic>`_ with new ingested data as input | ||
by running |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this won't include any of the data we (may) have just added in "Advanced usage: Customizing the ingest workflow" because the phylo workflow doesn't know about ingest/results/merged-metadata.tsv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right! I'll update to use ingest/results/merged-metadata.tsv
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I remembered that I didn't want to point to the new merged-metadata.tsv
because I didn't want to go into detail of how to add the new columns to the Auspice config of the phylogenetic workflow. That seems like it needs to be a whole other tutorial...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just a line saying "If you've customized the ingest workflow then you may need to modify the phylo workflow to use your ingested data file if it's not results/metadata.tsv
, and other modifications to your phylo workflow may be needed, as appropriate." Or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in f021e76
|
||
$ mkdir ingest/build-configs/tutorial | ||
|
||
2. Create a new config file ``ingest/build-configs/tutorial/config.yaml`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing this big YAML block made me think "whoa, were did all these values come from"? There are a couple of changes you may want to consider here:
- I'd move the text below the YAML to above it. So instead of starting with a big config file¹, start with "the parts of the config we want to change are
ncbi_datasets_fields
(because...),field_map
(because...) andmetadata_column
(because). - Then detail that we copy certain sections from the existing config then add the changes just described.
¹ The config YAML alone takes up the full height of a laptop screen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in 9255b1a
|
||
We highly encourage you to go through the commands and custom scripts used in the ``curate`` rule within ``ingest/rules/curate.smk`` | ||
to gain a deeper understanding of how they work. | ||
We will give a brief overview of each step and their relevant config parameters defined in ``ingest/defaults/config.yaml`` to help you get started. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are mostly taking a template curate pipeline and describing the config values you can change. The other approach would be to detail the curate commands themselves and teach people how to add them to the bash command + abstract their parameters to the YAML config. This reminds me of a similar situation in visualisation where some people build libraries of charts while others expose the methods to build such charts (e.g. see Leland Wilkinson quote here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other approach would be to detail the curate commands themselves and teach people how to add them to the bash command + abstract their parameters to the YAML config.
Hmm, that seems to be stepping into Snakemake tutorial territory...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'm wondering how feasible it is for a user to develop an ingest pipeline without knowing Snakemake. (This is the analogue of designing/documenting a charting library.) If that's going to work for a number of users then great! The phrasing you used here ("highly encourage you to go through the [snakemake rules]", "we will give you a brief overview" etc) made it seem to me like knowing Snakemake was a prerequisite.
Anyway, the docs as they stand are an improvement so I wouldn't want this to hold up their merge. Rather something to think about as we move forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made it seem to me like knowing Snakemake was a prerequisite.
Yup! I have Snakemake listed as a prerequisite for the "Creating an ingest workflow" tutorial at the top:
docs.nextstrain.org/src/tutorials/creating-a-pathogen-repo/creating-an-ingest-workflow.rst
Lines 22 to 24 in 79a0176
Additionally, to follow this tutorial, you will need: | |
* An understanding of `Snakemake <https://snakemake.readthedocs.io/en/stable/>`_ workflows. |
========== | ||
|
||
* Learn more about :doc:`augur curate commands <augur:usage/cli/curate/index>` | ||
* Learn how to create a phylogenetic workflow (TKTK) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't use TKTK in live docs (assuming you aren't going to add the tutorial it in this PR). I'd replace this with "We're planning on writing a similar workflow for the phylogenetic pipeline, but until that's ready you can the best place to learn about these is ..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup! Thank you for catching that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in f021e76
|
||
1. Add your pathogen's NCBI taxonomy ID to the ``ncbi_taxon_id`` parameter | ||
2. If there are other NCBI Datasets fields you would like to include in the download, you can add them to the ``ncbi_datasets_fields`` parameter | ||
3. Skip to the :ref:`curation-steps`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least in zika we do some extra steps (format_ncbi_datasets_ndjson
) - do you want to discuss them here? (Maybe I've missed it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I had excluded it because it's not a configurable step and seemed like an implementation detail that's not immediately useful to first time users.
Provides an easy way for first time users to get the uncurated metadata from NCBI Datasets commands by running the ingest workflow with the specified target `data/ncbi_dataset_report.tsv`. Afterwards, users can easily remove fields that are not needed as part the workflow to reduce the file size and save space. Prompted by @jameshadfield in review of the tutorial¹ and resolves #30. ¹ nextstrain/docs.nextstrain.org#195 (comment)
Provides an easy way for first time users to get the full uncurated metadata from NCBI Datasets commands by running the ingest workflow with the specified target `dump_ncbi_dataset_report`. They can then inspect and explore the raw data to determine if they want to configure the workflow to use additional fields from NCBI. The rule is added to `fetch_from_ncbi.smk` to make it easy to run without additional configs. Note that it is not run as part of the default workflow and only intended to be used as a specified target. Prompted by @jameshadfield in review of the tutorial¹ and resolves #30. ¹ nextstrain/docs.nextstrain.org#195 (comment) Co-authored-by: James Hadfield <[email protected]>
Based on contents of the initial draft of the tutorial https://docs.google.com/document/d/1_16VYT5MU8oXJ4t6HUHp_smx_kgF9OMsORSCldee1_0/edit#heading=h.opa6649r0u7j
This commit's main purpose is to split out the tutorials for more complex workflows into a separate section than our existing "Quickstart" zika-tutorial. This new section of tutorials was motivated by the SAB meeting¹ that emphasized the need to keep the simple zika-tutorial for new users. The new section only holds the new "Running an ingest workflow" docs, but will be expanded in the future to cover how to run and customize other workflows within an existing pathogen repository. ¹ https://docs.google.com/document/d/1zDwbn16ZRlMMcKLGWWVJ3lYPNwo5_v_WmgBsHv30_lU/edit?disco=AAABKGP5kEc
Intended to mirror the new "Using a pathogen repo" tutorials with tutorials on how to set up each individual workflow in a pathogen repo.
Based on contents of the initial draft of the tutorial https://docs.google.com/document/d/1_16VYT5MU8oXJ4t6HUHp_smx_kgF9OMsORSCldee1_0/edit#heading=h.r95jmyuit0s0 Split out the steps to get the uncurated NCBI Dataset data into a snippet that can be shared between the two ingest tutorials.
Explain what each step of the example command is doing to give readers a better understanding.
The pathogen-repo-guide will be updated to included the target `dump_ncbi_dataset_report` to easily generate the uncurated NCBI Dataset metadata.¹ This commit updates the tutorial to use this new target so that the user does not need to manually run the extra commands to see the raw metadata. ¹ nextstrain/pathogen-repo-guide#38
The previous commit updates the creating-an-ingest-workflow tutorial to use a custom Snakemake target to generate the uncurated NCBI Dataset metadata so we no longer need the separate snippet. I am still providing the extra commands to generate the uncurated data in the running-an-ingest-workflow tutorial because I wanted users to be able to see the uncurated data even if the existing pathogen workflow does _not_ include the custom Snakemake target.
Include our intention to write phylogenetic tutorials
Replace big YAML block with written instructions to create the custom config file based on suggestion from @jameshadfield in review. #195 (comment)
79a0176
to
9255b1a
Compare
Rebased to include the latest changes merged in #191 |
Merging since I got a round of 👍 in today's priorities meeting. |
.. code-block:: | ||
|
||
$ nextstrain build ingest | ||
Using profile profiles/default and workflow specific profile profiles/default for setting default command line arguments. | ||
Building DAG of jobs… | ||
[...a lot of output...] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of these new code-blocks which show an example shell session with a prompt and command and optionally output should be .. code-block:: console
to properly handle the various bits inside with highlighting and the copy/paste button.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's good to know! I just tried updating to .. code-block::console
locally, I see a difference in syntax highlighting but the copy/paste behavior is still the same where it only copies the command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding missing code-block languages in #197
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…but the copy/paste behavior is still the same where it only copies the command.
Ah! I thought we were using the syntax-aware prompt/output exclusion method of sphinx-copybutton, but we're using the pattern-matching exclusion method.
All good then.
Be explicit about code-block language to get the proper syntax highlighting. Prompted by post-merge review #195 (comment)
Description of proposed changes
Tutorial contents are based on the first draft of the tutorials.
This PR adds two new sections to our tutorials with two separate ingest tutorials
The new sections of tutorials were motivated by the SAB meeting that emphasized the need to keep the simple zika-tutorial for new users.
Related issue(s)
Resolves #179
Related to #188
This PR is based on #191
Checklist