[wip] Add tutorials

nextstrain · Mar 28, 2022 · 59ce62e
1 parent 7234254
commit 59ce62e
Showing 12 changed files with 427 additions and 0 deletions.
diff --git a/docs/src/images/dataset-custom-data-highlighted.png b/docs/src/images/dataset-custom-data-highlighted.png
diff --git a/docs/src/images/dataset-custom-data.png b/docs/src/images/dataset-custom-data.png
diff --git a/docs/src/images/dataset-example-data.png b/docs/src/images/dataset-example-data.png
diff --git a/docs/src/images/gisaid-augur-pipeline-download.png b/docs/src/images/gisaid-augur-pipeline-download.png
diff --git a/docs/src/images/gisaid-epicov-search.png b/docs/src/images/gisaid-epicov-search.png
diff --git a/docs/src/images/gisaid-select-sequences-idaho-highlighted.png b/docs/src/images/gisaid-select-sequences-idaho-highlighted.png
diff --git a/docs/src/images/gisaid-select-sequences-idaho.png b/docs/src/images/gisaid-select-sequences-idaho.png
diff --git a/docs/src/images/gisaid-select-sequences.png b/docs/src/images/gisaid-select-sequences.png
diff --git a/docs/src/tutorial/custom-data.rst b/docs/src/tutorial/custom-data.rst
@@ -0,0 +1,157 @@
+Run the workflow using custom data
+==================================
+
+In this tutorial, you will run the workflow using custom focal data in addition to the example reference data. The reference data will serve as background context for the new data.
+
+.. contents:: Table of Contents
+   :local:
+
+Prerequisites
+-------------
+
+1. :doc:`example-data`. These instructions will set up the command line environment used in this tutorial.
+2. A GISAID account. `Register <https://www.gisaid.org/registration/register/>`__ if you do not have one yet; registration may take a few days. To continue this tutorial in the meantime, follow `alternative data preparation methods <../guides/data-prep.html>`__ in place of **Curate data from GISAID**.
+
+Setup
+-----
+
+If you are not already there, change directory to the ``ncov`` directory:
+
+.. code:: text
+
+   cd ncov
+
+Curate data from GISAID
+-----------------------
+
+We will retrieve 10 sequences from GISAID's EpiCoV database.
+
+1. Navigate to `GISAID <https://www.gisaid.org/>`__ and select **Login**.
+
+   .. image:: ../images/gisaid-homepage.png
+      :width: 400
+      :alt: GISAID login link
+
+2. Log in to your GISAID account.
+
+   .. image:: ../images/gisaid-login.png
+      :width: 200
+      :alt: GISAID login
+
+3. In the top left navigation bar, select **EpiCoV** then **Search**.
+
+   .. image:: ../images/gisaid-epicov-search.png
+      :width: 400
+      :alt: GISAID EpiCoV Search
+
+4. Select the first 10 sequences.
+
+   .. image:: ../images/gisaid-select-sequences.png
+      :width: 700
+      :alt: GISAID EpiCoV search results with sequences selected
+
+5. Select **Download** in the bottom right of the search results.
+6. Select **Input for the Augur pipeline** as the download format.
+
+   .. image:: ../images/gisaid-augur-pipeline-download.png
+      :width: 400
+      :alt: GISAID download format selection
+
+   .. note::
+
+      You may see different download options. This is fine as long as **Input for the Augur pipeline** is available.
+
+7. Select **Download**.
+8. Save or move the downloaded ``.tar`` file into the ``ncov/data/`` directory.
+9. Extract the archive by opening the ``.tar`` file in your file explorer. It contains two files: one ending with ``.metadata.tsv`` and another ending with ``.sequences.fasta``.
+10. Rename the files to ``custom.metadata.tsv`` and ``custom.sequences.fasta``.
+
+Run the workflow
+----------------
+
+From within the ``ncov/`` directory, run the ``ncov`` workflow using a pre-written configuration file passed with ``--configfile``:
+
+.. code:: text
+
+   nextstrain build . --cores all --configfile ncov-tutorial/custom-data.yaml
+
+Break down the command
+~~~~~~~~~~~~~~~~~~~~~~
+
+The workflow can take several minutes to run. While it is running, you can investigate the contents of ``custom-data.yaml`` (comments excluded):
+
+.. code-block:: yaml
+
+   inputs:
+     - name: reference_data
+       metadata: https://data.nextstrain.org/files/ncov/open/reference/metadata.tsv.xz
+       sequences: https://data.nextstrain.org/files/ncov/open/reference/sequences.fasta.xz
+     - name: custom_data
+       metadata: data/custom.metadata.tsv
+       sequences: data/custom.sequences.fasta
+
+   refine:
+     root: "Wuhan-Hu-1/2019"
+
+   builds:
+     custom-build:
+       title: "Build with custom data and example data"
+       subsampling_scheme: all
+       auspice_config: ncov-tutorial/auspice-config-custom-data.json
+
+This is the same as ``example-data.yaml`` from the previous tutorial, with some additions:
+
+1. A second input for the custom data, referencing the metadata and sequences files downloaded from GISAID.
+2. A ``builds`` section that defines one output :term:`docs.nextstrain.org:dataset` using:
+
+   1. A custom name ``custom-build``
+   2. A custom title ``Build with custom data and example data``
+   3. A pre-defined subsampling scheme ``all`` (TODO: add doc link)
+   4. An Auspice config file with the contents:
+
+      .. code-block:: json
+
+         {
+           "colorings": [
+             {
+               "key": "custom_data",
+               "title": "Custom data",
+               "type": "categorical"
+             }
+           ],
+           "display_defaults": {
+             "color_by": "custom_data"
+           }
+         }
+
+      This JSON does two things:
+
+      1. Create a new coloring ``custom_data`` based on a special metadata column generated by the ncov workflow. For each data input, the workflow adds a column to the final metadata with categorical values ``yes`` or ``no`` indicating whether the sequence came from that input.
+      2. Set the default **Color By** option to the new ``custom_data`` coloring.
+
+   .. note::
+
+      **Build** is a widely used term with various meanings. In the context of the ncov workflow, the ``builds:`` section defines output :term:`datasets <docs.nextstrain.org:dataset>` to be generated by the workflow (i.e. "build" a dataset).
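+
+Because each data input produces its own ``yes``/``no`` metadata column, the same pattern could in principle be applied to the other input as well. The fragment below is only a sketch and is not part of the tutorial files: it assumes that a ``reference_data`` column exists (named after the first input, by analogy with ``custom_data``) and adds a second coloring for it. The title ``Reference data`` is likewise an illustrative choice.
+
+.. code-block:: json
+
+   {
+     "colorings": [
+       {
+         "key": "custom_data",
+         "title": "Custom data",
+         "type": "categorical"
+       },
+       {
+         "key": "reference_data",
+         "title": "Reference data",
+         "type": "categorical"
+       }
+     ],
+     "display_defaults": {
+       "color_by": "custom_data"
+     }
+   }
+
+With a config like this, the default coloring would stay the same and the second coloring would be expected to appear as an additional **Color By** option.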
+
+Visualize the results
+---------------------
+
+Run this command to start the :term:`docs.nextstrain.org:Auspice` server, providing ``auspice/`` as the directory containing output dataset files:
+
+.. code:: text
+
+   nextstrain view auspice/
+
+Navigate to ``http://127.0.0.1:4000/ncov/custom-build``. The resulting :term:`docs.nextstrain.org:dataset` should have a similar phylogeny to the previous dataset, with additional sequences:
+
+.. figure:: ../images/dataset-custom-data-highlighted.png
+   :alt: Phylogenetic tree from the "custom data" tutorial as visualized in Auspice
+
+
+1. The custom dataset name ``custom-build`` appears in the dataset selector and in the dataset URL.
+2. The custom dataset title can be seen at the top of the page.
+3. The custom coloring is used by default. You can see which sequences are from the custom data added in this tutorial.
+
+   .. note::
+
+      You may not see all 10 custom sequences. Some can be filtered out by quality checks built into the ncov workflow.
diff --git a/docs/src/tutorial/example-data.rst b/docs/src/tutorial/example-data.rst
@@ -0,0 +1,93 @@
+Run the workflow using example data
+===================================
+
+The aim of this first tutorial is to introduce our SARS-CoV-2 workflow.
+To do this, we will run the workflow using a small set of reference data which we provide.
+Subsequent tutorials build on this one and walk through more complex scenarios.
+
+.. contents:: Table of Contents
+   :local:
+
+Prerequisites
+-------------
+
+1. :doc:`setup`. These instructions will install all of the software you need to complete this tutorial and others.
+
+Setup
+-----
+
+1. Activate the ``nextstrain`` conda environment:
+
+   .. code:: text
+
+      conda activate nextstrain
+
+2. Change directory to the ``ncov`` directory:
+
+   .. code:: text
+
+      cd ncov
+
+3. Download the example tutorial repository into a new directory ``ncov-tutorial/``:
+
+   .. code:: text
+
+      git clone https://github.com/nextstrain/ncov-tutorial
+
+Run the workflow
+----------------
+
+From within the ``ncov/`` directory, run the ``ncov`` workflow using a configuration file provided in the tutorial directory:
+
+.. code:: text
+
+   nextstrain build . --cores all --configfile ncov-tutorial/example-data.yaml
+
+Break down the command
+~~~~~~~~~~~~~~~~~~~~~~
+
+The workflow can take several minutes to run. While it is running, you can learn about the parts of this command:
+
+- ``nextstrain build .``
+   - This tells the :term:`docs.nextstrain.org:Nextstrain CLI` to :term:`build <docs.nextstrain.org:build (verb)>` the workflow from ``.``, the current directory. All subsequent command-line parameters are passed to the workflow manager, Snakemake.
+- ``--cores all``
+   - This required Snakemake parameter specifies the number of CPU cores to use; ``all`` uses every available core (`more info <https://snakemake.readthedocs.io/en/stable/executing/cli.html>`_).
+- ``--configfile ncov-tutorial/example-data.yaml``
+   - ``--configfile`` is another Snakemake parameter used to configure the ncov workflow.
+   - ``ncov-tutorial/example-data.yaml`` is a YAML file which provides custom workflow configuration including inputs and outputs. Contents with comments excluded:
+
+      .. code-block:: yaml
+
+         inputs:
+           - name: reference_data
+             metadata: https://data.nextstrain.org/files/ncov/open/reference/metadata.tsv.xz
+             sequences: https://data.nextstrain.org/files/ncov/open/reference/sequences.fasta.xz
+
+         refine:
+           root: "Wuhan-Hu-1/2019"
+
+      This provides the workflow with one input named ``reference_data``, which is a small dataset maintained by the Nextstrain team. The metadata and sequences files are downloaded directly from the associated URLs. `See the complete list of SARS-CoV-2 datasets we provide through data.nextstrain.org <https://docs.nextstrain.org/projects/ncov/en/latest/reference/remote_inputs.html>`_.
+
+      The ``refine`` entry specifies the root sequence for the example GenBank data.
+
+      For more information, visit `the complete configuration guide <../reference/configuration.html>`_.
+
+The workflow produces a new directory ``auspice/`` containing a file ``ncov_default-build.json``, which will be visualized in the following section. It also produces intermediate files in a new ``results/`` directory.
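+
+As a hedged illustration of how the output filename relates to the build name, the fragment below sketches a ``builds`` entry of the kind used in :doc:`custom-data`. The build name ``my-build`` and its title are hypothetical and are not part of ``example-data.yaml``. By analogy with ``default-build`` above, such an entry would be expected to produce ``auspice/ncov_my-build.json``:
+
+.. code-block:: yaml
+
+   # Hypothetical addition to a workflow config file; names are illustrative only.
+   builds:
+     my-build:                      # assumed build name, expected to appear in the output filename
+       title: "My first build"      # assumed title for the resulting dataset
+       subsampling_scheme: all      # pre-defined scheme, as used in custom-data.yaml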
+
+Visualize the results
+---------------------
+
+Run this command to start the :term:`docs.nextstrain.org:Auspice` server, providing ``auspice/`` as the directory containing output dataset files:
+
+.. code:: text
+
+   nextstrain view auspice/
+
+Navigate to ``http://127.0.0.1:4000/ncov/default-build``. The resulting :term:`docs.nextstrain.org:dataset` should show a phylogeny of ~200 sequences:
+
+.. figure:: ../images/dataset-example-data.png
+   :alt: Phylogenetic tree from the "example data" tutorial as visualized in Auspice
+
+.. note::
+
+   You can also view the results by dragging the file ``auspice/ncov_default-build.json`` onto `auspice.us <https://auspice.us>`__.