diff --git a/docs/redirects.yml b/docs/redirects.yml index c33f1f224..d8a770d67 100644 --- a/docs/redirects.yml +++ b/docs/redirects.yml @@ -39,3 +39,7 @@ - type: page from_url: /analysis/setup.html to_url: /tutorial/setup.html + +- type: page + from_url: /videos.html + to_url: /tutorial/videos.html diff --git a/docs/src/images/dataset-custom-data-highlighted.png b/docs/src/images/dataset-custom-data-highlighted.png new file mode 100644 index 000000000..fcd2046cf Binary files /dev/null and b/docs/src/images/dataset-custom-data-highlighted.png differ diff --git a/docs/src/images/dataset-custom-data.png b/docs/src/images/dataset-custom-data.png new file mode 100644 index 000000000..98091dae7 Binary files /dev/null and b/docs/src/images/dataset-custom-data.png differ diff --git a/docs/src/images/dataset-example-data.png b/docs/src/images/dataset-example-data.png new file mode 100644 index 000000000..950321906 Binary files /dev/null and b/docs/src/images/dataset-example-data.png differ diff --git a/docs/src/images/dataset-genomic-surveillance.png b/docs/src/images/dataset-genomic-surveillance.png new file mode 100644 index 000000000..bd8d1a471 Binary files /dev/null and b/docs/src/images/dataset-genomic-surveillance.png differ diff --git a/docs/src/images/gisaid-augur-pipeline-download.png b/docs/src/images/gisaid-augur-pipeline-download.png new file mode 100644 index 000000000..025febada Binary files /dev/null and b/docs/src/images/gisaid-augur-pipeline-download.png differ diff --git a/docs/src/images/gisaid-epicov-search.png b/docs/src/images/gisaid-epicov-search.png new file mode 100644 index 000000000..2012703a7 Binary files /dev/null and b/docs/src/images/gisaid-epicov-search.png differ diff --git a/docs/src/images/gisaid-select-sequences-10-highlighted.png b/docs/src/images/gisaid-select-sequences-10-highlighted.png new file mode 100644 index 000000000..f83a8c05b Binary files /dev/null and b/docs/src/images/gisaid-select-sequences-10-highlighted.png differ diff 
--git a/docs/src/images/gisaid-select-sequences-idaho-highlighted.png b/docs/src/images/gisaid-select-sequences-idaho-highlighted.png new file mode 100644 index 000000000..af3ec1314 Binary files /dev/null and b/docs/src/images/gisaid-select-sequences-idaho-highlighted.png differ diff --git a/docs/src/index.rst b/docs/src/index.rst index a15d4812c..dc1ec5edc 100644 --- a/docs/src/index.rst +++ b/docs/src/index.rst @@ -11,7 +11,22 @@ If something in this documentation is broken or unclear, please `open an issue < If you have a specific question, post a note over at the `discussion board `_ -- we're happy to help! .. toctree:: - :maxdepth: 2 + :maxdepth: 1 + :titlesonly: + :caption: Tutorials + :hidden: + + tutorial/intro + tutorial/setup + tutorial/example-data + tutorial/custom-data + tutorial/genomic-surveillance + tutorial/next-steps + tutorial/videos + + +.. toctree:: + :maxdepth: 1 :titlesonly: :caption: Table of contents diff --git a/docs/src/tutorial/custom-data.rst b/docs/src/tutorial/custom-data.rst new file mode 100644 index 000000000..ddaac3c84 --- /dev/null +++ b/docs/src/tutorial/custom-data.rst @@ -0,0 +1,175 @@ +Run using custom data +===================== + +This tutorial builds on the previous tutorial. Here, we will walk through how to supply your own genomic data and analyze this with the example reference data, which will serve as background context for the new data. We will explain how to get the new data from GISAID, but you could replace this data with your own private sequences if needed. + +.. contents:: Table of Contents + :local: + +Prerequisites +------------- + +1. :doc:`example-data`. This tutorial sets up the command line environment used in the following tutorial. +2. You have a GISAID account. `Register `__ if you do not have an account yet. However, registration may take a few days. 
Follow :doc:`alternative data preparation methods <../reference/data-prep/index>` in place of **Curate data from GISAID** if you wish to continue the following tutorial in the meantime. + +Setup +----- + +If you are not already there, change directory to the ``ncov`` directory: + + .. code:: text + + cd ncov + +and activate the ``nextstrain`` conda environment: + + .. code:: text + + conda activate nextstrain + +Curate data from GISAID +----------------------- + +We will retrieve 10 sequences from GISAID's EpiCoV database. + +1. Navigate to `GISAID `__ and select **Login**. + + .. image:: ../images/gisaid-homepage.png + :width: 400 + :alt: GISAID login link + +2. Log in to your GISAID account. + + .. image:: ../images/gisaid-login.png + :width: 200 + :alt: GISAID login + +3. In the top left navigation bar, select **EpiCoV** then **Search**. + + .. image:: ../images/gisaid-epicov-search.png + :width: 400 + :alt: GISAID EpiCoV Search + +4. Filter to sequences that pass the following criteria: + + 1. Has a complete genome + 2. Has high coverage + 3. Has an exact collection date + + .. image:: ../images/gisaid-select-sequences-10-highlighted.png + :width: 700 + :alt: GISAID EpiCoV select first 10 sequences + +5. Select the first 10 sequences. + +6. Select **Download** in the bottom right of the search results. +7. Select **Input for the Augur pipeline** as the download format. + + .. image:: ../images/gisaid-augur-pipeline-download.png + :width: 400 + :alt: GISAID EpiCoV download as Input for the Augur pipeline + + .. note:: + + You may see different download options, but it is fine as long as **Input for the Augur pipeline** is available. + +8. Select **Download**. +9. Download/move the ``.tar`` file into the ``ncov/data/`` directory. +10. Extract by opening the downloaded ``.tar`` file in your file explorer. 
It contains a folder prefixed with ``gisaid_auspice_input_hcov-19_`` containing two files: one ending with ``.metadata.tsv`` and another with ``.sequences.fasta``. +11. Rename the files as ``custom.metadata.tsv`` and ``custom.sequences.fasta``. +12. Move the files up to the ``ncov/data/`` directory. +13. Delete the empty ``gisaid_auspice_input_hcov-19_``-prefixed folder and the ``.tar`` file if it is still there. + +.. hint:: + + Read :doc:`the full data prep guide <../reference/data-prep/index>` for other ways to curate custom data. + +Run the workflow +---------------- + +From within the ``ncov/`` directory, run the ``ncov`` workflow using a pre-written ``--configfile``: + +.. code:: text + + nextstrain build . --cores all --configfile ncov-tutorial/custom-data.yaml + +Break down the command +~~~~~~~~~~~~~~~~~~~~~~ + +The workflow can take several minutes to run. While it is running, you can investigate the contents of ``custom-data.yaml`` (comments excluded): + +.. code-block:: yaml + + inputs: + - name: reference_data + metadata: https://data.nextstrain.org/files/ncov/open/reference/metadata.tsv.xz + sequences: https://data.nextstrain.org/files/ncov/open/reference/sequences.fasta.xz + - name: custom_data + metadata: data/custom.metadata.tsv + sequences: data/custom.sequences.fasta + + refine: + root: "Wuhan-Hu-1/2019" + + builds: + custom-build: + title: "Build with custom data and example data" + subsampling_scheme: all + auspice_config: ncov-tutorial/auspice-config-custom-data.json + +This is the same as the previous file, with some additions: + +1. A second input for the custom data, referencing the metadata and sequences files downloaded from GISAID. +2. A ``builds`` section that defines one output :term:`docs.nextstrain.org:dataset` using: + + 1. A custom name ``custom-build``, which will be used to create the dataset filename, in this case ``auspice/ncov_custom-build.json``. + 2. 
A custom title ``Build with custom data and example data``, which will be shown when you visualize the dataset in Auspice. + 3. :ref:`A pre-defined subsampling scheme ` ``all``, which tells the workflow to skip subsampling. + 4. An Auspice config file, ``ncov-tutorial/auspice-config-custom-data.json``, which defines various parameters for how the data should be visualized in Auspice. It has the following contents: + + .. code-block:: json + + { + "colorings": [ + { + "key": "custom_data", + "title": "Custom data", + "type": "categorical" + } + ], + "display_defaults": { + "color_by": "custom_data" + } + } + + This JSON does two things: + + 1. Create a new coloring ``custom_data`` which reflects a special metadata column generated by the ncov workflow. When there is more than one input, each data input produces a new final metadata column with categorical values ``yes`` or ``no`` representing whether the sequence was from the input. + 2. Set the default Color By as the new ``custom_data`` coloring. + + .. note :: + + **Build** is a widely used term with various meanings. In the context of the ncov workflow, the ``builds:`` section defines output :term:`datasets ` to be generated by the workflow (i.e. "build" a dataset). + +Visualize the results +--------------------- + +Run this command to start the :term:`docs.nextstrain.org:Auspice` server, providing ``auspice/`` as the directory containing output dataset files: + +.. code:: text + + nextstrain view auspice/ + +Navigate to http://127.0.0.1:4000/ncov/custom-build. The resulting :term:`docs.nextstrain.org:dataset` should have similar phylogeny to the previous dataset, with additional sequences: + +.. figure:: ../images/dataset-custom-data-highlighted.png + :alt: Phylogenetic tree from the "custom data" tutorial as visualized in Auspice + + +1. The custom dataset name ``custom-build`` can be seen in the dataset selector, as well as the dataset URL. +2. The custom dataset title can be seen at the top of the page. +3. 
The custom coloring is used by default. You can see which sequences are from the custom data added in this tutorial. + + .. note:: + + You may not see all 10 custom sequences - some can be filtered out due to quality checks built into the ncov workflow. diff --git a/docs/src/tutorial/example-data.rst b/docs/src/tutorial/example-data.rst new file mode 100644 index 000000000..f5dbff60a --- /dev/null +++ b/docs/src/tutorial/example-data.rst @@ -0,0 +1,99 @@ +Run using example data +====================== + +The aim of this first tutorial is to introduce our SARS-CoV-2 workflow. +To do this, we will run the workflow using a small set of reference data which we provide. +This tutorial leads on to subsequent tutorials where we will walk through more complex scenarios. + +.. contents:: Table of Contents + :local: + +Prerequisites +------------- + +1. :doc:`setup`. These instructions will install all of the software you need to complete this tutorial and others. + +Setup +----- + +1. Activate the ``nextstrain`` conda environment: + + .. code:: text + + conda activate nextstrain + +2. Change directory to the ``ncov`` directory: + + .. code:: text + + cd ncov + +3. Download the example tutorial repository into a new directory ``ncov/ncov-tutorial/``: + + .. code:: text + + git clone https://github.com/nextstrain/ncov-tutorial + +Run the workflow +---------------- + +From within the ``ncov/`` directory, run the ``ncov`` workflow using a configuration file provided in the tutorial directory: + +.. code:: text + + nextstrain build . --cores all --configfile ncov-tutorial/example-data.yaml + +Break down the command +~~~~~~~~~~~~~~~~~~~~~~ + +The workflow can take several minutes to run. While it is running, you can learn about the parts of this command: + +- ``nextstrain build .`` + - This tells the :term:`docs.nextstrain.org:Nextstrain CLI` to :term:`build ` the workflow from ``.``, the current directory. 
All subsequent command-line parameters are passed to the workflow manager, Snakemake. +- ``--cores all`` + - This required Snakemake parameter specifies the number of CPU cores to use (:doc:`more info `). +- ``--configfile ncov-tutorial/example-data.yaml`` + - ``--configfile`` is another Snakemake parameter used to configure the ncov workflow. + - ``ncov-tutorial/example-data.yaml`` is a YAML file which provides custom workflow configuration including inputs and outputs. Contents with comments excluded: + + .. code-block:: yaml + + inputs: + - name: reference_data + metadata: https://data.nextstrain.org/files/ncov/open/reference/metadata.tsv.xz + sequences: https://data.nextstrain.org/files/ncov/open/reference/sequences.fasta.xz + + refine: + root: "Wuhan-Hu-1/2019" + + This provides the workflow with one input named ``reference_data``, which is a small dataset maintained by the Nextstrain team. The metadata and sequences files are downloaded directly from the associated URLs. :doc:`See the complete list of SARS-CoV-2 datasets we provide through data.nextstrain.org <../reference/remote_inputs>`. + + The ``refine`` entry specifies the root sequence for the example GenBank data. + + For more information, see :doc:`../reference/configuration-reference`. + +The workflow output produces a new directory ``auspice/`` containing a file ``ncov_default-build.json``, which will be visualized in the following section. The workflow also produces intermediate files in a new ``results/`` directory. + +Visualize the results +--------------------- + +Run this command to start the :term:`docs.nextstrain.org:Auspice` server, providing ``auspice/`` as the directory containing output dataset files: + +.. code:: text + + nextstrain view auspice/ + +Navigate to http://127.0.0.1:4000/ncov/default-build. The resulting :term:`docs.nextstrain.org:dataset` should show a phylogeny of ~200 sequences: + +.. 
figure:: ../images/dataset-example-data.png + :alt: Phylogenetic tree from the "example data" tutorial as visualized in Auspice + +To stop the server, press :kbd:`Control-C` on your keyboard. + +.. note:: + + You can also view the results by dragging the dataset files all at once onto `auspice.us `__: + + - ``auspice/ncov_default-build.json`` + - ``auspice/ncov_default-build_root-sequence.json`` + - ``auspice/ncov_default-build_tip-frequencies.json`` diff --git a/docs/src/tutorial/genomic-surveillance.rst b/docs/src/tutorial/genomic-surveillance.rst new file mode 100644 index 000000000..35701cb04 --- /dev/null +++ b/docs/src/tutorial/genomic-surveillance.rst @@ -0,0 +1,165 @@ +Run using a genomic surveillance configuration +============================================== + +In the previous tutorial we showed how to analyze a small set of GISAID ("custom") data in the context of a small set of reference data. For genomic surveillance applications, we often have a set of data specific to our question of interest, for instance a set of sequences from a particular geographic area, which is referred to as the focal set. We want to analyze this focal set in a global context, and since there are millions of global sequences we want to subset these based on genomic proximity to our focal set. + +.. contents:: Table of Contents + :local: + +Prerequisites +------------- + +1. :doc:`custom-data`. This tutorial introduces concepts expanded by the following tutorial. +2. You have a GISAID account. `Register `__ if you do not have an account yet. However, registration may take a few days. Follow :doc:`alternative data preparation methods <../reference/data-prep/index>` in place of **Curate data from GISAID** if you wish to continue the following tutorial in the meantime. + +Setup +----- + +If you are not already there, change directory to the ``ncov`` directory: + + .. code:: text + + cd ncov + +and activate the ``nextstrain`` conda environment: + + .. 
code:: text + + conda activate nextstrain + +Curate data from GISAID +----------------------- + +We will download a focal set of Idaho sequences from GISAID's EpiCoV database. + +1. Navigate to `GISAID `__, select **Login**, and go to **EpiCoV** > **Search**. + + .. image:: ../images/gisaid-epicov-search.png + :width: 400 + :alt: GISAID EpiCoV Search + +2. Filter to sequences that pass the following criteria: + + 1. From Idaho, USA + 2. Collected within the last month (between ``2022-03-01`` and ``2022-04-01`` at the time of writing) + 3. Has a complete genome + 4. Has an exact collection date + + .. image:: ../images/gisaid-select-sequences-idaho-highlighted.png + :width: 700 + :alt: GISAID EpiCoV filter and select sequences + + .. note:: + + If your selection has more than 200 sequences, adjust the minimum date until it has 200 sequences or fewer. This ensures the tutorial does not take too long to run. + +3. Select the topmost checkbox in the first column to select all sequences that match the filters. +4. Select **Download** > **Input for the Augur pipeline** > **Download**. +5. Download/move the ``.tar`` file into the ``ncov/data/`` directory. +6. Extract by opening the downloaded ``.tar`` file in your file explorer. It contains a folder prefixed with ``gisaid_auspice_input_hcov-19_`` containing two files: one ending with ``.metadata.tsv`` and another with ``.sequences.fasta``. +7. Rename the files as ``idaho.metadata.tsv`` and ``idaho.sequences.fasta``. +8. Move the files up to the ``ncov/data/`` directory. +9. Delete the empty ``gisaid_auspice_input_hcov-19_``-prefixed folder and the ``.tar`` file if it is still there. + +Run the workflow +---------------- + +From within the ``ncov/`` directory, run the ``ncov`` workflow using a pre-written ``--configfile``: + +.. code:: text + + nextstrain build . --cores all --configfile ncov-tutorial/genomic-surveillance.yaml + +Break down the command +~~~~~~~~~~~~~~~~~~~~~~ + +The workflow can take several minutes to run. 
While it is running, you can investigate the contents of ``genomic-surveillance.yaml`` (comments excluded): + +.. code-block:: yaml + + inputs: + - name: reference_data + metadata: https://data.nextstrain.org/files/ncov/open/reference/metadata.tsv.xz + aligned: https://data.nextstrain.org/files/ncov/open/reference/aligned.fasta.xz + - name: custom_data + metadata: data/idaho.metadata.tsv + sequences: data/idaho.sequences.fasta + - name: background_data + metadata: https://data.nextstrain.org/files/ncov/open/north-america/metadata.tsv.xz + aligned: https://data.nextstrain.org/files/ncov/open/north-america/aligned.fasta.xz + + refine: + root: "Wuhan-Hu-1/2019" + + builds: + idaho: + title: "Idaho-specific genomic surveillance build" + subsampling_scheme: idaho_scheme + auspice_config: ncov-tutorial/auspice-config-custom-data.json + + subsampling: + idaho_scheme: + custom_sample: + query: --query "(custom_data == 'yes')" + max_sequences: 5000 + usa_context: + query: --query "(custom_data != 'yes') & (country == 'USA')" + max_sequences: 1000 + group_by: division year month + priorities: + type: proximity + focus: custom_sample + global_context: + query: --query "(custom_data != 'yes')" + max_sequences: 1000 + priorities: + type: proximity + focus: custom_sample + +This is similar to the previous file. Differences are outlined below, broken down per configuration section. + +inputs +****** + +1. The file paths in the second input are changed to ``idaho.metadata.tsv`` and ``idaho.sequences.fasta``. +2. There is an additional input ``background_data`` for a regional North America dataset built by the Nextstrain team, for additional context. + +builds +****** + +The output dataset is renamed ``idaho``, representative of the new custom data in the second input. + +1. The title is updated. +2. The ``subsampling_scheme`` entry is changed from the pre-defined ``all`` to ``idaho_scheme``. This is described in the following section. 
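The link between a ``builds`` entry and the subsampling scheme it names can be sanity-checked before starting a long workflow run. The following is a minimal sketch in Python and is not part of the ncov workflow: the config is mirrored as a plain dictionary (abridged from ``genomic-surveillance.yaml``) rather than parsed from YAML, and the helper names ``dataset_paths`` and ``undefined_schemes`` are hypothetical. The filename pattern is assumed from the custom data tutorial, where the build ``custom-build`` yields ``auspice/ncov_custom-build.json``.

```python
# Illustrative sketch only: these helpers are hypothetical and not part of the ncov workflow.
# The dictionary mirrors an abridged genomic-surveillance.yaml from the tutorial.
config = {
    "builds": {
        "idaho": {
            "title": "Idaho-specific genomic surveillance build",
            "subsampling_scheme": "idaho_scheme",
        },
    },
    "subsampling": {
        "idaho_scheme": {
            "custom_sample": {"max_sequences": 5000},
            "usa_context": {"max_sequences": 1000},
            "global_context": {"max_sequences": 1000},
        },
    },
}


def dataset_paths(config):
    """Map each build name to the dataset filename pattern described in the tutorials."""
    return {name: f"auspice/ncov_{name}.json" for name in config["builds"]}


def undefined_schemes(config):
    """List builds whose subsampling_scheme is not declared in this config's subsampling section."""
    defined = set(config.get("subsampling", {}))
    return [
        name
        for name, build in config["builds"].items()
        if build.get("subsampling_scheme") not in defined
    ]


print(dataset_paths(config))
print(undefined_schemes(config))
```

Pre-defined schemes such as ``all`` are not declared in the tutorial config files, so this check would list any build that uses them. It only confirms that a scheme named in the config file is also declared in that same file.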
+ +subsampling +*********** + +This is a new section that provides a subsampling scheme ``idaho_scheme`` consisting of three subsamples. Without this, the output dataset would use all the provided data, which in this case is thousands of sequences that often do not represent the underlying population proportionately. + +1. ``custom_sample`` + + - This selects sequences from the ``custom_data`` input, up to a maximum of 5000 sequences. + +2. ``usa_context`` + + - This selects sequences from the USA that are not from the ``custom_data`` input, up to a maximum of 1000 sequences. + - Sequences are subsampled evenly across all combinations of ``division``, ``year``, ``month``, with sequences genetically similar to ``custom_sample`` prioritized over other sequences. + +3. ``global_context`` + + - This selects sequences that are not from the ``custom_data`` input (that is, from both ``reference_data`` and ``background_data``), up to a maximum of 1000 sequences, with sequences genetically similar to ``custom_sample`` prioritized over other sequences. + +Visualize the results +--------------------- + +Run this command to start the :term:`docs.nextstrain.org:Auspice` server, providing ``auspice/`` as the directory containing output dataset files: + +.. code:: text + + nextstrain view auspice/ + +Navigate to http://127.0.0.1:4000/ncov/idaho. The resulting :term:`docs.nextstrain.org:dataset` should show the recent Idaho sequences against a backdrop of historical sequences: + +.. 
figure:: ../images/dataset-genomic-surveillance.png + :alt: Phylogenetic tree from the "genomic surveillance" tutorial as visualized in Auspice diff --git a/docs/src/tutorial/index.rst b/docs/src/tutorial/index.rst index b5a89a97b..f195fe0a5 100644 --- a/docs/src/tutorial/index.rst +++ b/docs/src/tutorial/index.rst @@ -9,4 +9,7 @@ Tutorial intro setup + example-data + custom-data + genomic-surveillance running diff --git a/docs/src/tutorial/intro.rst b/docs/src/tutorial/intro.rst index b81ce9915..ac8876623 100644 --- a/docs/src/tutorial/intro.rst +++ b/docs/src/tutorial/intro.rst @@ -9,9 +9,5 @@ At the end, you will be able to: - create phylogenetic trees of SARS-CoV-2 genomes from different sources including GISAID and Nextstrain-curated GenBank data - visualize the resulting trees in :term:`docs.nextstrain.org:Auspice` - define subsampling logic for your own genomic epidemiological analysis - -After completing these tutorials, you may wish to `learn more about genomic epidemiology `_ or `review all possible options to configure your SARS-CoV-2 analyses with Nextstrain <../reference/configuration.html>`_. -If you prefer video format to working through these written tutorials, check out the :doc:`video tutorial walkthrough <../videos>`. - -We also recommend `this 1-hour video overview `_ by Heather Blankenship on how to deploy Nextstrain for a Public Health lab. +If you prefer to learn about the workflow through videos, see the :doc:`demo videos `. diff --git a/docs/src/tutorial/next-steps.rst b/docs/src/tutorial/next-steps.rst new file mode 100644 index 000000000..e4ad3f597 --- /dev/null +++ b/docs/src/tutorial/next-steps.rst @@ -0,0 +1,76 @@ +========== +Next steps +========== + +Congratulations! You have completed all of the tutorials for the ncov workflow. Read on for some next steps. + +.. contents:: Table of Contents + :local: + +Create your own analysis directory +================================== + +On a web browser: + +1. 
`Sign up for a GitHub account `__ if you do not already have one. +2. Create a repository from the ``ncov-tutorial`` template repository: + + 1. Go to https://github.com/nextstrain/ncov-tutorial. + 2. Click **Use this template**. + 3. Give your repository a name. We recommend ``my-ncov-analyses`` and will use that name in the following steps. + 4. Click **Create repository from template**. + +In a command prompt: + +1. Go to the ``ncov/`` directory. +2. Clone your new repository, replacing ```` with your own username: + + .. code:: text + + git clone https://github.com//my-ncov-analyses + +3. Read the next section to learn how to modify ``genomic-surveillance.yaml``. + +Modify the genomic surveillance workflow configuration +====================================================== + +Instead of an Idaho-focused workflow config, you can provide your own data for the ``custom_data`` input. Follow the same steps in the tutorial for GISAID download but select your own set of sequences and rename your ``metadata.tsv`` and ``sequences.fasta`` files accordingly. + + .. note:: + + Workflow run time increases with the number of sequences, and the GISAID web interface has a maximum of 5,000 sequences per download. + +Then, use the following steps to customize names, titles, and context: + +1. Change the ``custom_data`` input filenames from ``idaho.metadata.tsv`` and ``idaho.sequences.fasta`` to your own. +2. Change the regional input dataset from North America to an appropriate region for your custom focal data. :doc:`See the complete list of available URLs <../reference/remote_inputs>`. +3. Rename the output dataset from ``idaho`` to your own. Note the name restrictions. +4. Rename the subsampling scheme from ``idaho_scheme`` to your own. Note the name restrictions. +5. Reword the output dataset title to your own. +6. Rename the ``usa_context`` sample and update the ``query`` accordingly. + +.. 
warning:: + + File paths in the :term:`config files ` must start with the :term:`analysis directory`. For example, in the tutorial: + + .. code:: yaml + + auspice_config: ncov-tutorial/auspice-config-custom-data.json + + Now that you have created your own analysis directory, this must be modified, e.g. + + .. code:: yaml + + auspice_config: my-ncov-analyses/auspice-config-custom-data.json + +Additional resources +==================== + +- Learn more about genomic epidemiology: + + - `An applied genomic epidemiological handbook `__ by Allison Black and Gytis Dudas + - `Genomic Epidemiology Seminar Series `__ by Chan Zuckerberg Initiative Genomic Epidemiology (CZ GEN EPI) + - `COVID-19 Genomic Epidemiology Toolkit `__ by Centers for Disease Control and Prevention (CDC) + +- :doc:`Review all possible options to configure your SARS-CoV-2 analyses with Nextstrain <../reference/configuration-reference>`. +- Watch `this 1-hour video overview `__ by Heather Blankenship on how to deploy Nextstrain for a Public Health lab. diff --git a/docs/src/videos.rst b/docs/src/tutorial/videos.rst similarity index 85% rename from docs/src/videos.rst rename to docs/src/tutorial/videos.rst index 1e38235b4..093e32a43 100644 --- a/docs/src/videos.rst +++ b/docs/src/tutorial/videos.rst @@ -2,7 +2,7 @@ Video tutorial walkthrough ************************** -If you prefer video format to working through this tutorial in the written documentation, check out the videos below. +If you prefer to learn about the workflow through videos, see the following: Running the analysis --------------------