From 33b43774db361280a07fb464dca51366ea35dd28 Mon Sep 17 00:00:00 2001
From: Victor Lin <13424970+victorlin@users.noreply.github.com>
Date: Fri, 1 Apr 2022 16:10:48 -0700
Subject: [PATCH] Update existing content

- Organizational changes:
  - Expose pages in main sidebar (6e35cdc6)
  - Move pages to guides:
    - "Update the workflow" section from tutorial/setup -> guides/update-workflow
    - reference/customizing-analysis -> guides/workflow-config-file
    - reference/customizing-visualization -> guides/customizing-visualization
    - reference/data-prep -> guides/data-prep
  - Split "Data Prep" into 3 pages
  - Add reference/glossary
  - Rename reference files:
    - configuration -> workflow-config-file
    - orientation-files -> files
    - orientation-workflow -> nextstrain-overview
    - tutorial/running -> troubleshoot
  - Remove files:
    - reference/multiple_inputs
- Changes across multiple files:
  - Fix MD->reST conversion glitches
  - Reference "builds.yaml" as "workflow config file"
  - Remove my_profiles/ references
  - Reference glossary terms where appropriate
  - Use sphinx reference directive [1] to link to specific sections
- Per-file changes:
  - tutorial/setup
    - Remove basic example in setup page (replaced by the "example data" tutorial)
  - reference/gisaid-search
    - Remove off-topic line
  - reference/nextstrain-overview
    - Capitalize Augur, Auspice, Snakemake, Nextflow
    - Describe build vs. workflow
  - reference/files
    - Re-organize page with "user files" vs. "internal files"
  - reference/troubleshoot
    - Formerly tutorial/running, it has been stripped down to just troubleshooting content
  - dev_docs
    - Link to docs for installation/setup

[1]: https://www.sphinx-doc.org/en/master/usage/restructuredtext/roles.html#cross-referencing-arbitrary-locations
---
 defaults/description.md                       |   9 +-
 docs/dev_docs.md                              |  46 +-
 docs/redirects.yml                            |  28 +-
 ...meters.css => configuration-reference.css} |  12 +-
 docs/src/conf.py                              |   2 +-
 docs/src/guides/customizing-visualization.rst | 111 ++++
 docs/src/guides/data-prep.rst                 | 472 ------------------
 docs/src/guides/data-prep/gisaid-full.rst     | 147 ++++++
 docs/src/guides/data-prep/gisaid-search.rst   | 209 ++++++++
 docs/src/guides/data-prep/index.rst           |  22 +
 docs/src/guides/data-prep/local-data.rst      |  84 ++++
 docs/src/guides/index.rst                     |  11 -
 docs/src/guides/update-workflow.rst           |  35 ++
 .../workflow-config-file.rst}                 |  56 ++-
 ...e_build.png => basic_nextstrain_build.png} | Bin
 docs/src/index.rst                            |  48 +-
 .../reference/customizing-visualization.rst   | 111 ----
 docs/src/reference/data_submitter_faq.rst     |  30 +-
 docs/src/reference/files.rst                  |  78 +++
 docs/src/reference/glossary.rst               |  38 ++
 docs/src/reference/index.rst                  |  20 -
 docs/src/reference/metadata-fields.rst        | 118 +++--
 docs/src/reference/multiple_inputs.md         | 225 ---------
 docs/src/reference/naming_clades.rst          |  25 +-
 docs/src/reference/nextstrain-overview.rst    |  51 ++
 docs/src/reference/orientation-files.rst      |  50 --
 docs/src/reference/orientation-workflow.rst   |  47 --
 docs/src/reference/remote_inputs.rst          |  14 +-
 docs/src/reference/troubleshoot.rst           |  63 +++
 ...iguration.rst => workflow-config-file.rst} |  24 +-
 docs/src/tutorial/custom-data.rst             |   4 +-
 docs/src/tutorial/example-data.rst            |   2 +-
 docs/src/tutorial/genomic-surveillance.rst    |   2 +-
 docs/src/tutorial/index.rst                   |  15 -
 docs/src/tutorial/next-steps.rst              |   4 +-
 docs/src/tutorial/running.rst                 | 138 -----
 docs/src/tutorial/setup.rst                   |  66 +--
 docs/src/visualization/index.rst              |  14 -
 docs/src/visualization/interpretation.rst     |   4 +-
 docs/src/visualization/narratives.rst         |   7 +-
 docs/src/visualization/sharing.rst            |  23 +-
 41 files changed, 1098 insertions(+), 1367 deletions(-)
 rename docs/src/_static/css/{configuration-parameters.css => configuration-reference.css} (69%)
 create mode 100644 docs/src/guides/customizing-visualization.rst
 delete mode 100644 docs/src/guides/data-prep.rst
 create mode 100644 docs/src/guides/data-prep/gisaid-full.rst
 create mode 100644 docs/src/guides/data-prep/gisaid-search.rst
 create mode 100644 docs/src/guides/data-prep/index.rst
 create mode 100644 docs/src/guides/data-prep/local-data.rst
 delete mode 100644 docs/src/guides/index.rst
 create mode 100644 docs/src/guides/update-workflow.rst
 rename docs/src/{reference/customizing-analysis.rst => guides/workflow-config-file.rst} (56%)
 rename docs/src/images/{basic_snakemake_build.png => basic_nextstrain_build.png} (100%)
 delete mode 100644 docs/src/reference/customizing-visualization.rst
 create mode 100644 docs/src/reference/files.rst
 create mode 100644 docs/src/reference/glossary.rst
 delete mode 100644 docs/src/reference/index.rst
 delete mode 100644 docs/src/reference/multiple_inputs.md
 create mode 100644 docs/src/reference/nextstrain-overview.rst
 delete mode 100644 docs/src/reference/orientation-files.rst
 delete mode 100644 docs/src/reference/orientation-workflow.rst
 create mode 100644 docs/src/reference/troubleshoot.rst
 rename docs/src/reference/{configuration.rst => workflow-config-file.rst} (95%)
 delete mode 100644 docs/src/tutorial/index.rst
 delete mode 100644 docs/src/tutorial/running.rst
 delete mode 100644 docs/src/visualization/index.rst

diff --git a/defaults/description.md b/defaults/description.md
index d203adea8..8d3454dc4 100644
--- a/defaults/description.md
+++ b/defaults/description.md
@@ -1,5 +1,6 @@
-Hi! This is the default description.
Edit me in `my_profiles//description.md`, and add this line to your `my_profiles/builds.yaml` file: -``` -files: - description: my_profiles//description.md +Hi! This is the default description, written in [Markdown](https://www.markdownguide.org/getting-started/). You can change this by creating another Markdown file and referencing it in the workflow config file: + +```yaml +files: + description: path/to/description.md ``` diff --git a/docs/dev_docs.md b/docs/dev_docs.md index 23c165604..9385c820c 100644 --- a/docs/dev_docs.md +++ b/docs/dev_docs.md @@ -7,53 +7,9 @@ 1. [Running](#running) 1. [Releasing new workflow versions](#releasing-new-workflow-versions) -## Setup - -Please see [the main Nextstrain docs](https://nextstrain.org/docs/getting-started/introduction#open-source-tools-for-the-community) for instructions for installing the Nextstrain bioinformatics pipeline (Augur) and visualization tools (Auspice). - -## Data - -In order to run the Nextstrain build you must provision `./data/sequences.fasta` and `./data/metadata.tsv`. -We've included a test set of sequences that are publicly available via Genbank as `./example_data/sequences.fasta`. - ## Running -Please see [these docs](./docs/running.md) for instructions on how to run this build yourself. - -The resulting output JSON at `auspice/ncov.json` can be visualized by running `auspice view --datasetDir auspice` or `nextstrain view auspice/` depending on local vs containerized installation. - -### Finalizing automated builds - -To run a regional build, be sure to update the list of regions in `nextstrain_profiles/nextstrain-gisaid/builds.yaml`. -You can run all builds in parallel with the following command. 
- -```bash -snakemake --profile nextstrain_profiles/nextstrain-gisaid all_regions -``` - -Or you can specify final or intermediate output files like so: - -```bash -# subsampled regional focal -snakemake --profile nextstrain_profiles/nextstrain-gisaid auspice/ncov_europe.json - -# subsampled global -snakemake --profile nextstrain_profiles/nextstrain-gisaid auspice/ncov_global.json -``` - -To update ordering/lat_longs after AWS download: - -```bash -snakemake --touch --forceall --profile nextstrain_profiles/nextstrain-gisaid -snakemake --profile nextstrain_profiles/nextstrain-gisaid clean_export_regions -snakemake --profile nextstrain_profiles/nextstrain-gisaid export_all_regions -``` - -When done adjusting lat-longs & orders, remember to run the following command to produce the final Auspice files. - -```bash -snakemake --profile nextstrain_profiles/nextstrain-gisaid all_regions -``` +Visit [the workflow documentation](https://docs.nextstrain.org/projects/ncov) for instructions on how to set up and run the workflow. 
## Releasing new workflow versions diff --git a/docs/redirects.yml b/docs/redirects.yml index d8a770d67..c41dfb92a 100644 --- a/docs/redirects.yml +++ b/docs/redirects.yml @@ -14,27 +14,27 @@ - type: page from_url: /analysis/customizing-analysis.html - to_url: /reference/customizing-analysis.html + to_url: /guides/workflow-config-file.html - type: page from_url: /analysis/customizing-visualization.html - to_url: /reference/customizing-visualization.html + to_url: /guides/customizing-visualization.html - type: page from_url: /analysis/data-prep.html - to_url: /guides/data-prep.html + to_url: /guides/data-prep/index.html - type: page from_url: /analysis/orientation-files.html - to_url: /reference/orientation-files.html + to_url: /reference/files.html - type: page from_url: /analysis/orientation-workflow.html - to_url: /reference/orientation-workflow.html + to_url: /reference/nextstrain-overview.html - type: page from_url: /analysis/running.html - to_url: /tutorial/running.html + to_url: /reference/troubleshoot.html - type: page from_url: /analysis/setup.html @@ -43,3 +43,19 @@ - type: page from_url: /videos.html to_url: /tutorial/videos.html + +- type: page + from_url: /reference/configuration.html + to_url: /reference/workflow-config-file.html + +- type: page + from_url: /reference/multiple_inputs.html + to_url: / + +- type: page + from_url: /visualization/index.html + to_url: /visualization/sharing.html + +- type: page + from_url: /guides/index.html + to_url: /guides/run-analysis-on-terra.html diff --git a/docs/src/_static/css/configuration-parameters.css b/docs/src/_static/css/configuration-reference.css similarity index 69% rename from docs/src/_static/css/configuration-parameters.css rename to docs/src/_static/css/configuration-reference.css index bbcf53aba..fe44fd464 100644 --- a/docs/src/_static/css/configuration-parameters.css +++ b/docs/src/_static/css/configuration-reference.css @@ -1,5 +1,5 @@ -/* Custom CSS to be applied to the reference/configuration.rst 
- page. That page defines a customclass of .configurationparameters */ +/* Custom CSS to be applied to the reference/workflow-config-file.rst + page. That page defines a custom class of .configuration-reference */ /* We detail a lot of nested (snakemake) configuration entries in the @@ -9,21 +9,21 @@ The following style changes are intended to convey that certain config entries are children of a higher-level config, rather than being top-level config parameters themselves */ -.configurationparameters h4 { +.configuration-reference h4 { font-size: 100%; } /* Pad lists generated by a (local) contents directive showing sub-keys */ -.configurationparameters section > section > div.contents.local.topic { +.configuration-reference section > section > div.contents.local.topic { margin-left: 24px; /* same as a nested
  • */ margin-top: -20px; /* CSS can't select previous sibling, FYI */ } -.configurationparameters section > section > div.contents.local.topic > ul > li { +.configuration-reference section > section > div.contents.local.topic > ul > li { list-style: circle; } /* pad out their siblings (which come _after_ the list) so that they are in line with the start of text in the preceding
  • element */ -.configurationparameters section > section > div.contents.local.topic ~ * { +.configuration-reference section > section > div.contents.local.topic ~ * { margin-left: 48px; } diff --git a/docs/src/conf.py b/docs/src/conf.py index 618182c2e..2749e4c33 100644 --- a/docs/src/conf.py +++ b/docs/src/conf.py @@ -102,7 +102,7 @@ def prose_list(items): # or fully qualified paths (eg. https://...) html_css_files = [ 'css/custom.css', - 'css/configuration-parameters.css' + 'css/configuration-reference.css' ] # -- Cross-project references ------------------------------------------------ diff --git a/docs/src/guides/customizing-visualization.rst b/docs/src/guides/customizing-visualization.rst new file mode 100644 index 000000000..5c24d5e6d --- /dev/null +++ b/docs/src/guides/customizing-visualization.rst @@ -0,0 +1,111 @@ +Customizing visualization +========================= + +Visualization options can be configured in either a :term:`workflow config file` or an :term:`Auspice config file`, depending on the option. + +.. contents:: Table of Contents + :local: + +Options in the workflow config file +----------------------------------- + +These options can be coded into the workflow config file directly without requiring a custom Auspice config file. + +Custom color schemes +~~~~~~~~~~~~~~~~~~~~ + +To specify a custom color scale: + +1. Add a ``colors.tsv`` file, where each line is a tab-delimited list of a metadata column name; a metadata value; and a corresponding hex code. Example: + + :: + + country Russia #5E1D9D + country Serbia #4D22AD + country Europe #4530BB + ... + +2. Update your workflow config file with a reference: + + .. code:: yaml + + files: + colors: "my-ncov-analyses/colors.tsv" + +Changing the dataset description +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The dataset description, which appears below the visualizations, is read from a file specified in the workflow config file. 
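As a quick sanity check of the ``colors.tsv`` format described above, each non-blank row should hold exactly three tab-separated fields ending in a hex color code. A short Python sketch that flags malformed rows (this helper is illustrative only and is not part of the workflow):

```python
import re

HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")

def check_colors_tsv(text):
    """Return (line_number, problem) pairs for a colors.tsv payload.

    Each non-blank line is expected to hold three tab-separated fields:
    metadata column, metadata value, and a hex color code.
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are harmless
        fields = line.split("\t")
        if len(fields) != 3:
            problems.append((number, "expected 3 tab-separated fields"))
        elif not HEX_COLOR.match(fields[2]):
            problems.append((number, f"not a hex color: {fields[2]!r}"))
    return problems

assert check_colors_tsv("country\tRussia\t#5E1D9D\ncountry\tSerbia\t#4D22AD") == []
assert check_colors_tsv("country Russia #5E1D9D") == [(1, "expected 3 tab-separated fields")]
```

A common failure mode the check catches is pasting space-separated rather than tab-separated values, which the workflow would otherwise silently misread.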
Per-build descriptions can be set by specifying them in the workflow config file: + +.. code:: yaml + + builds: + north-america: # name of the build + description: my-ncov-analyses/north-america-description.md + +If that is not provided, then a per-run description is used, also specified in the workflow config file: + +.. code:: yaml + + files: + description: my-ncov-analyses/my_description.md + +Options in the Auspice config file +---------------------------------- + +These options require creating an Auspice config file, used to configure :term:`docs.nextstrain.org:Auspice`. It is specified in the workflow config file using the ``auspice_config`` entry. Example: + +.. code:: yaml + + auspice_config: ncov-tutorial/auspice-config-custom-data.json + +This overrides the default Auspice config file, ``defaults/auspice_config.json``. + +Adding custom metadata fields to color by +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +1. Add a :doc:`valid metadata column <./data-prep/local-data>` to your ``metadata.tsv`` +2. Add an entry to the ``colorings`` block of the Auspice config file: + + .. code:: json + + "colorings": [ + { + "key": "location", + "title": "Location", + "type": "categorical" + }, + { + "key": "metadata_column_name", + "title": "Display name for interface", + "type": "" + } + ] + +Choosing defaults +~~~~~~~~~~~~~~~~~ + +You can specify the default view in the ``display_defaults`` block of an Auspice config file: + +.. code:: json + + "display_defaults": { + "color_by": "division", + "distance_measure": "num_date", + "geo_resolution": "division", + "map_triplicate": true, + "branch_label": "none" + }, + +Choosing panels to display +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Similarly, you can choose which panels to enable in the ``panels`` block: + +.. 
code:: json + + "panels": [ + "tree", + "map", + "entropy" + ] diff --git a/docs/src/guides/data-prep.rst b/docs/src/guides/data-prep.rst deleted file mode 100644 index 345ae3680..000000000 --- a/docs/src/guides/data-prep.rst +++ /dev/null @@ -1,472 +0,0 @@ -Preparing your data -=================== - -.. raw:: html - -
    - -We’ve prepared an example dataset in the ``data`` directory. If you’d like to move ahead with this tutorial with this example dataset, skip to the next section. If you’d like to prepare your own data, read on. - -.. raw:: html - -
    - -To use Nextstrain to analyze your own data, you’ll need to prepare two files: - -1. A ``fasta`` file with viral genomic sequences -2. A corresponding ``tsv`` file with metadata describing each sequence - -We describe the following ways to prepare data for a SARS-CoV-2 analysis: - -1. Prepare your own local data for quality control prior to submission to a public database. -2. Curate data from GISAID search and downloads to prepare a regional analyses based on local sequences identified through GISAID’s search interface and contextual sequences for your region from GISAID’s “nextregions” downloads. -3. Curate data from the full GISAID database to prepare a custom analysis by downloading the full database and querying for specific strains locally with `Augur `__. - -Prepare your own local data ---------------------------- - -Formatting your sequence data -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The first 2 lines in ``data/sequences.fasta`` look like this: - -:: - - >Wuhan-Hu-1/2019 - ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATC..... - -**The first line is the ``strain`` or ``name`` of the sequence.** Lines with names in FASTA files always start with the ``>`` character (this is not part of the name), and may not contain spaces or ``()[]{}|#><``. Note that “strain” here carries no biological or functional significance and should largely be thought of as synonymous with “sample.” - -The sequence itself is a `consensus genome `__. - -**By default, sequences less than 27,000 bases in length or with more than 3,000 ``N`` (unknown) bases are omitted from the analysis.** For a basic QC and preliminary analysis of your sequence data, you can use `clades.nextstrain.org `__. This tool will check your sequences for excess divergence, clustered differences from the reference, and missing or ambiguous data. In addition, it will assign nextstrain clades and call mutations relative to the reference. 
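The default QC thresholds described above (minimum length 27,000 bases, at most 3,000 ``N`` bases) can be mimicked in a few lines of Python. This is an illustration of the rule, not the workflow's actual implementation:

```python
def passes_default_qc(sequence, min_length=27_000, max_n=3_000):
    """Mimic the default QC described above: sequences shorter than
    27,000 bases, or with more than 3,000 N (unknown) bases, are
    omitted from the analysis."""
    seq = sequence.upper()
    return len(seq) >= min_length and seq.count("N") <= max_n

assert passes_default_qc("A" * 29_000)
assert not passes_default_qc("A" * 20_000)                # too short
assert not passes_default_qc("A" * 26_000 + "N" * 3_500)  # too many N bases
```

Both thresholds are configurable in the workflow, so treat the defaults here as a starting point rather than fixed limits.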
- -Formatting your metadata -~~~~~~~~~~~~~~~~~~~~~~~~ - -Nextstrain accommodates many kinds of metadata, so long as it is in a ``TSV`` format. A ``TSV`` is a text file, where each row (line) represents a sample and each column (separated by tabs) represents a field. - - If you’re unfamiliar with TSV files, don’t fret; it’s straightforward to export these directly from Excel, which we’ll cover shortly. - -Here’s an example of the first few columns of the metadata for a single strain, including the header row. *(Spacing between columns here is adjusted for clarity, and only the first 6 columns are shown).* - -:: - - strain virus gisaid_epi_isl genbank_accession date region ... - NewZealand/01/2020 ncov EPI_ISL_413490 ? 2020-02-27 Oceania ... - -`See the reference guide on metadata fields for more details <../reference/metadata-fields.md>`__. - -Required metadata -^^^^^^^^^^^^^^^^^ - -A valid metadata file must include the following fields: - -+------------------------+---------------------------------------------------------------------------------------+-----------------------------+-------------------------------------------------------------------------------------------------------------------------------+ -| Field | Example value | Description | Formatting | -+========================+=======================================================================================+=============================+===============================================================================================================================+ -| ``strain`` or ``name`` | ``Australia/NSW01/2020`` | Sample name / ID | Each header in the fasta file must exactly match a ``strain`` value in the metadata. 
Characters ``()[]{}|#><`` are disallowed | -+------------------------+---------------------------------------------------------------------------------------+-----------------------------+-------------------------------------------------------------------------------------------------------------------------------+ -| ``date`` | ``2020-02-27``, ``2020-02-XX``, ``2020-XX-XX`` | Date of *sampling* | ``YYYY-MM-DD``; ambiguities can be indicated with ``XX`` | -+------------------------+---------------------------------------------------------------------------------------+-----------------------------+-------------------------------------------------------------------------------------------------------------------------------+ -| ``virus`` | ``ncov`` | Pathogen name | Needs to be consistent | -+------------------------+---------------------------------------------------------------------------------------+-----------------------------+-------------------------------------------------------------------------------------------------------------------------------+ -| ``region`` | ``Africa``, ``Asia``, ``Europe``, ``North America``, ``Oceania`` or ``South America`` | Global region of *sampling* | | -+------------------------+---------------------------------------------------------------------------------------+-----------------------------+-------------------------------------------------------------------------------------------------------------------------------+ - -Please be aware that **our current pipeline will filter out any genomes with an unknown date - you can change this in your own pipeline.** - -Missing metadata -^^^^^^^^^^^^^^^^ - -Missing data is to be expected for certain fields. In general, **missing data is represented by an empty string or a question mark character.** There is one important difference: if a discrete trait reconstruction (e.g. 
via ``augur traits``) is to be run on this column, then a value of ``?`` will be inferred, whereas the empty string will be treated as missing data in the output. See below for how to represent uncertainty in sample collection date. - -General formatting tips -^^^^^^^^^^^^^^^^^^^^^^^ - -- **The order of the fields doesn’t matter**; but if you are going to join your metadata with the global collection then it’s easiest to keep them in the same order! -- **Not all fields are currently used**, but this may change in the future. -- Data is **case sensitive** -- The **“geographic” columns, such as “region” and “country” will be used to plot the samples on the map**. Adding a new value to these columns isn’t a problem at all, but there are a few extra steps to take; see the `customization guide <../reference/customizing-analysis.md>`__. -- **You can color by any of these fields in the Auspice visualization**. Which exact columns are used, and which colours are used for each value is completely customisable; see the `customization guide <../reference/customizing-visualization.md>`__. - -Formatting metadata in Excel -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -You can also create a TSV file in Excel. However, due to issues with auto-formatting of certain fields in Excel (like dates), we don’t recommend this as a first option. If you do edit a file in Excel, open it afterwards in a text editor to check it looks as it should. - -1. Create a spreadsheet where each row is a sample, and each column is a metadata field -2. Ensure your spreadsheet meets the requirements outlined above. Pay special attention to date formats; see `this guide to date formatting in Excel `__. -3. Click on ``File > Save as`` -4. 
Choose ``Text (Tab delimited) (*.txt)`` and enter a filename ending in ``.tsv`` - -Curate data from GISAID search and downloads --------------------------------------------- - -The following instructions describe how to curate data for a region-specific analysis (e.g., identifying recent introductions into Washington State) using GISAID’s “Search” page and curated regional data from the “Downloads” window. Inferences about a sample’s origin strongly depend on the composition of your dataset. For example, discrete trait analysis models cannot infer transmission from an origin that is not present in your data. We show how to overcome this issue by adding previously curated contextual sequences from Nextstrain to your region-specific dataset. - -Login to GISAID -~~~~~~~~~~~~~~~ - -Navigate to `GISAID (gisaid.org) `__ and select the “Login” link. - -.. figure:: ../images/gisaid-homepage.png - :alt: GISAID homepage with login link - - GISAID homepage with login link - -Login to your GISAID account. If you do not have an account yet, register for one (it’s free) by selecting the “Registration” link. - -.. figure:: ../images/gisaid-login.png - :alt: GISAID login page with registration link - - GISAID login page with registration link - -Select “EpiCoV” from the top navigation bar. - -.. figure:: ../images/gisaid-navigation-bar.png - :alt: GISAID navigation bar with “EpiCoV” link - - GISAID navigation bar with “EpiCoV” link - -Search for region-specific data -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Select “Search” from the EpiCoV navigation bar. - -.. figure:: ../images/gisaid-epicov-navigation-bar.png - :alt: GISAID EpiCoV navigation bar with “Search” link - - GISAID EpiCoV navigation bar with “Search” link - -Find the “Location” field and start typing “North America /”. As you type, the field will suggest more specific geographic scales. - -.. 
figure:: ../images/gisaid-initial-search-interface.png - :alt: GISAID initial search interface - - GISAID initial search interface - -Finish by typing “North America / USA / Washington”. Select all strains collected between May 1 and June 1 with complete genome sequences and collection dates. Click the checkbox in the header row of the results display, to select all strains that match the search parameters. - -.. figure:: ../images/gisaid-search-results.png - :alt: GISAID search results for “Washington” - - GISAID search results for “Washington” - -.. raw:: html - -
    - -GISAID limits the number of records you can download at once to 5000. If you need to download more records, constrain your search results to smaller windows of time by collection date and download data in these smaller batches. - -.. raw:: html - -
    - -Select the “Download” button in the bottom right of the search results. There are two options to download data from GISAID, both of which we describe below. - -Option 1: Download “Input for the Augur pipeline” -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -From the resulting “Download” window, select “Input for the Augur pipeline” as the download format. - -.. figure:: ../images/gisaid-search-download-window.png - :alt: GISAID search download window showing “Input for the Augur pipeline” option - - GISAID search download window showing “Input for the Augur pipeline” option - -Select the “Download” button and save the resulting file to the ``data/`` directory with a descriptive name like ``gisaid_washington.tar``. This tar archive contains compressed metadata and sequences named like ``1622567829294.metadata.tsv.xz`` and ``1622567829294.sequences.fasta.xz``, respectively. - -You can use this tar file as an input for the Nextstrain workflow, as shown below. The workflow will extract the data for you. Create a new configuration file, ``builds.yaml``, in the top-level of the ``ncov`` directory that defines your analysis or “builds”. - -.. code:: yaml - - # Define inputs for the workflow. - inputs: - - name: washington - # The workflow will detect and extract the metadata and sequences - # from GISAID tar archives. - metadata: data/gisaid_washington.tar - sequences: data/gisaid_washington.tar - -Next, you can move on to the heading below to get contextual data for your region of interest. Alternately, you can extract the tar file into the ``data/`` directory prior to analysis. - -.. code:: bash - - tar xvf data/gisaid_washington.tar - -Rename the extracted files to match the descriptive name of the original archive. - -.. 
code:: bash - - mv data/1622567829294.metadata.tsv.xz data/gisaid_washington_metadata.tsv.xz - mv data/1622567829294.sequences.fasta.xz data/gisaid_washington_sequences.fasta.xz - -You can use these extracted files as inputs for the workflow. - -.. code:: yaml - - # Define inputs for the workflow. - inputs: - - name: washington - # The workflow also accepts compressed metadata and sequences - # from GISAID. - metadata: data/gisaid_washington_metadata.tsv.xz - sequences: data/gisaid_washington_sequences.fasta.xz - -Option 2: Download “Sequences” and “Patient status metadata” -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Alternately, you can download sequences and metadata as two separate uncompressed files. First, select “Sequences (FASTA)” as the download format. Check the box for replacing spaces with underscores. Select the “Download” button and save the resulting file to the ``data/`` directory with a descriptive name like ``gisaid_washington_sequences.fasta``. - -.. figure:: ../images/gisaid-search-download-window-sequences.png - :alt: GISAID search download window showing “Sequences (FASTA)” option - - GISAID search download window showing “Sequences (FASTA)” option - -From the search results interface, select the “Download” button in the bottom right again. Select “Patient status metadata” as the download format. Select the “Download” button and save the file to ``data/`` with a descriptive name like ``gisaid_washington_metadata.tsv``. - -.. figure:: ../images/gisaid-search-download-window-metadata.png - :alt: GISAID search download window showing “Patient status metadata” option - - GISAID search download window showing “Patient status metadata” option - -You can use these files as inputs for the workflow like so. - -.. code:: yaml - - # Define inputs for the workflow. 
- inputs: - - name: washington - metadata: data/gisaid_washington_metadata.tsv - sequences: data/gisaid_washington_sequences.fasta - -Download contextual data for your region of interest -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Next, select the “Downloads” link from the EpiCoV navigation bar. - -.. figure:: ../images/gisaid-epicov-navigation-bar-with-downloads.png - :alt: GISAID EpiCoV navigation bar with “Downloads” link - - GISAID EpiCoV navigation bar with “Downloads” link - -Scroll to the “Genomic epidemiology” section and select the “nextregions” button. - -.. figure:: ../images/gisaid-downloads-window.png - :alt: GISAID downloads window - - GISAID downloads window - -Select the major region that corresponds to your region-specific data above (e.g., “North America”). - -.. figure:: ../images/gisaid-nextregions-download-window.png - :alt: GISAID “nextregions” download window - - GISAID “nextregions” download window - -Agree to the terms and conditions and download the corresponding file (named like ``ncov_north-america.tar.gz``) to the ``data/`` directory. - -.. figure:: ../images/gisaid-nextregions-download-terms-and-conditions.png - :alt: GISAID “nextregions” download terms and conditions - - GISAID “nextregions” download terms and conditions - -This compressed tar archive contains metadata and sequences corresponding to `a recent Nextstrain build for that region `__ with names like ``ncov_north-america.tsv`` and ``ncov_north-america.fasta``, respectively. For example, the “North America” download contains data from `Nextstrain’s North America build `__. These regional Nextstrain builds contain data from a specific region and contextual data from all other regions in the world. By default, GISAID provides these “nextregions” data in the “Input for the Augur pipeline” format. 
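When a region-specific download is combined with contextual data, the same strain can appear in more than one input. A rough Python sketch of keeping only the first record per strain (the real workflow handles de-duplication through its own Snakemake rules; this helper is illustrative only):

```python
import csv
import io

def combine_metadata(*tsv_texts):
    """Concatenate several metadata tables, keeping the first record
    seen for each strain. Illustration of merging multiple inputs;
    not the workflow's actual mechanism."""
    seen = set()
    rows = []
    for text in tsv_texts:
        for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
            if row["strain"] not in seen:
                seen.add(row["strain"])
                rows.append(row)
    return rows

# Toy inputs standing in for a regional download and a "nextregions" table.
washington = "strain\tdate\nUSA/WA1/2020\t2020-05-02"
nextregions = "strain\tdate\nUSA/WA1/2020\t2020-05-02\nCanada/ON1/2020\t2020-05-10"
combined = combine_metadata(washington, nextregions)
assert [row["strain"] for row in combined] == ["USA/WA1/2020", "Canada/ON1/2020"]
```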
- -As with the tar archive from the search results above, you can use the “nextregions” compressed tar archives as input to the Nextstrain workflow and the workflow will extract the appropriate contents for you. For example, you could update your ``inputs`` in the ``builds.yaml`` file from above to include the North American data as follows. - -.. code:: yaml - - # Define inputs for the workflow. - inputs: - - name: washington - # The workflow will detect and extract the metadata and sequences - # from GISAID tar archives. - metadata: data/gisaid_washington.tar - sequences: data/gisaid_washington.tar - - name: north-america - # The workflow will similarly detect and extract metadata and - # sequences from compressed tar archives. - metadata: data/ncov_north-america.tar.gz - sequences: data/ncov_north-america.tar.gz - -Alternately, you can extract the data from the compressed tar archive into the ``data/`` directory. - -.. code:: bash - - tar zxvf data/ncov_north-america.tar.gz - -You can use these extracted files as inputs for the workflow. - -.. code:: yaml - - # Define inputs for the workflow. - inputs: - - name: washington - # The workflow will detect and extract the metadata and sequences - # from GISAID tar archives. - metadata: data/gisaid_washington.tar - sequences: data/gisaid_washington.tar - - name: north-america - # The workflow supports uncompressed or compressed input files. - metadata: data/ncov_north-america.tsv - sequences: data/ncov_north-america.fasta - -By default, the workflow will use all distinct sequences to create a phylogeny without any subsampling. You now have all of the data you need to run your analysis and can `continue to the next section of the tutorial <../reference/orientation-workflow.md>`__. - -Curate data from the full GISAID database ------------------------------------------ - -Some analyses require custom subsampling of the full GISAID database to most effectively understand SARS-CoV-2 evolution. 
For example, analyses that investigate specific variants or transmission patterns within localized outbreaks benefit from customized contextual data. These specific searches can easily exceed the 5000-record download limit from GISAID’s search interface and the diversity of data available in the Nextstrain “nextregions” downloads. - -The following instructions describe how to curate data for a region-specific analysis using the full GISAID sequence and metadata files. As with `the curation process described above <#curate-data-from-gisaid-search-and-downloads>`__, we describe how to select contextual data from the rest of the world to improve estimates of introductions to your region. This type of analysis also provides a path to selecting contextual data that are as genetically similar as possible to your region’s data. - -In this example, we will select the following subsets of GISAID data: - -1. all data from Washington State in the last two months -2. a random sample of data from North America (excluding Washington) in the last two months -3. a random sample of data from outside North America in the last six months - -Download all SARS-CoV-2 metadata and sequences from GISAID -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The following instructions assume you have already registered for a free GISAID account, logged into that account, and selected the “EpiCoV” link from the navigation bar, `as described above <#login-to-gisaid>`__. Select the “Downloads” link from the EpiCoV navigation bar. - -.. figure:: ../images/gisaid-epicov-navigation-bar-with-downloads.png - :alt: GISAID EpiCoV navigation bar with “Downloads” link - - GISAID EpiCoV navigation bar with “Downloads” link - -Find the “Download packages” section and select the “FASTA” button. - -.. 
figure:: ../images/gisaid-download-packages-window.png - :alt: GISAID download window with the “Download packages” sections - - GISAID download window with the “Download packages” sections - -Agree to the terms and conditions and download the corresponding file (named like ``sequences_fasta_2021_06_01.tar.xz``) to the ``data/`` directory. Next, select the “metadata” button from that same “Download packages” section and download the corresponding file (named like ``metadata_tsv_2021_06_01.tar.xz``) to the ``data/`` directory. - -.. raw:: html - -

    - -If “FASTA” or “metadata” options do not appear in the “Download packages” window, use the “Contact” link in the top-right of the GISAID website to request access to these files. - -.. raw:: html - -

    - -`We use these data in our official Nextstrain builds `__. If you have sufficient computing resources, you can use these files as ``inputs`` for the workflow in a ``builds.yaml`` like the one described above. However, the workflow starts by aligning all input sequences to a reference and this alignment can take hours to complete even with multiple cores. As an alternative, we show how to select specific data from these large files prior to starting the workflow. - -Prepare GISAID data for Augur -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Nextstrain’s bioinformatics toolkit, `Augur `__, does not support GISAID’s default formatting (e.g., spaces are not allowed in sequence ids, additional metadata in the FASTA defline is unnecessary, “hCoV-19/” prefixes are not consistently used across all databases, composite “location” fields in the metadata are not tab-delimited, etc.). As a result, the workflow includes tools to prepare GISAID data for processing by Augur. - -First, prepare the sequence data. This step strips prefixes from strain ids in sequence records, removes whitespace from the strain ids, removes additional metadata in the FASTA defline, and removes duplicate sequences present for the same strain id. - -.. code:: bash - - python3 scripts/sanitize_sequences.py \ - --sequences data/sequences_fasta_2021_06_01.tar.xz \ - --strip-prefixes "hCoV-19/" \ - --output data/sequences_gisaid.fasta.gz - -To speed up filtering steps later on, index the sequences with Augur. This command creates a tab-delimited file describing the composition of each sequence. - -.. code:: bash - - augur index \ - --sequences data/sequences_gisaid.fasta.gz \ - --output data/sequence_index_gisaid.tsv.gz - -Next, prepare the metadata. 
This step resolves duplicate records for the same strain name using GISAID’s ``Accession ID`` field (keeping the record with the latest id), parses the composite ``Location`` field into ``region``, ``country``, ``division``, and ``location`` fields, renames special fields to names Augur expects, and strips prefixes from strain names to match the sequence data above. - -.. code:: bash - - python3 scripts/sanitize_metadata.py \ - --metadata data/metadata_tsv_2021_06_01.tar.xz \ - --database-id-columns "Accession ID" \ - --parse-location-field Location \ - --rename-fields 'Virus name=strain' 'Accession ID=gisaid_epi_isl' 'Collection date=date' \ - --strip-prefixes "hCoV-19/" \ - --output data/metadata_gisaid.tsv.gz - -Select region-specific data -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Select data corresponding to your region of interest. In this example, we select strains from Washington State collected between April 1 and June 1, 2021. The ``--query`` argument of the ``augur filter`` command supports `any valid pandas-style queries on the metadata as a data frame `__. - -.. code:: bash - - augur filter \ - --metadata data/metadata_gisaid.tsv.gz \ - --query "(country == 'USA') & (division == 'Washington')" \ - --min-date 2021-04-01 \ - --max-date 2021-06-01 \ - --exclude-ambiguous-dates-by any \ - --output-strains strains_washington.txt - -The output is a text file with a list of strains that match the given filters with one name per line. As of June 1, 2021, the corresponding output contains 8,193 strains. - -Select contextual data for your region of interest -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Select a random sample of recent data from your region’s continent. In this example, we will randomly sample 1,000 strains collected between April 1 and June 1, 2021 from North American data, excluding data we’ve already selected from Washington. - -.. 
code:: bash - - augur filter \ - --metadata data/metadata_gisaid.tsv.gz \ - --query "(region == 'North America') & (division != 'Washington')" \ - --min-date 2021-04-01 \ - --max-date 2021-06-01 \ - --exclude-ambiguous-dates-by any \ - --subsample-max-sequences 1000 \ - --output-strains strains_north-america.txt - -Select a random sample of recent data from the rest of the world. Here, we will randomly sample 1,000 strains collected between December 1, 2020 and June 1, 2021 from all continents except North America. To evenly sample all regions through time, we also group data by region, year, and month and sample evenly from these groups. - -.. code:: bash - - augur filter \ - --metadata data/metadata_gisaid.tsv.gz \ - --query "region != 'North America'" \ - --min-date 2020-12-01 \ - --max-date 2021-06-01 \ - --exclude-ambiguous-dates-by any \ - --subsample-max-sequences 1000 \ - --group-by region year month \ - --output-strains strains_global.txt - -Extract metadata and sequences for selected strains -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Now that you’ve selected a subset of strains from the full GISAID database, extract the corresponding metadata and sequences to use as inputs for the Nextstrain workflow. - -.. code:: bash - - augur filter \ - --metadata data/metadata_gisaid.tsv.gz \ - --sequence-index data/sequence_index_gisaid.tsv.gz \ - --sequences data/sequences_gisaid.fasta.gz \ - --exclude-all \ - --include strains_washington.txt strains_north-america.txt strains_global.txt \ - --output-metadata data/subsampled_metadata_gisaid.tsv.gz \ - --output-sequences data/subsampled_sequences_gisaid.fasta.gz - -You can use these extracted files as inputs for the workflow. - -.. code:: yaml - - # Define inputs for the workflow. 
- inputs: - - name: subsampled-gisaid - metadata: data/subsampled_metadata_gisaid.tsv.gz - sequences: data/subsampled_sequences_gisaid.fasta.gz - -Subsampling ------------ - -We’ve outlined several methods for subsampling, including builds with a focus area and genetically similar contextual sequences, in the `section on customizing your analysis <../reference/customizing-analysis#subsampling>`__. diff --git a/docs/src/guides/data-prep/gisaid-full.rst b/docs/src/guides/data-prep/gisaid-full.rst new file mode 100644 index 000000000..7953401a2 --- /dev/null +++ b/docs/src/guides/data-prep/gisaid-full.rst @@ -0,0 +1,147 @@ +Curate data from the full GISAID database +========================================= + +Some analyses require custom subsampling of the full GISAID database to most effectively understand SARS-CoV-2 evolution. For example, analyses that investigate specific variants or transmission patterns within localized outbreaks benefit from customized contextual data. These specific searches can easily exceed the 5000-record download limit from GISAID's search interface and the diversity of data available in the Nextstrain “nextregions” downloads. + +The following instructions describe how to curate data for a region-specific analysis using the full GISAID sequence and metadata files. As with `the curation process described above <#curate-data-from-gisaid-search-and-downloads>`__, we describe how to select contextual data from the rest of the world to improve estimates of introductions to your region. This type of analysis also provides a path to selecting contextual data that are as genetically similar as possible to your region's data. + +In this example, we will select the following subsets of GISAID data: + +1. all data from Washington State in the last two months +2. a random sample of data from North America (excluding Washington) in the last two months +3. a random sample of data from outside North America in the last six months + +.. 
contents:: Table of Contents
+   :local:
+
+Download all SARS-CoV-2 metadata and sequences from GISAID
+----------------------------------------------------------
+
+The following instructions assume you have already registered for a free GISAID account, logged into that account, and selected the “EpiCoV” link from the navigation bar, `as described above <#login-to-gisaid>`__. Select the “Downloads” link from the EpiCoV navigation bar.
+
+.. figure:: ../../images/gisaid-epicov-navigation-bar-with-downloads.png
+   :alt: GISAID EpiCoV navigation bar with “Downloads” link
+
+   GISAID EpiCoV navigation bar with “Downloads” link
+
+Find the “Download packages” section and select the “FASTA” button.
+
+.. figure:: ../../images/gisaid-download-packages-window.png
+   :alt: GISAID download window with the “Download packages” sections
+
+   GISAID download window with the “Download packages” sections
+
+Agree to the terms and conditions and download the corresponding file (named like ``sequences_fasta_2021_06_01.tar.xz``) to the ``data/`` directory. Next, select the “metadata” button from that same “Download packages” section and download the corresponding file (named like ``metadata_tsv_2021_06_01.tar.xz``) to the ``data/`` directory.
+
+.. warning::
+
+   If “FASTA” or “metadata” options do not appear in the “Download packages” window, use the “Contact” link in the top-right of the GISAID website to request access to these files.
+
+`We use these data in our official Nextstrain builds `__. If you have sufficient computing resources, you can use these files as ``inputs`` for the workflow in the workflow config file like the one described above. However, the workflow starts by aligning all input sequences to a reference and this alignment can take hours to complete even with multiple cores. As an alternative, we show how to select specific data from these large files prior to starting the workflow.
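Before pointing the workflow at these multi-gigabyte packages, it can be useful to estimate their size in sequences. The sketch below substitutes a tiny stand-in archive for the real download; with actual data you would run only the final command against a file like ``data/sequences_fasta_2021_06_01.tar.xz``.

```shell
# Build a tiny stand-in sequence package, then count FASTA records by
# streaming the archive to stdout (-O) instead of extracting to disk.
mkdir -p demo
printf '>hCoV-19/one\nACGT\n>hCoV-19/two\nACGT\n' > demo/sequences.fasta
tar -cJf demo/sequences_fasta.tar.xz -C demo sequences.fasta
tar -xJOf demo/sequences_fasta.tar.xz | grep -c '^>'
```

On the stand-in archive this prints ``2``; on a real package it reports the total number of sequence records, a quick proxy for how long alignment would take.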
+ +Prepare GISAID data for Augur +----------------------------- + +Nextstrain's bioinformatics toolkit, :term:`docs.nextstrain.org:Augur`, does not support GISAID's default formatting (e.g., spaces are not allowed in sequence ids, additional metadata in the FASTA defline is unnecessary, “hCoV-19/” prefixes are not consistently used across all databases, composite “location” fields in the metadata are not tab-delimited, etc.). As a result, the workflow includes tools to prepare GISAID data for processing by Augur. + +First, prepare the sequence data. This step strips prefixes from strain ids in sequence records, removes whitespace from the strain ids, removes additional metadata in the FASTA defline, and removes duplicate sequences present for the same strain id. + +.. code:: bash + + python3 scripts/sanitize_sequences.py \ + --sequences data/sequences_fasta_2021_06_01.tar.xz \ + --strip-prefixes "hCoV-19/" \ + --output data/sequences_gisaid.fasta.gz + +To speed up filtering steps later on, index the sequences with Augur. This command creates a tab-delimited file describing the composition of each sequence. + +.. code:: bash + + augur index \ + --sequences data/sequences_gisaid.fasta.gz \ + --output data/sequence_index_gisaid.tsv.gz + +Next, prepare the metadata. This step resolves duplicate records for the same strain name using GISAID's ``Accession ID`` field (keeping the record with the latest id), parses the composite ``Location`` field into ``region``, ``country``, ``division``, and ``location`` fields, renames special fields to names Augur expects, and strips prefixes from strain names to match the sequence data above. + +.. 
code:: bash + + python3 scripts/sanitize_metadata.py \ + --metadata data/metadata_tsv_2021_06_01.tar.xz \ + --database-id-columns "Accession ID" \ + --parse-location-field Location \ + --rename-fields 'Virus name=strain' 'Accession ID=gisaid_epi_isl' 'Collection date=date' \ + --strip-prefixes "hCoV-19/" \ + --output data/metadata_gisaid.tsv.gz + +Select region-specific data +--------------------------- + +Select data corresponding to your region of interest. In this example, we select strains from Washington State collected between April 1 and June 1, 2021. The ``--query`` argument of the ``augur filter`` command supports `any valid pandas-style queries on the metadata as a data frame `__. + +.. code:: bash + + augur filter \ + --metadata data/metadata_gisaid.tsv.gz \ + --query "(country == 'USA') & (division == 'Washington')" \ + --min-date 2021-04-01 \ + --max-date 2021-06-01 \ + --exclude-ambiguous-dates-by any \ + --output-strains strains_washington.txt + +The output is a text file with a list of strains that match the given filters with one name per line. As of June 1, 2021, the corresponding output contains 8,193 strains. + +Select contextual data for your region of interest +-------------------------------------------------- + +Select a random sample of recent data from your region's continent. In this example, we will randomly sample 1,000 strains collected between April 1 and June 1, 2021 from North American data, excluding data we've already selected from Washington. + +.. code:: bash + + augur filter \ + --metadata data/metadata_gisaid.tsv.gz \ + --query "(region == 'North America') & (division != 'Washington')" \ + --min-date 2021-04-01 \ + --max-date 2021-06-01 \ + --exclude-ambiguous-dates-by any \ + --subsample-max-sequences 1000 \ + --output-strains strains_north-america.txt + +Select a random sample of recent data from the rest of the world. 
Here, we will randomly sample 1,000 strains collected between December 1, 2020 and June 1, 2021 from all continents except North America. To evenly sample all regions through time, we also group data by region, year, and month and sample evenly from these groups. + +.. code:: bash + + augur filter \ + --metadata data/metadata_gisaid.tsv.gz \ + --query "region != 'North America'" \ + --min-date 2020-12-01 \ + --max-date 2021-06-01 \ + --exclude-ambiguous-dates-by any \ + --subsample-max-sequences 1000 \ + --group-by region year month \ + --output-strains strains_global.txt + +Extract metadata and sequences for selected strains +--------------------------------------------------- + +Now that you've selected a subset of strains from the full GISAID database, extract the corresponding metadata and sequences to use as inputs for the Nextstrain workflow. + +.. code:: bash + + augur filter \ + --metadata data/metadata_gisaid.tsv.gz \ + --sequence-index data/sequence_index_gisaid.tsv.gz \ + --sequences data/sequences_gisaid.fasta.gz \ + --exclude-all \ + --include strains_washington.txt strains_north-america.txt strains_global.txt \ + --output-metadata data/subsampled_metadata_gisaid.tsv.gz \ + --output-sequences data/subsampled_sequences_gisaid.fasta.gz + +You can use these extracted files as inputs for the workflow. + +.. code:: yaml + + # Define inputs for the workflow. 
+ inputs: + - name: subsampled-gisaid + metadata: data/subsampled_metadata_gisaid.tsv.gz + sequences: data/subsampled_sequences_gisaid.fasta.gz diff --git a/docs/src/guides/data-prep/gisaid-search.rst b/docs/src/guides/data-prep/gisaid-search.rst new file mode 100644 index 000000000..690fe5e80 --- /dev/null +++ b/docs/src/guides/data-prep/gisaid-search.rst @@ -0,0 +1,209 @@ +Curate data from GISAID search and downloads +============================================ + +The following instructions describe how to curate data for a region-specific analysis (e.g., identifying recent introductions into Washington State) using GISAID's “Search” page and curated regional data from the “Downloads” window. Inferences about a sample's origin strongly depend on the composition of your dataset. For example, discrete trait analysis models cannot infer transmission from an origin that is not present in your data. We show how to overcome this issue by adding previously curated contextual sequences from Nextstrain to your region-specific dataset. + +.. contents:: Table of Contents + :local: + +Login to GISAID +--------------- + +Navigate to `GISAID (gisaid.org) `__ and select the “Login” link. + +.. figure:: ../../images/gisaid-homepage.png + :alt: GISAID homepage with login link + + GISAID homepage with login link + +Login to your GISAID account. If you do not have an account yet, register for one (it's free) by selecting the “Registration” link. + +.. figure:: ../../images/gisaid-login.png + :alt: GISAID login page with registration link + + GISAID login page with registration link + +Select “EpiCoV” from the top navigation bar. + +.. figure:: ../../images/gisaid-navigation-bar.png + :alt: GISAID navigation bar with “EpiCoV” link + + GISAID navigation bar with “EpiCoV” link + +Search for region-specific data +------------------------------- + +Select “Search” from the EpiCoV navigation bar. + +.. 
figure:: ../../images/gisaid-epicov-navigation-bar.png + :alt: GISAID EpiCoV navigation bar with “Search” link + + GISAID EpiCoV navigation bar with “Search” link + +Find the “Location” field and start typing “North America /”. As you type, the field will suggest more specific geographic scales. + +.. figure:: ../../images/gisaid-initial-search-interface.png + :alt: GISAID initial search interface + + GISAID initial search interface + +Finish by typing “North America / USA / Washington”. Select all strains collected between May 1 and June 1 with complete genome sequences and collection dates. Click the checkbox in the header row of the results display, to select all strains that match the search parameters. + +.. figure:: ../../images/gisaid-search-results.png + :alt: GISAID search results for “Washington” + + GISAID search results for “Washington” + +.. warning:: + + GISAID limits the number of records you can download at once to 5000. If you need to download more records, constrain your search results to smaller windows of time by collection date and download data in these smaller batches. + +Select the “Download” button in the bottom right of the search results. There are two options to download data from GISAID, both of which we describe below. + +Option 1: Download “Input for the Augur pipeline” +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +From the resulting “Download” window, select “Input for the Augur pipeline” as the download format. + +.. figure:: ../../images/gisaid-search-download-window.png + :alt: GISAID search download window showing “Input for the Augur pipeline” option + + GISAID search download window showing “Input for the Augur pipeline” option + +Select the “Download” button and save the resulting file to the ``data/`` directory with a descriptive name like ``gisaid_washington.tar``. 
This tar archive contains compressed metadata and sequences named like ``1622567829294.metadata.tsv.xz`` and ``1622567829294.sequences.fasta.xz``, respectively.
+
+You can use this tar file as an input for the Nextstrain workflow, as shown below. The workflow will extract the data for you. Create a new workflow config file in the top level of the ``ncov`` directory that defines your analysis or “builds”.
+
+.. code:: yaml
+
+   # Define inputs for the workflow.
+   inputs:
+     - name: washington
+       # The workflow will detect and extract the metadata and sequences
+       # from GISAID tar archives.
+       metadata: data/gisaid_washington.tar
+       sequences: data/gisaid_washington.tar
+
+Next, you can move on to the heading below to get contextual data for your region of interest. Alternately, you can extract the tar file into the ``data/`` directory prior to analysis.
+
+.. code:: bash
+
+   tar xvf data/gisaid_washington.tar
+
+Rename the extracted files to match the descriptive name of the original archive.
+
+.. code:: bash
+
+   mv data/1622567829294.metadata.tsv.xz data/gisaid_washington_metadata.tsv.xz
+   mv data/1622567829294.sequences.fasta.xz data/gisaid_washington_sequences.fasta.xz
+
+You can use these extracted files as inputs for the workflow.
+
+.. code:: yaml
+
+   # Define inputs for the workflow.
+   inputs:
+     - name: washington
+       # The workflow also accepts compressed metadata and sequences
+       # from GISAID.
+       metadata: data/gisaid_washington_metadata.tsv.xz
+       sequences: data/gisaid_washington_sequences.fasta.xz
+
+Option 2: Download “Sequences” and “Patient status metadata”
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Alternately, you can download sequences and metadata as two separate uncompressed files. First, select “Sequences (FASTA)” as the download format. Check the box for replacing spaces with underscores.
Select the “Download” button and save the resulting file to the ``data/`` directory with a descriptive name like ``gisaid_washington_sequences.fasta``. + +.. figure:: ../../images/gisaid-search-download-window-sequences.png + :alt: GISAID search download window showing “Sequences (FASTA)” option + + GISAID search download window showing “Sequences (FASTA)” option + +From the search results interface, select the “Download” button in the bottom right again. Select “Patient status metadata” as the download format. Select the “Download” button and save the file to ``data/`` with a descriptive name like ``gisaid_washington_metadata.tsv``. + +.. figure:: ../../images/gisaid-search-download-window-metadata.png + :alt: GISAID search download window showing “Patient status metadata” option + + GISAID search download window showing “Patient status metadata” option + +You can use these files as inputs for the workflow like so. + +.. code:: yaml + + # Define inputs for the workflow. + inputs: + - name: washington + metadata: data/gisaid_washington_metadata.tsv + sequences: data/gisaid_washington_sequences.fasta + +Download contextual data for your region of interest +---------------------------------------------------- + +Next, select the “Downloads” link from the EpiCoV navigation bar. + +.. figure:: ../../images/gisaid-epicov-navigation-bar-with-downloads.png + :alt: GISAID EpiCoV navigation bar with “Downloads” link + + GISAID EpiCoV navigation bar with “Downloads” link + +Scroll to the “Genomic epidemiology” section and select the “nextregions” button. + +.. figure:: ../../images/gisaid-downloads-window.png + :alt: GISAID downloads window + + GISAID downloads window + +Select the major region that corresponds to your region-specific data above (e.g., “North America”). + +.. 
figure:: ../../images/gisaid-nextregions-download-window.png + :alt: GISAID “nextregions” download window + + GISAID “nextregions” download window + +Agree to the terms and conditions and download the corresponding file (named like ``ncov_north-america.tar.gz``) to the ``data/`` directory. + +.. figure:: ../../images/gisaid-nextregions-download-terms-and-conditions.png + :alt: GISAID “nextregions” download terms and conditions + + GISAID “nextregions” download terms and conditions + +This compressed tar archive contains metadata and sequences corresponding to `a recent Nextstrain build for that region `__ with names like ``ncov_north-america.tsv`` and ``ncov_north-america.fasta``, respectively. For example, the “North America” download contains data from `Nextstrain's North America build `__. These regional Nextstrain builds contain data from a specific region and contextual data from all other regions in the world. By default, GISAID provides these “nextregions” data in the “Input for the Augur pipeline” format. + +As with the tar archive from the search results above, you can use the “nextregions” compressed tar archives as input to the Nextstrain workflow and the workflow will extract the appropriate contents for you. For example, you could update your ``inputs`` in the workflow config file from above to include the North American data as follows. + +.. code:: yaml + + # Define inputs for the workflow. + inputs: + - name: washington + # The workflow will detect and extract the metadata and sequences + # from GISAID tar archives. + metadata: data/gisaid_washington.tar + sequences: data/gisaid_washington.tar + - name: north-america + # The workflow will similarly detect and extract metadata and + # sequences from compressed tar archives. + metadata: data/ncov_north-america.tar.gz + sequences: data/ncov_north-america.tar.gz + +Alternately, you can extract the data from the compressed tar archive into the ``data/`` directory. + +.. 
code:: bash + + tar zxvf data/ncov_north-america.tar.gz + +You can use these extracted files as inputs for the workflow. + +.. code:: yaml + + # Define inputs for the workflow. + inputs: + - name: washington + # The workflow will detect and extract the metadata and sequences + # from GISAID tar archives. + metadata: data/gisaid_washington.tar + sequences: data/gisaid_washington.tar + - name: north-america + # The workflow supports uncompressed or compressed input files. + metadata: data/ncov_north-america.tsv + sequences: data/ncov_north-america.fasta diff --git a/docs/src/guides/data-prep/index.rst b/docs/src/guides/data-prep/index.rst new file mode 100644 index 000000000..49769def2 --- /dev/null +++ b/docs/src/guides/data-prep/index.rst @@ -0,0 +1,22 @@ +*************** +Data Prep Guide +*************** + +Preparing your data +=================== + +To use Nextstrain to analyze your own data, you'll need to prepare two files: + +1. A ``fasta`` file with viral genomic sequences +2. A corresponding ``tsv`` file with metadata describing each sequence + +We describe the following ways to prepare data for a SARS-CoV-2 analysis: + +.. toctree:: + :maxdepth: 1 + :titlesonly: + + local-data + gisaid-search + gisaid-full + diff --git a/docs/src/guides/data-prep/local-data.rst b/docs/src/guides/data-prep/local-data.rst new file mode 100644 index 000000000..f74b1771f --- /dev/null +++ b/docs/src/guides/data-prep/local-data.rst @@ -0,0 +1,84 @@ +Prepare your own local data +=========================== + +Prepare your own local data for quality control prior to submission to a public database. + +.. contents:: Table of Contents + :local: + +Formatting your sequence data +----------------------------- + +The first 2 lines in ``data/sequences.fasta`` look like this: + +:: + + >Wuhan-Hu-1/2019 + ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATC..... 
+
+The first line is the ``strain`` or ``name`` of the sequence. Lines with names in FASTA files always start with the ``>`` character (this is not part of the name), and may not contain spaces or ``()[]{}|#><``. Note that “strain” here carries no biological or functional significance and should largely be thought of as synonymous with “sample.”
+
+The sequence itself is a `consensus genome `__.
+
+By default, sequences less than 27,000 bases in length or with more than 3,000 ``N`` (unknown) bases are omitted from the analysis. For a basic QC and preliminary analysis of your sequence data, you can use `clades.nextstrain.org `__. This tool will check your sequences for excess divergence, clustered differences from the reference, and missing or ambiguous data. In addition, it will assign Nextstrain clades and call mutations relative to the reference.
+
+Formatting your metadata
+------------------------
+
+Nextstrain accommodates many kinds of metadata, so long as it is in a ``TSV`` format. A ``TSV`` is a text file, where each row (line) represents a sample and each column (separated by tabs) represents a field.
+
+.. note::
+
+   If you're unfamiliar with TSV files, don't fret; it's straightforward to export these directly from Excel, which we'll cover shortly.
+
+Here's an example of the first few columns of the metadata for a single strain, including the header row. *(Spacing between columns here is adjusted for clarity, and only the first 6 columns are shown.)*
+
+::
+
+   strain              virus  gisaid_epi_isl  genbank_accession  date        region   ...
+   NewZealand/01/2020  ncov   EPI_ISL_413490  ?                  2020-02-27  Oceania  ...
+
+:doc:`See the reference guide on metadata fields for more details <../../reference/metadata-fields>`.
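A quick way to check a metadata file against this format is to print its header fields one per line. The sketch below uses a two-row stand-in file; with your own data, point ``head`` at your real TSV instead.

```shell
# Create a tiny stand-in metadata TSV, then print its header fields one
# per line so the columns are easy to eyeball.
mkdir -p demo
printf 'strain\tvirus\tdate\tregion\n' > demo/example_metadata.tsv
printf 'NewZealand/01/2020\tncov\t2020-02-27\tOceania\n' >> demo/example_metadata.tsv
head -n 1 demo/example_metadata.tsv | tr '\t' '\n'
```

This prints ``strain``, ``virus``, ``date``, and ``region`` on separate lines, making a missing or misspelled column obvious at a glance.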
+
+Required metadata
+~~~~~~~~~~~~~~~~~
+
+A valid metadata file must include the following fields:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Field
+     - Example value
+     - Description
+     - Formatting
+   * - ``strain`` or ``name``
+     - ``Australia/NSW01/2020``
+     - Sample name / ID
+     - Each header in the fasta file must exactly match a ``strain`` value in the metadata. Characters ``()[]{}|#><`` are disallowed
+   * - ``date``
+     - ``2020-02-27``, ``2020-02-XX``, ``2020-XX-XX``
+     - Date of *sampling*
+     - ``YYYY-MM-DD``; ambiguities can be indicated with ``XX``
+   * - ``virus``
+     - ``ncov``
+     - Pathogen name
+     - Needs to be consistent
+   * - ``region``
+     - ``Africa``, ``Asia``, ``Europe``, ``North America``, ``Oceania`` or ``South America``
+     - Global region of *sampling*
+     -
+
+Please be aware that **our current workflow will filter out any genomes with an unknown date - you can change this in your own workflow.**
+
+Missing metadata
+~~~~~~~~~~~~~~~~
+
+Missing data is to be expected for certain fields. In general, **missing data is represented by an empty string or a question mark character.** There is one important difference: if a discrete trait reconstruction (e.g. via ``augur traits``) is to be run on this column, then a value of ``?`` will be inferred, whereas the empty string will be treated as missing data in the output. See below for how to represent uncertainty in sample collection date.
+
+General formatting tips
+~~~~~~~~~~~~~~~~~~~~~~~
+
+- **The order of the fields doesn't matter**; but if you are going to join your metadata with the global collection then it's easiest to keep them in the same order!
+- **Not all fields are currently used**, but this may change in the future.
+- Data is **case sensitive**.
+- The **“geographic” columns, such as “region” and “country” will be used to plot the samples on the map**. Adding a new value to these columns isn't a problem at all, but there are a few extra steps to take; see :doc:`../workflow-config-file`.
+- **You can color by any of these fields in the Auspice visualization**. Which exact columns are used, and which colors are used for each value is completely customizable; see :doc:`../customizing-visualization`.
+
+Formatting metadata in Excel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can also create a TSV file in Excel.
However, due to issues with auto-formatting of certain fields in Excel (like dates), we don't recommend this as a first option. If you do edit a file in Excel, open it afterwards in a text editor to check it looks as it should. + +1. Create a spreadsheet where each row is a sample, and each column is a metadata field +2. Ensure your spreadsheet meets the requirements outlined above. Pay special attention to date formats; see `this guide to date formatting in Excel `__. +3. Click on ``File > Save as`` +4. Choose ``Text (Tab delimited) (*.txt)`` and enter a filename ending in ``.tsv`` diff --git a/docs/src/guides/index.rst b/docs/src/guides/index.rst deleted file mode 100644 index e56e74e5b..000000000 --- a/docs/src/guides/index.rst +++ /dev/null @@ -1,11 +0,0 @@ -****************** -Guides -****************** - -.. toctree:: - :maxdepth: 1 - :titlesonly: - :caption: Table of contents - - run-analysis-on-terra - data-prep diff --git a/docs/src/guides/update-workflow.rst b/docs/src/guides/update-workflow.rst new file mode 100644 index 000000000..27365acbe --- /dev/null +++ b/docs/src/guides/update-workflow.rst @@ -0,0 +1,35 @@ +Update the workflow +=================== + +We update the official workflow regularly with: + +- `curated metadata including latitudes/longitudes, clade annotations, and low quality sequences `__ +- bug fixes +- `new features <../reference/change_log>`__ + +Update your local copy of the workflow, to benefit from these changes. + +.. code:: bash + + # Download and apply changes from the Nextstrain team. + # This only works if there is no conflict with your local repository. + git pull --ff-only origin master + + # OR: + + # Alternately, download and apply changes from the Nextstrain team + # and then replay your local changes on top of those incoming changes. + git pull --rebase origin master + +Alternately, download a specific version of the workflow that you know works for you. 
We create new `releases of the workflow `__ any time we introduce breaking changes, so you can choose when to update based on `what has changed <../reference/change_log>`__. + +.. code:: bash + + # Download version 7 (v7) of the workflow. + curl -OL https://github.com/nextstrain/ncov/archive/refs/tags/v7.zip + + # Uncompress the workflow. + unzip v7.zip + + # Change into the workflow's directory. + cd ncov-7/ diff --git a/docs/src/reference/customizing-analysis.rst b/docs/src/guides/workflow-config-file.rst similarity index 56% rename from docs/src/reference/customizing-analysis.rst rename to docs/src/guides/workflow-config-file.rst index 78711b68b..4caa0cffb 100644 --- a/docs/src/reference/customizing-analysis.rst +++ b/docs/src/guides/workflow-config-file.rst @@ -1,40 +1,50 @@ -Customizing analysis -==================== +Workflow config file guide +========================== + +This is a guide for common use cases of a :term:`workflow config file `. For a detailed reference, see :doc:`../reference/workflow-config-file`. + +.. contents:: Table of Contents + :local: Changing parameters ------------------- -You can configure most steps of `the workflow `__ by specifying values in a ``.yaml`` configuration file. We’ve provided reasonable default values for each step in the ``defaults/parameters.yaml``; these are the same values the Nextstrain team uses for our analyses. For more details, `see the reference for all workflow configuration parameters `__. +You can configure most steps of the workflow by specifying values in a workflow config file. We've provided reasonable defaults in ``defaults/parameters.yaml``; these are the same values the Nextstrain team uses for our analyses. -We encourage you to take a few minutes to **skim through**\ `the default config file `__\ **. 
Although these default values should be fine for most users, it’s helpful to get a sense for what options are available.** +We encourage you to take a few minutes to skim through `the default workflow config file `__. Although these default values should be fine for most users, it's helpful to get a sense for what options are available. -If you’d like to tweak the parameterization, **you can override any of these values by specifying them in the ``my_profiles//builds.yaml`` file. Any values not overridden in this way will fall back to the default values.** Keeping build-specific parameters separate this way prevents mixups of settings between runs, and gives you a cleaner file to work with (rather than having to wrestle the *entire* default parameterization file). +If you'd like to tweak the parameterization, **you can override any of these values by specifying them in the workflow config file. Any values not overridden in this way will fall back to the default values.** Keeping build-specific parameters separate this way prevents mixups of settings between runs, and gives you a cleaner file to work with (rather than having to wrestle the *entire* default workflow config file). Adding custom rules ------------------- -Insert your own custom Snakemake rules into the default workflow without modifying the main Snakemake files, by defining a list of ``custom_rules`` in your ``builds.yaml`` file. Each entry in the ``custom_rules`` list should be a path to a valid Snakemake file (e.g., “my_rules.smk”). The main workflow will detect these custom rules and include them after all other rules have been defined. +Insert your own custom Snakemake rules into the workflow without modifying the existing Snakemake files, by defining a list of ``custom_rules`` in your workflow config file. Each entry in the ``custom_rules`` list should be a path to a valid Snakemake file (e.g., ``my_rules.smk``). 
The workflow will detect these custom rules and include them after all other rules have been defined. -As an example, the Nextstrain team’s workflow defines custom export rules that modify the default auspice JSONs. These rules are defined in the ``builds.yaml`` file as follows: +To modify rules, create a new :term:`customization file` named ``my_rules.smk`` and add the ``custom_rules`` section in a workflow config file: .. code:: yaml custom_rules: - - workflow/snakemake_rules/export_for_nextstrain.smk + - my-ncov-analyses/my_rules.smk -To modify rules for the example profile, create a new file named ``my_profiles/example/my_rules.smk`` and modify the ``builds.yaml`` file for the example profile to include the following lines: +As an example, the Nextstrain team's workflow defines custom export rules that modify the default Auspice JSONs. These rules are defined in the workflow config file as follows: .. code:: yaml custom_rules: - - my_profiles/example/my_rules.smk + - workflow/snakemake_rules/export_for_nextstrain.smk Adding a new place ------------------ -Places are defined as one of: - ``region`` (e.g., ``North America``, ``Asia``) - ``country`` - ``division`` (i.e., state, province, or canton) - ``location`` (i.e., a county or city within a division) +Places are defined as one of: + +- ``region`` (e.g., ``North America``, ``Asia``) +- ``country`` +- ``division`` (i.e., state, province, or canton) +- ``location`` (i.e., a county or city within a division) -To define a new place, you’ll need to specify its GPS coordinates and a color. +To define a new place, you'll need to specify its GPS coordinates and a color. 1. Add a line to ``defaults/lat_longs.tsv``. This file is separated into sections for each geographic resolution. This looks like: @@ -45,7 +55,7 @@ To define a new place, you’ll need to specify its GPS coordinates and a color. .. 
- Note: keep in mind that ``0.0`` longitude is the prime meridian; to specify something in the Western hemisphere, you’ll need to enter a *negative* value for longitude. Similarly, to specify something in the Southern hemisphere, you’ll need to enter a *negative* value for latitude + Note: keep in mind that ``0.0`` longitude is the prime meridian; to specify something in the Western hemisphere, you'll need to enter a *negative* value for longitude. Similarly, to specify something in the Southern hemisphere, you'll need to enter a *negative* value for latitude. 2. Add an entry to ``color_ordering.tsv`` such that your newly-defined place is next to geographically nearby places in the list. @@ -55,12 +65,12 @@ Subsampling Basic subsampling ~~~~~~~~~~~~~~~~~ -Reasonable defaults are pre-defined. You can find a `description of them here <../tutorial/running.md>`__. +Reasonable defaults are pre-defined. You can find a :ref:`list of them here `. Custom subsampling schemes ~~~~~~~~~~~~~~~~~~~~~~~~~~ -We implement hierarchical subsampling by producing multiple samples at different geographic scales and merge these samples into one file for further analysis. A build can specify any number of such samples which can be flexibly restricted to particular meta data fields and subsampled from groups with particular properties. When specifying subsampling in this way, we’ll first take sequences from the ‘focal’ area, and the select samples from other geographical areas. Read further for information on how we select these samples. Here, we’ll look at the advanced example (``./my_profiles/example_advanced_customization``) file to explain some of the options. +We implement hierarchical subsampling by producing multiple samples at different geographic scales and merging these samples into one file for further analysis. A build can specify any number of such samples, which can be flexibly restricted to particular metadata fields and subsampled from groups with particular properties. 
When specifying subsampling in this way, we'll first take sequences from the 'focal' area, and then select samples from other geographical areas. Read further for information on how we select these samples. Here, we'll look at `the advanced example config file `__ to explain some of the options. When specifying how many sequences you want in a subsampling level (for example, from a country or a region), you can do this using either ``seq_per_group`` or ``max_sequences`` - these work with the ``group_by`` argument. For example, ``switzerland`` subsampling rules in the advanced example look like this: @@ -94,7 +104,7 @@ For ``country``-level sampling above, we specify that we want a maximum of 1,500 Alternatively, in the ``region``-level sampling, we set ``seq_per_group`` to 20. This means that all the sequences from Europe (excluding Switzerland) will be divided into groups by their sampling country, month, and year (as defined by ``group_by``), and then 20 sequences will be taken from each group (if there are fewer than 20 in any given group, all of the samples from that group will be taken). -Now we’ll look at a subsampling scheme which defines a multi-``canton`` build. Cantons are regional divisions in Switzerland - below ‘country,’ but above ‘location’ (often city-level). In the advanced example, we’d like to be able to specify a set of neighboring ‘cantons’ and do focal sampling there, with contextual samples from elsewhere in the country, other countries in the region, and other regions in the world. +Now we'll look at a subsampling scheme which defines a multi-``canton`` build. Cantons are regional divisions in Switzerland - below 'country,' but above 'location' (often city-level). In the advanced example, we'd like to be able to specify a set of neighboring 'cantons' and do focal sampling there, with contextual samples from elsewhere in the country, other countries in the region, and other regions in the world. 
For cantons this looks like this: @@ -141,9 +151,9 @@ For cantons this looks like this: type: "proximity" focus: "country" -All entries above canton level (the ‘contextual’ samples) specify priorities. Currently, we have only implemented one type of priority called ``proximity``. It attempts to selected sequences as close as possible to the focal samples specified as ``focus: division``. The argument of the latter has to match the name of one of the other subsamples. +All entries above canton level (the 'contextual' samples) specify priorities. Currently, we have only implemented one type of priority called ``proximity``. It attempts to select sequences as close as possible to the focal samples specified as ``focus: division``. The argument of the latter has to match the name of one of the other subsamples. -In addition to the ``exclude`` filter, you can also specify strains to keep by providing a ``query``. The ``query`` field uses augur filter’s ``--query`` argument (introduced in version 8.0.0) and supports `pandas-style logical operators `__. For example, the following exclusionary filter, +In addition to the ``exclude`` filter, you can also specify strains to keep by providing a ``query``. The ``query`` field uses augur filter's ``--query`` argument (introduced in version 8.0.0) and supports `pandas-style logical operators `__. For example, the following exclusionary filter, .. code:: yaml @@ -155,12 +165,12 @@ can also be written as an inclusionary filter like so: query: --query "(region == {region}) & (country == {country}) & (division == '{division}')" -If you need parameters in a way that isn’t represented by the configuration file, `create a new issue in the ncov repository `__ to let us know. +If you need to set parameters in a way that isn't represented by the config file, `create a new issue in the ncov repository `__ to let us know. 
Ancestral trait reconstruction ------------------------------ -Trait reconstruction is the process by which augur infers the most likely metadata value of an internal node. For example, if an internal node (which always represents a hypothesized, ancestral virus / case) has 3 descendants, all of which were isolated in Washington State, we might infer that the ancestor was most likely also circulating in Washington State (see `“Interpretation” <../visualization/interpretation.md>`__ for more). +Trait reconstruction is the process by which augur infers the most likely metadata value of an internal node. For example, if an internal node (which always represents a hypothesized, ancestral virus / case) has 3 descendants, all of which were isolated in Washington State, we might infer that the ancestor was most likely also circulating in Washington State (see :doc:`../visualization/interpretation` for more). For each build, you can specify which categorical metadata fields to use for trait reconstruction. @@ -168,7 +178,7 @@ For each build, you can specify which categorical metadata fields to use for tra -To specify this on a per-build basis, add a block like the following to your ``my_profiles//builds.yaml`` file: +To specify this on a per-build basis, add a block like the following to your workflow config file: .. code:: yaml @@ -180,11 +190,11 @@ To specify this on a per-build basis, add a block like the following to your ``m Labeling clades --------------- -We assign clade labels according to `this schema <../reference/naming_clades.md>`__. +We assign clade labels according to :doc:`this schema <../reference/naming_clades>`. Because the exact topology of the tree will vary across runs, clades are defined based on their unique mutations. These are specified in ``defaults/clades.tsv`` like so: -.. 
code:: tsv +:: # clade gene site alt diff --git a/docs/src/images/basic_snakemake_build.png b/docs/src/images/basic_nextstrain_build.png similarity index 100% rename from docs/src/images/basic_snakemake_build.png rename to docs/src/images/basic_nextstrain_build.png diff --git a/docs/src/index.rst b/docs/src/index.rst index dc1ec5edc..1e48f0219 100644 --- a/docs/src/index.rst +++ b/docs/src/index.rst @@ -28,12 +28,44 @@ If you have a specific question, post a note over at the `discussion board diff --git a/docs/src/reference/customizing-visualization.rst b/docs/src/reference/customizing-visualization.rst deleted file mode 100644 index 07086f40b..000000000 --- a/docs/src/reference/customizing-visualization.rst +++ /dev/null @@ -1,111 +0,0 @@ -Customizing your Auspice visualization -====================================== - -Just as we can specify a build-specific analysis options in the ``builds.yaml`` file, we can also specify build-specific visualization options in this directory. - -Looking at the ``builds.yaml`` file, the last few lines are: - -.. code:: yaml - - files: - auspice_config: "my_profiles/example/my_auspice_config.json" - -This points to a JSON file that parameterizes the output files used for visualizion with Auspice. Let’s look at what kinds of customization options we can use this for. - -Custom color schemes --------------------- - -If you’d like to specify a custom color scale, you can add a ``colors.tsv`` file, where each line is a tab-delimited list of a metadata column name; a metadata value; and a corresponding hex code. - -The first few lines of the example file look like this: - -:: - - country Russia #5E1D9D - country Serbia #4D22AD - country Europe #4530BB - ... - -Make sure to also add - -.. code:: yaml - - files: - colors: "my_profiles//colors.tsv" - -to your ``builds.yaml`` file. 
- -Changing the dataset description --------------------------------- - -The dataset description, which appears below the visualizations, is read from a file which is specified in ``builds.yaml``. Per-build description can be set by specifying them in the build. - -.. code:: yaml - - builds: - north-america: # name of the build; this can be anything - description: my_profiles/example/north-america-description.md - -If that is not provided, then a per-run description is used, also specified in ``builds.yaml``: - -.. code:: yaml - - files: - description: my_profiles/example/my_description.md - -Adding custom metadata fields to color by ------------------------------------------ - -1. Add a `valid metadata column <../guides/data-prep.md>`__ to your ``metadata.tsv`` -2. Open ``my_profiles//auspice_config.json`` -3. Add an entry to the ``colorings`` block of this JSON: - -.. code:: json - - ... - "colorings": [ - { - "key": "location", - "title": "Location", - "type": "categorical" - }, - { - "key": "metadata_column_name", - "title": "Display name for interface", - "type": "categorical" or "continuous" - } - ... - ] - ... - -Choosing defaults ------------------ - -You can specify the default view in the ``display_defaults`` block of an ``auspice_config.json`` file (see above) - -.. code:: json - - ... - "display_defaults": { - "color_by": "division", - "distance_measure": "num_date", - "geo_resolution": "division", - "map_triplicate": true, - "branch_label": "none" - }, - ... - -Choosing panels to display --------------------------- - -Similarly, you can choose which panels to enable in the ``panels`` block: - -.. code:: json - - ... - "panels": [ - "tree", - "map", - "entropy" - ] - ... 
diff --git a/docs/src/reference/data_submitter_faq.rst b/docs/src/reference/data_submitter_faq.rst index 77b575769..105eee512 100644 --- a/docs/src/reference/data_submitter_faq.rst +++ b/docs/src/reference/data_submitter_faq.rst @@ -1,40 +1,40 @@ -Data Submitter’s FAQ --------------------- +Data Submitter's FAQ +==================== -We often recieve questions from data submittors about why their data is not visible on the `Nextstrain SARS-CoV-2 runs `__. This short FAQ highlights some of the main reasons why data may not be showing up on Nextstrain. +We often receive questions from data submitters about why their data is not visible on the `Nextstrain SARS-CoV-2 runs `__. This short FAQ highlights some of the main reasons why data may not be showing up on Nextstrain. -Sequence Length & Number of N’s -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Sequence Length & Number of N's +------------------------------- -We currently only use full-genome sequences which are at least 27,000 bases in length. They also cannot have more than 3,000 bases that are ‘N’. +We currently only use full-genome sequences which are at least 27,000 bases in length. They also cannot have more than 3,000 bases that are 'N'. Subsampling -~~~~~~~~~~~ +----------- -Nextstrain runs can be subsampled considerably. There are over >30,000 whole-genome sequences available on GISAID currently, but we typically include <5,000 in each of our runs. If the division your samples are from contains more than about 100 samples per month, they are likely to be downsampled. Be sure to check the appropriate regional build - these are sampled more heavily from the focal region, so there’s a higher chance a sequence will be included in the run. We have regional builds for `North America `__, `South America `__, `Asia `__, `Africa `__, `Europe `__, and `Oceania `__. +Nextstrain runs can be subsampled considerably. 
There are more than 30,000 whole-genome sequences available on GISAID currently, but we typically include <5,000 in each of our runs. If the division your samples are from contains more than about 100 samples per month, they are likely to be downsampled. Be sure to check the appropriate regional build - these are sampled more heavily from the focal region, so there's a higher chance a sequence will be included in the run. We have regional builds for `North America `__, `South America `__, `Asia `__, `Africa `__, `Europe `__, and `Oceania `__. Missing Dates -~~~~~~~~~~~~~ +------------- We currently only include samples that have an **exact sampling date** (day, month, year). This is because we cannot accurately estimate the sample dates from the sequences at the moment, given the short duration of the pandemic so far, and the mutation rate. -If your sample has only year or only month and year as a sampling date, it will be automatically excluded from runs. If you have privacy/data sharing concerns, it’s ok to slightly change the collection date randomly by +/- 1 or 2 days. Please do *not* use the sequencing or processing date, as these can negatively influence our runs. +If your sample has only year or only month and year as a sampling date, it will be automatically excluded from runs. If you have privacy/data sharing concerns, it's ok to slightly change the collection date randomly by +/- 1 or 2 days. Please do *not* use the sequencing or processing date, as these can negatively influence our runs. If you wish to add a corrected date to your samples, simply updating the sampling date in GISAID will automatically update our system, and the sequence will be included in the next run! Many Samples with the Same Date -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +------------------------------- -If we receive many samples that have identical dates as sample dates, we may exclude these manually. 
This is because this often indicates that the ‘sample date’ given is not actually the sample date, but the sequencing, processing, or uploading date. We try to email submitters when we do this to check whether the dates are truly the collection dates. +If we receive many samples that have identical dates as sample dates, we may exclude these manually. This is because this often indicates that the 'sample date' given is not actually the sample date, but the sequencing, processing, or uploading date. We try to email submitters when we do this to check whether the dates are truly the collection dates. If you are genuinely submitting many sequences with identical dates, you can avoid us temporarily excluding them by emailing hello@nextstrain.org to let us know about the sequences and why they have the same date (ex: collected during investigation of a long-term care center). Missing USA State -~~~~~~~~~~~~~~~~~ +----------------- -We currently exclude samples from the USA which do not have a ‘division’ attribute (this is the USA state or territory where they were sampled). Adding a state/territory/division to your sample on GISAID will automatically update this on our system, and the sequence will appear in our next run. +We currently exclude samples from the USA which do not have a 'division' attribute (this is the USA state or territory where they were sampled). Adding a state/territory/division to your sample on GISAID will automatically update this on our system, and the sequence will appear in our next run. Divergence Issues -~~~~~~~~~~~~~~~~~ +----------------- For quality control, we use a combination of automated and manual checks to ensure that sequences included seem to be free of sequencing and/or assembly error. If a sequence is deemed to be far too divergent (has more mutations than we expect given the sampling date), or far too under-diverged (has far fewer mutations than we expect given the sampling date), it may be excluded. 
We cannot offer direct help in these cases, but suggest you revisit the raw sequence files with the aid of someone with experience using your sequencing pipeline, in order to correct any sequencing and assembly errors. diff --git a/docs/src/reference/files.rst b/docs/src/reference/files.rst new file mode 100644 index 000000000..d276afb99 --- /dev/null +++ b/docs/src/reference/files.rst @@ -0,0 +1,78 @@ +Files overview +============== + +This page gives an overview of the files in your local ``ncov/`` directory. + +.. contents:: + :local: + +User files +---------- + +User files are not tracked by version control, meaning they are either provided by the user or generated by the workflow. + +Analysis directory +~~~~~~~~~~~~~~~~~~ + +An :term:`analysis directory` is a non-tracked directory which contains user-defined :term:`customization files `. + +In the :doc:`tutorials <../tutorial/intro>`, the analysis directory is ``ncov-tutorial/``. Follow :ref:`these steps ` to create your own analysis directory. + +.. hint:: + + Previously, we recommended using Snakemake profiles under a ``my_profiles/`` analysis directory. We now recommend using Snakemake config files directly via the ``--configfile`` parameter. You can still use existing profiles via ``--configfile my_profiles//builds.yaml``. + +Input files +~~~~~~~~~~~ + +Learn how to prepare input files with :doc:`../guides/data-prep/index`. + +.. note:: + + A few example input files are provided when you clone ``ncov/`` locally, under ``data/``. + +- Metadata file (e.g. ``data/example_metadata.tsv``): tab-delimited description of strain (i.e., sample) attributes +- Sequences file (e.g. ``data/example_sequences.fasta.gz``): genomic sequences whose IDs must match the ``strain`` column in the metadata file. + +Output files and directories +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +These are generated by the workflow. 
+ +- ``auspice/.json``: output file for visualization in Auspice where ```` is the name of your output dataset in the workflow configuration file used by ``--configfile``. +- ``results/aligned.fasta``, etc.: raw results files (dependencies) that are shared across all datasets. +- ``results//``: raw results files (dependencies) that are specific to a single dataset. +- ``logs/``: Log files with error messages and other information about the run. +- ``benchmarks/``: Run-times (and memory usage on Linux systems) for each rule in the workflow. + +Internal files +-------------- + +These files are not intended for modification. See :doc:`../guides/workflow-config-file` on how to configure workflow behavior. + +Default workflow customization files +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +- ``defaults/parameters.yaml``: default :term:`config file`. Override these settings using ``--configfile your-config.yaml``. +- ``defaults/auspice_config.json``: default :term:`Auspice config file`. Override these settings using ``auspice_config``. +- ``defaults/include.txt``: default strain names to *include* during subsampling and filtering. +- ``defaults/exclude.txt``: default strain names to *exclude* during subsampling and filtering. + +Workflow definition files +~~~~~~~~~~~~~~~~~~~~~~~~~ + +- ``Snakefile``: entry point for Snakemake commands that also validates inputs. +- ``workflow/snakemake_rules/main_workflow.smk``: defines rules for running each step in the analysis. Modify your workflow config file rather than hardcoding changes into the Snakemake file itself. +- ``workflow/envs/nextstrain.yaml``: specifies computing environment needed to run workflow with the ``--use-conda`` flag. +- ``workflow/schemas/config.schema.yaml``: defines format (e.g., required fields and types) for workflow config files. +- ``scripts/``: helper scripts for common tasks. + +Documentation +~~~~~~~~~~~~~ + +These files are used to generate the `workflow documentation `__. 
+ +Nextstrain user files +~~~~~~~~~~~~~~~~~~~~~ + +The Nextstrain team maintains user files in the ``ncov/`` repo, under ``nextstrain_profiles/``. diff --git a/docs/src/reference/glossary.rst b/docs/src/reference/glossary.rst new file mode 100644 index 000000000..8141de55c --- /dev/null +++ b/docs/src/reference/glossary.rst @@ -0,0 +1,38 @@ +======== +Glossary +======== + +.. glossary:: + + analysis directory + + The folder within ``ncov/`` where :term:`customization files ` live. Previously this was ``my_profiles/`` but we now allow any name of choice, and provide `ncov-tutorial `__ as a starter template. + + Auspice config file + also ``auspice_config.json`` + + A JSON file used to configure visualization in :term:`docs.nextstrain.org:Auspice`. + + config file + also *workflow configuration file*, ``builds.yaml`` + + A YAML file used for :term:`Snakemake`'s ``--configfile`` parameter. Appends to and overrides default configuration in ``defaults/parameters.yaml``. + + customization file + + A file used to customize the :term:`ncov workflow`. + + Examples: :term:`Auspice config file`, :term:`workflow config file`, :term:`default files` + + default files + + Default :term:`customization files ` provided in ``ncov/defaults/``. + + ncov workflow + also *SARS-CoV-2 workflow* + + The workflow used to automate execution of :term:`builds`. Implemented in :term:`Snakemake`. + + Snakemake + + The workflow manager used in the :term:`ncov workflow`. diff --git a/docs/src/reference/index.rst b/docs/src/reference/index.rst deleted file mode 100644 index 90978c481..000000000 --- a/docs/src/reference/index.rst +++ /dev/null @@ -1,20 +0,0 @@ -****************** -Reference material -****************** - -.. 
toctree:: - :maxdepth: 1 - :titlesonly: - :caption: Table of contents - - multiple_inputs - remote_inputs - configuration - metadata-fields - naming_clades - data_submitter_faq - change_log - orientation-workflow - orientation-files - customizing-analysis - customizing-visualization diff --git a/docs/src/reference/metadata-fields.rst b/docs/src/reference/metadata-fields.rst index ced0c4f45..096b1bd74 100644 --- a/docs/src/reference/metadata-fields.rst +++ b/docs/src/reference/metadata-fields.rst @@ -1,108 +1,140 @@ -Standard Nextstrain metadata fields -=================================== +Standard metadata fields +======================== -**Column 1: ``strain``** +.. contents:: Table of Contents + :local: -This needs to match the name of a sequence in the FASTA file exactly and must not contain characters such as spaces, or ``()[]{}|#><``. In our example we have a strain called “NewZealand/01/2020” so there should be a sequence in the FASTA file for “>NewZealand/01/2020” (sequence names in FASTA files always start with the ``>`` character, but this is not part of the name). +Column 1: ``strain`` +-------------------------------------- -**Note that “strain” here carries no biological or functional significance** and should be thought of as synonymous with sample. +This needs to match the name of a sequence in the FASTA file exactly and must not contain characters such as spaces, or ``()[]{}|#><``. In our example we have a strain called ``NewZealand/01/2020`` so there should be a sequence in the FASTA file for ``>NewZealand/01/2020`` (sequence names in FASTA files always start with the ``>`` character, but this is not part of the name). -**Column 2: ``virus``** +.. note:: + + **"Strain" here carries no biological or functional significance** and should be thought of as synonymous with sample. + +Column 2: ``virus`` +-------------------------------------- Name of the pathogen. 
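The character rules for Column 1 above can be checked mechanically before a run. A minimal Python sketch (``valid_strain`` is a hypothetical helper for illustration, not part of the workflow):

```python
# Characters disallowed in a ``strain`` value, per Column 1 above.
DISALLOWED = set("()[]{}|#><")

def valid_strain(name: str) -> bool:
    """Hypothetical helper: check that a strain name contains no
    spaces and none of the disallowed characters."""
    return " " not in name and not (set(name) & DISALLOWED)

print(valid_strain("NewZealand/01/2020"))  # True
print(valid_strain("bad strain#1"))        # False
```

Remember that passing this check is necessary but not sufficient: the name must also match a FASTA header exactly.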
-**Column 3: ``gisaid_epi_isl``** +Column 3: ``gisaid_epi_isl`` +-------------------------------------- + +If this genome is shared via `GISAID `__ then please include the EPI ISL here. In our example this is ``EPI_ISL_413490``. -If this genome is shared via `GISAID `__ then please include the EPI ISL here. In our example this is “EPI_ISL_413490”. +Column 4: ``genbank_accession`` +-------------------------------------- -**Column 4: ``genbank_accession``** +If this genome is shared via `GenBank `__ then please include the accession number here. In our example this is ``?`` indicating that it hasn't (yet) been deposited in GenBank. (See above for more information on how to encode missing data.) -If this genome is shared via `GenBank `__ then please include the accession number here. In our example this is “?” indicating that it hasn’t (yet) been deposited in GenBank. (See above for more information on how to encode missing data.) +.. _metadata-column-date: -**Column 5: ``date``** (really important!) +Column 5: ``date`` (really important!) +-------------------------------------- -This describes the sample collection date (*not* sequencing date!) and must be formated according as ``YYYY-MM-DD``. Our example was collected on Feb 27, 2020 and is therefore represented as “2020-02-27”. +This describes the sample collection date (*not* sequencing date!) and must be formatted as ``YYYY-MM-DD``. Our example was collected on Feb 27, 2020 and is therefore represented as ``2020-02-27``. You can specify unknown days or months by replacing the respective values with ``XX`` (ex: ``2013-01-XX`` or ``2011-XX-XX``), and completely unknown dates can be shown with ``20XX-XX-XX`` (which does not restrict the sequence to being in the 21st century - they could be earlier). Please be aware that our current pipeline will filter out any genomes with an unknown date; however, you can change this for your pipeline! 
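The date rules above can be expressed as a short check. A minimal Python sketch (``is_exact_date`` is a hypothetical helper, not part of the pipeline), assuming the default behavior of excluding any genome whose date is ambiguous:

```python
import re

def is_exact_date(value: str) -> bool:
    """Hypothetical helper: True only for a complete YYYY-MM-DD
    collection date; any ``XX`` ambiguity marker makes it inexact."""
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", value) is not None

print(is_exact_date("2020-02-27"))  # True
print(is_exact_date("2013-01-XX"))  # False
print(is_exact_date("20XX-XX-XX"))  # False
```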
-See `this guide `__ to formatting dates in Excel +See `this guide `__ to formatting dates in Excel. -**Column 6: ``region``** +Column 6: ``region`` +-------------------------------------- -The region the sample was collected in – for our example this is “Oceania”. Please use either “Africa”, “Asia”, “Europe”, “North America”, “Oceania” or “South America”. If you sequence a genome from Antartica, please get in touch! +The region the sample was collected in - for our example this is ``Oceania``. Please use either ``Africa``, ``Asia``, ``Europe``, ``North America``, ``Oceania`` or ``South America``. If you sequence a genome from Antarctica, please get in touch! -**Column 7: ``country``** +Column 7: ``country`` +-------------------------------------- -The country the sample was collected in. Our example, “NewZealand/01/2020”, was collected in ……. New Zealand. You can run ``tail +2 data/metadata.tsv | cut -f 7 | sort | uniq`` to see all the countries currently present in the metadata. As of April 10, there were 64! 🌎 +The country the sample was collected in. Our example, ``NewZealand/01/2020``, was collected in ……. New Zealand. You can run ``tail +2 data/metadata.tsv | cut -f 7 | sort | uniq`` to see all the countries currently present in the metadata. As of April 10, 2020, there were 64! 🌎 -**Column 8: ``division``** +Column 8: ``division`` +-------------------------------------- -Division currently doesn’t have a precise definition and we use it differently for different regions. For instance for samples in the USA, division is the state in which the sample was collected here. For other countries, it might be a county, region, or other administrative sub-division. To see the divisions which are currently set for your country, you can run the following command (replace “New Zealand” with your country): +Division currently doesn't have a precise definition and we use it differently for different regions. 
For instance, for samples from the USA, division is the state in which the sample was collected. For other countries, it might be a county, region, or other administrative sub-division. To see the divisions which are currently set for your country, you can run the following command (replace ``New Zealand`` with your country):

.. code:: bash

   tail +2 data/metadata.tsv | cut -f 7,8 | grep "^New Zealand" | cut -f 2 | sort | uniq

-**Column 9: ``location``**
+Column 9: ``location``
+--------------------------------------

-Similarly to ``division``, but for a smaller geographic resolution. This data is often unavailable, and missing data here is typically represented by an empty field or the same value as ``division`` is used. In our example the division is “Auckland”, which conveniently (or confusingly) is both a province of New Zealand and a city.
+Similar to ``division``, but at a smaller geographic resolution. This data is often unavailable; missing data here is typically represented by an empty field or by repeating the ``division`` value. In our example the division is "Auckland", which conveniently (or confusingly) is both a province of New Zealand and a city.

- NOTE for columns 10-12 (“exposure”). These are no longer used in the analysis pipeline, and may no longer be kept up to date in our curated metadata. They remain here as they may be useful for certain questions.

+.. note::
+
+   Columns 10-12 (``*_exposure``) are no longer used in the analysis pipeline, and may no longer be kept up to date in our curated metadata. They remain here as they may be useful for certain questions.

-**Column 10: ``region_exposure``**
+Column 10: ``region_exposure``
+--------------------------------------

-If the sample has a known travel history and infection is thought to have occured in this location, then represent this here. In our example, which represents New Zealand’s first known case, the patient had recently arrived from Iran, thus the value here is “Asia”.
Specifying these travel histories helps inform the model we use to reconstruct the geographical movements of the virus.

+If the sample has a known travel history and infection is thought to have occurred in this location, then represent this here. In our example, which represents New Zealand's first known case, the patient had recently arrived from Iran, thus the value here is "Asia". Specifying these travel histories helps inform the model we use to reconstruct the geographical movements of the virus.

If there is no travel history then set this to be the same value as ``region``.

-**Column 11: ``country_exposure``**
+Column 11: ``country_exposure``
+--------------------------------------

-Analogous to ``region_exposure`` but for ``country``. In our example, given the patient’s travel history, this is set to “Iran”.
+Analogous to ``region_exposure`` but for ``country``. In our example, given the patient's travel history, this is set to "Iran".

-**Column 12: ``division_exposure``**
+Column 12: ``division_exposure``
+--------------------------------------

-Analogous to ``region_exposure`` but for ``division``. If we don’t know the exposure division, we may specify the value for ``country_exposure`` here as well.
+Analogous to ``region_exposure`` but for ``division``. If we don't know the exposure division, we may specify the value for ``country_exposure`` here as well.

-**Column 13: ``segment``**
+Column 13: ``segment``
+--------------------------------------

-Unused. Typically the value “genome” is set here.
+Unused. Typically the value "genome" is set here.

-**Column 14: ``length``**
+Column 14: ``length``
+--------------------------------------

Genome length (numeric value).

-**Column 15: ``host``**
+Column 15: ``host``
+--------------------------------------

-Host from which the sample was collected.
Currently we have multiple values in the dataset, including "Human", "Canine", "Manis javanica" and "Rhinolophus affinis".

-**Column 16: ``age``**
+Column 16: ``age``
+--------------------------------------

Numeric age of the patient from whom the sample was collected. We round to an integer value. This will show up in Auspice when clicking on a tip in the tree, which brings up an info box.

-**Column 17: ``sex``**
+Column 17: ``sex``
+--------------------------------------

Sex of the patient from whom the sample was collected. This will show up in Auspice when clicking on a tip in the tree, which brings up an info box.

-**Column 18: ``originating_lab``**
+Column 18: ``originating_lab``
+--------------------------------------

Please see `GISAID `__ for more information.

-**Column 19: ``submitting_lab``**
+Column 19: ``submitting_lab``
+--------------------------------------

Please see `GISAID `__ for more information.

-**Column 20: ``authors``**
+Column 20: ``authors``
+--------------------------------------

-Author of the genome sequence, or the paper which announced this genome. Typically written as “LastName et al”. In our example, this is “Storey et al”. This will show up in auspice when clicking on the tip in the tree which brings up an info box.
+Author of the genome sequence, or the paper which announced this genome. Typically written as "LastName et al". In our example, this is "Storey et al". This will show up in Auspice when clicking on a tip in the tree, which brings up an info box.

-**Column 21: ``url``**
+Column 21: ``url``
+--------------------------------------

The URL, if available, pointing to the genome data. For most SARS-CoV-2 data this is https://www.gisaid.org.

-**Column 22: ``title``**
+Column 22: ``title``
+--------------------------------------

The title, if available, of the publication announcing these genomes.
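Because the metadata is a plain tab-delimited file with the columns described above, it can be inspected with Python's standard library alone. As one example, a sketch that lists strains still marked ``?`` for ``genbank_accession`` (the helper function is illustrative, not part of the workflow):

```python
import csv

def missing_accessions(metadata_lines):
    """List strains whose genbank_accession is '?' (not yet in GenBank)."""
    reader = csv.DictReader(metadata_lines, delimiter="\t")
    return [row["strain"] for row in reader
            if row.get("genbank_accession") == "?"]
```

The same ``csv.DictReader`` pattern works for any of the other columns, e.g. to tally hosts or check date formats.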
-**Column 23: ``date_submitted``**
+Column 23: ``date_submitted``
+--------------------------------------

-Date the genome was submitted to a public database (most often GISAID). In ``YYYY-MM-DD`` format (see ``date`` for more information on this formatting).
+Date the genome was submitted to a public database (most often GISAID). In ``YYYY-MM-DD`` format. See :ref:`date <metadata-column-date>` for more information on this formatting.

diff --git a/docs/src/reference/multiple_inputs.md b/docs/src/reference/multiple_inputs.md
deleted file mode 100644
index afbd6feb7..000000000
--- a/docs/src/reference/multiple_inputs.md
+++ /dev/null
@@ -1,225 +0,0 @@
-# Running an analysis starting from multiple inputs
-
-A common use case is to have a set (or sets) of SARS-CoV-2 sequences which you wish to analyse together.
-For instance, you may have a set of freshly generated genomes which you wish to analyse in the context of a larger, worldwide set of genomes such as those found on GISAID.
-This tutorial works through such a scenario.
-
-
-We have partitioned the data contained within the main example dataset into two sets:
-1. An "Australian" dataset, containing 91 genomes from Victoria, Australia. These were genomes uploaded to NCBI from [Torsten Seemann et al.,](https://www.doherty.edu.au/people/associate-professor-torsten-seemann) but are used in this tutorial to represent a small subset of genomes which may not yet be public.
-2. A "worldwide" dataset for context. Often this would be the entire NCBI/GISAID dataset, but here only includes 327 genomes for speed and data-sharing reasons.
-
-
-Our aim is to produce an analysis of the 91 Australian genomes with select worldwide genomes for context. To achieve this, we wish to apply different input-dependent filtering, subsampling and colouring steps.
- - - -## Overview of the files used in this tutorial - -The **sequences and metadata** for this tutorial are in `data/example_multiple_inputs.tar.xz` and must be decompressed via `tar xf data/example_multiple_inputs.tar.xz --directory data/`. - -You should now see the following starting files: -```sh -data/example_metadata_aus.tsv # Aus data (n=91) from Seemann et al. -data/example_sequences_aus.fasta -data/example_metadata_worldwide.tsv # Worldwide, contextual data (n=327) -data/example_sequences_worldwide.fasta -``` - -The files are small enough to be examined in a text editor -- the format of the worldwide metadata is similar to the `nextmeta.tsv` file which you may download from GISAID, whereas the format of the Australian metadata is more limited, only containing sampling date and geographic details, which may be more realistic for a newly generated sequencing run. -Note: see `data/example_metadata.tsv` for the full metadata of these Australian samples, we've intentionally restricted this here to mimic a real-world scenario. - - -The **build-specific configs** etc are in `my_profiles/example_multiple_inputs` - -```sh -my_profiles/example_multiple_inputs/config.yaml -my_profiles/example_multiple_inputs/builds.yaml # this is where the input files and parameters are specified -my_profiles/example_multiple_inputs/my_auspice_config.json -``` - - -## Setting up the config - -You can define a single input dataset in `builds.yaml` as follows. - -```yaml -inputs: - - name: my-data - metadata: "data/metadata.tsv" - sequences: "data/sequences.fasta" -``` - -For multiple inputs, you can add another entry to the `inputs` config list. 
-Here, we will give them the names "aus" and "worldwide": - -```yaml -# my_profiles/example_multiple_inputs/builds.yaml -inputs: - - name: "aus" - metadata: "data/example_metadata_aus.tsv" - sequences: "data/example_sequences_aus.fasta" - - name: "worldwide" - metadata: "data/example_metadata_worldwide.tsv" - sequences: "data/example_sequences_worldwide.fasta" -``` - -### Snakemake terminology - -Inside the Snakemake rules, we use a wildcard `origin` to define different starting points. -For instance, if we ask for the file `results/aligned_worldwide.fasta` then `wildcards.origin="worldwide"` and we expect that the config has defined -a sequences input as shown above. - -## How is metadata combined? - -The different provided metadata files (for `aus` and `worldwide`, defined above) are combined during the pipeline, and the combined metadata file includes all columns present across the different metadata files. -Looking at the individual TSVs, the `worldwide` metadata contains many more columns than the `aus` metadata does, so we can expect the the `aus` samples to have many empty values in the combined metadata. -In the case of **conflicts**, the order of the entries in the YAML matters, with the last value being used. - -Finally, we use one-hot encoding to express the origin of each row of metadata. -This means that **extra columns** will be added for each input (e.g. `aus` and `worldwide`), with values of `"yes"` or `"no"`, representing which samples are contained in each set of sequences. 
-We are going to use this to our advantage, by adding a coloring to highlight the source of sequences in auspice via `my_profiles/example_multiple_inputs/my_auspice_config.json`: - -```json -"colorings": [ - { - "key": "aus", - "title": "Source: Australia", - "type": "boolean" - } -], -"display_defaults": { - "color_by": "aus" -} -``` - - -## Input-specific filtering parameters - -The first stage of the pipeline performs filtering, masking and alignment (note that this is different to subsampling). -If we have multiple inputs, this stage of the pipeline is done independently for each input. -The parameters used for filtering steps are typically defined by the "filter" dict in the `builds.yaml`, with sensible defaults provided (see `defaults/parameters.yaml`). -For multiple inputs, we can overwrite these for each input. - -As an example, in this tutorial let's ensure we include all the `aus` samples, even if they may be partial genomes etc - -```yaml -# my_profiles/example_multiple_inputs/builds.yaml -filter: - aus: - min_length: 5000 # Allow shorter (partial) genomes - skip_diagnostics: True # skip diagnostics (which can remove genomes) for this input -``` - -## Subsampling parameters - -The second stage of the pipeline subsamples the (often large) dataset. -By this stage, the multiple inputs will have been combined into a unified alignment and metadata file (see above), however we may utilise the fact that the combined metadata has additional columns to represent which samples came from which input source (the columns `aus` and `worldwide`). -This allows us to have per-input subsampling steps. - - -In this example, we want to produce a dataset which contains: -1. _All_ of the samples from the `aus` input (i.e. all of the Australian genomes) -2. A worldwide sampling which prioritises sequences close to (1) -3. 
A random, background worldwide sampling - -```yaml -# my_profiles/example_multiple_inputs/builds.yaml -builds: - multiple-inputs: - subsampling_scheme: custom-scheme # use a custom subsampling scheme defined below - -# STAGE 2: Subsampling parameters -subsampling: - custom-scheme: - # Use metadata key to include ALL from `input1` - allFromAus: - exclude: "--exclude-where 'aus!=yes'" # subset to sequences from input `aus` - # Proximity subsampling from `worldwide` input to provide context - worldwideContext: - exclude: "--exclude-where 'aus=yes'" # i.e. subset to sequences _not_ from input `aus` - group_by: "year" # NOTE: `augur filter` needs this to use `max_sequences` (TODO) - max_sequences: 100 - priorities: - type: "proximity" - focus: "allFromAus" - worldwideBackground: - exclude: "--exclude-where 'aus=yes'" - group_by: year month - seq_per_group: 5 -``` - -## Run the build - -The following commands will run this tutorial - -```sh -tar xf data/example_multiple_inputs.tar.xz --directory data/ # make sure you have input files! -snakemake --profile my_profiles/example_multiple_inputs -f auspice/ncov_multiple-inputs.json -``` - -The resulting JSON can be dropped onto [auspice.us](https://auspice.us) for visualization. - -The following figure shows the graph (DAG) of steps which Snakemake will run to produce the target auspice JSON. -You can generate this yourself via -`snakemake --profile my_profiles/example_multiple_inputs -f auspice/ncov_multiple-inputs.json --dag | dot -Tpdf > dag.pdf`. - -![snakemake-graph](../images/multiple_inputs_dag.png) - - -## Extra examples - - -### What if I need to preprocess input files beforehand? - -A common use case may be that some of your input sequences and/or metadata may require preprocessing before the pipeline even starts, which will be use-case specific. 
-To provide an example of this, let's imagine the situation where we haven't uncompressed the starting files, and our "custom preprocessing" step will be to decompress them. -In other words, our preprocessing step will replace the need to run `tar xf data/example_multiple_inputs.tar.xz --directory data/`. - -We can achieve this by creating a snakemake rule which produces all of (or some of) the config-specified input files: - -```python -# my_profiles/example_multiple_inputs/rules.smk -rule make_starting_files: - message: - """ - Creating starting files for the multiple inputs tutorial by decompressing {input.archive} - """ - input: - archive = "data/example_multiple_inputs.tar.xz" - output: - # Note: the command doesn't use these, but adding them here makes snakemake - # aware that this rule produces them - aus_meta = "data/example_metadata_aus.tsv", - aus_seqs = "data/example_sequences_aus.fasta", - world_meta = "data/example_metadata_worldwide.tsv", - world_seqs = "data/example_sequences_worldwide.fasta" - shell: - """ - tar xf {input.archive} --directory data/ - """ -``` - -And then making our build aware of these custom rules: -```yaml -# my_profiles/example_multiple_inputs/builds.yaml -custom_rules: - - my_profiles/example_multiple_inputs/rules.smk -``` - - -### What about if my starting files are stored remotely? - -Currently we can handle files stored on S3 buckets rather than remotely by simply declaring this as the input location: - -```yaml -# your pipeline's builds.yaml config -inputs: - - name: worldwide - metadata: "s3://your_bucket_name/metadata.tsv" - sequences: "s3://your_bucket_name/sequences.fasta.xz" -``` - -> If your S3 bucket is private, make sure you have the following env variables set: `$AWS_SECRET_ACCESS_KEY` and `$AWS_ACCESS_KEY_ID`. - -> You may use `.xz` or `.gz` compression - we automatically infer this from the filename suffix. 
diff --git a/docs/src/reference/naming_clades.rst b/docs/src/reference/naming_clades.rst index 7819f4997..554ff32a5 100644 --- a/docs/src/reference/naming_clades.rst +++ b/docs/src/reference/naming_clades.rst @@ -1,5 +1,5 @@ Clade Naming & Definitions --------------------------- +========================== The nomenclature used by Nextstrain to designate clades for SARS-CoV-2 is driven by the following objectives: @@ -8,28 +8,31 @@ The nomenclature used by Nextstrain to designate clades for SARS-CoV-2 is driven - provide memorable but informative names, - gracefully handle clade naming in the upcoming years as SARS-CoV-2 becomes a seasonal virus. +.. contents:: Table of Contents + :local: + Major clades -~~~~~~~~~~~~ +------------ Definition -^^^^^^^^^^ +~~~~~~~~~~ We name a new major clade when it reaches a frequency of 20% globally at any time point. When calculating these frequencies, care has to be taken to achieve approximately even sampling of sequences in time and space since sequencing effort varies strongly between countries. A clade name consists of the year it emerged and the next available letter in the alphabet. A new clade should be at least 2 mutations away from its parent major clade. Naming -^^^^^^ +~~~~~~ -We name major clades by the year they are estimated to have emerged and a letter, e.g. 19A, 19B, 20A. The yearly reset of letters will ensure that we don’t progress too far into the alphabet, while the year-prefix provides immediate context on the origin of the clade that will become increasingly important going forward. These are meant as major genetic groupings and not intended to completely resolve genetic diversity. +We name major clades by the year they are estimated to have emerged and a letter, e.g. 19A, 19B, 20A. The yearly reset of letters will ensure that we don't progress too far into the alphabet, while the year-prefix provides immediate context on the origin of the clade that will become increasingly important going forward. 
These are meant as major genetic groupings and not intended to completely resolve genetic diversity.

-The hierarchical structure of clades is sometimes of interest. Here, the “derivation” of a major clade can be labeled with the familiar “.” notation as in 19A.20A.20C for the major clade 20C.
+The hierarchical structure of clades is sometimes of interest. Here, the "derivation" of a major clade can be labeled with the familiar "." notation as in 19A.20A.20C for the major clade 20C.

Subclades
-~~~~~~~~~
+---------

Within these major clades, we name subclades, which we will label by their parent clade and the nucleotide mutation(s) that defines them (ex: 19A/28688C). It should be noted, however, that these mutations are only meaningful in that they define the clade. Once a subclade reaches (soft) criteria on frequency, spread, and genetic distinctiveness, it will be renamed to a major clade (hypothetically 19A/28688C to 20D).

Current Clades
-~~~~~~~~~~~~~~
+--------------

+-----------------+--------------------------------------------+-------------------------+-------------------------+
| Clade | Primary Countries | Mutations | Max Frequency |

@@ -48,10 +51,10 @@ Current Clades

You can view the current clades on the Global SARS-CoV-2 Nextstrain tree `here `__.

Identifying Nextstrain Clades
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+-----------------------------

-To make it easy for users to identify the Nextstrain clade of their own sequences, we provide a clade assigment tool at `clades.nextstrain.org `__. In addition to assigning clades, this tool will call mutations in your sequences relative to the reference and performs some basic QC.
+To make it easy for users to identify the Nextstrain clade of their own sequences, we provide a clade assignment tool at `clades.nextstrain.org `__. In addition to assigning clades, this tool will call mutations in your sequences relative to the reference and perform some basic QC.
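The dotted "derivation" notation described above (e.g. 19A.20A.20C for major clade 20C) is purely mechanical: walk from a clade up through its parent clades and join the names. A sketch, using the parent relationships given as examples on this page (the function itself is hypothetical, not part of any Nextstrain tool):

```python
def derivation(clade, parents):
    """Build the dotted derivation label for a major clade, e.g. 19A.20A.20C."""
    chain = [clade]
    while chain[-1] in parents:      # walk upward until we reach a root clade
        chain.append(parents[chain[-1]])
    return ".".join(reversed(chain))
```

With ``parents = {"20C": "20A", "20A": "19A"}``, ``derivation("20C", parents)`` reproduces the 19A.20A.20C label from the text.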
-You can also use the `simple python script `__ to assign appropriate clades to sequences in a fasta file. This script is part of the ‘ncov’ github repository, but does not require running any other part of the pipeline. However ‘augur’ must be installed to run the script. This can be done `a number of different ways `__, but is often most easily done `using ‘pip’ `__.
+You can also use the `simple python script `__ to assign appropriate clades to sequences in a FASTA file. This script is part of the ``ncov`` GitHub repository, but does not require running any other part of the workflow. However, ``augur`` :doc:`must be installed ` to run the script.

Note that when running this script you can supply ``--sequences`` if your sequences require aligning first. If you have already aligned your sequences to the ``ncov`` repository reference (for example, from running the repository), you can supply ``--alignment``. If you supply sequences that are not aligned to the ``ncov`` reference, you may get bad results!

diff --git a/docs/src/reference/nextstrain-overview.rst b/docs/src/reference/nextstrain-overview.rst
new file mode 100644
index 000000000..2cce72772
--- /dev/null
+++ b/docs/src/reference/nextstrain-overview.rst
@@ -0,0 +1,51 @@
+Nextstrain overview
+===================
+
+Nextstrain has two main parts:
+
+- :term:`docs.nextstrain.org:Augur` **performs the bioinformatic analyses** required to produce a tree, map, and other inferences from your input data.
+- The outputs of Augur form the input for :term:`docs.nextstrain.org:Auspice`, **which provides the visualizations** you see on Nextstrain.org.
+
+You can find more information about how these tools fit together :doc:`here `. We'll come back to Auspice when we get to the :doc:`visualization <../visualization/sharing>` section.
+
+First, let's take a look at how Augur works.
+
+How bioinformatic analyses are managed
+--------------------------------------
+
+At its core, Augur is a collection of Python scripts, each of which handles one step in the bioinformatic analyses necessary for visualization with Auspice.
+
+As you might imagine, keeping track of the input and output files from each step individually can get very confusing, very quickly. So, **to manage all of these steps, we use a workflow manager called Snakemake**.
+
+.. note::
+
+   There are many other workflow managers out there, such as Nextflow. While we fully encourage you to use whichever workflow tools you prefer, we only provide support and maintenance for Snakemake.
+
+Snakemake is an incredibly powerful workflow manager with many complex features. For our purposes, though, we only need to understand a few things:
+
+- **Each step in a workflow is called a "rule."** The inputs, outputs, and shell commands for each step/rule are defined in a ``.smk`` file.
+- Each rule has a number of **parameters, which are specified in a ``.yaml`` file**.
+- Each rule produces **output (called a "dependency") which may be used as input to other rules**.
+
+Overview of a Nextstrain build
+------------------------------
+
+Below is an illustration of each step in a standard :term:`Nextstrain build `. Dependencies (output files from one step that act as input to the next) are indicated by grey arrows. Required input files (e.g. the sequence data generated in the `data preparation section <../guides/data-prep>`__, or other files which are part of this repo) are indicated with red outlines. As you can see in yellow, the final output is a JSON file for visualization in Auspice. We'll walk through each of these in detail in the next section.
+
+.. figure:: ../images/basic_nextstrain_build.png
+   :alt: nextstrain_build
+
+Running multiple builds
+-----------------------
+
+It is common practice to run several related builds. For example, to run one analysis on just your data and another analysis that incorporates background / contextual sequences, you could configure two different builds.
+
+The ncov workflow facilitates this through the ``builds`` section in a :term:`workflow config file `. This is covered in more detail in the :doc:`genomic surveillance tutorial <../tutorial/genomic-surveillance>`.
+
+We encourage you to take a look at `main_workflow.smk `__ to see what each rule is doing in more detail.
+
+.. note::
+
+   Not all of the included rules are essential or even desirable for every analysis; your workflow may be much simpler, depending on your goals.

diff --git a/docs/src/reference/orientation-files.rst b/docs/src/reference/orientation-files.rst
deleted file mode 100644
index 99b05676d..000000000
--- a/docs/src/reference/orientation-files.rst
+++ /dev/null
@@ -1,50 +0,0 @@
-Overview of this repository (i.e., what do these files do?)
-===========================================================
-
-The files in this repository fall into one of these categories: \* Input files \* Output files and directories \* Workflow configuration files we might want to customize \* Workflow configuration files we don’t need to touch \* Documentation
-
-We’ll walk through all of the files one by one, but here are the most important ones for your reference:
-
-- Input files
-
-  - ``data/metadata.tsv``: tab-delimited description of strain (i.e., sample) attributes
-  - ``data/sequences.fasta``: genomic sequences whose ids must match the ``strain`` column in ``metadata.tsv``. `See the data preparation guide <../guides/data-prep.md>`__.
-  - ``my_profiles//builds.yaml``: workflow configuration file where you can define and parameterize the builds you want to run.
The directory name ``your_profile`` is the name of your custom analysis profile where you store this configuration and other custom files for the analysis. `See the customization guide `__. - -- Output files - - - ``auspice/.json``: output file for visualization in Auspice where ```` is the name of a build defined in the workflow configuration file. - -Input files ------------ - -- ``data/metadata.tsv``: tab-delimited description of strain (i.e., sample) attributes -- ``data/sequences.fasta``: genomic sequences whose ids must match the ``strain`` column in ``metadata.tsv``. `See the data preparation guide <../guides/data-prep.md>`__. -- ``defaults/include.txt``: list of strain names to *include* during subsampling and filtering (one strain name per line) -- ``defaults/exclude.txt``: list of strain names to *exclude* during subsampling and filtering (one strain name per line) - -Output files and directories ----------------------------- - -- ``auspice/.json``: output file for visualization in Auspice where ```` is the name of your build in the workflow configuration file. -- ``results/aligned.fasta``, etc.: raw results files (dependencies) that are shared across all builds. -- ``results//``: raw results files (dependencies) that are specific to a single build. -- ``logs/``: Log files with error messages and other information about the run. -- ``benchmarks/``: Run-times (and memory usage on Linux systems) for each rule in the workflow. - -Workflow configuration files we might want to customize -------------------------------------------------------- - -- ``my_profiles//builds.yaml``: workflow configuration file where you can define and configure all the builds you’d like to run. `See the customization guide `__. -- ``my_profiles//config.yaml``: `Snakemake profile configuration `__ where you can define the number of cores to use at once, etc. `See the customization guide `__. -- ``defaults/parameters.yaml``: default workflow configuration parameters. 
Override these settings in the workflow configuration file (``builds.yaml``) above. -- ``defaults/auspice_config.json``: default visualization configuration file. Override these settings in ``my_profiles//auspice_config.json``. `See the customization guide for visualizations `__. - -Workflow configuration files we don’t need to touch ---------------------------------------------------- - -- ``Snakefile``: entry point for Snakemake commands that also validates inputs. No modification needed. -- ``workflow/snakemake_rules/main_workflow.smk``: defines rules for running each step in the analysis. Modify your ``builds.yaml`` file, rather than hardcode changes into the snakemake file itself. -- ``workflow/envs/nextstrain.yaml``: specifies computing environment needed to run workflow with the ``--use-conda`` flag. No modification needed. -- ``workflow/schemas/config.schema.yaml``: defines format (e.g., required fields and types) for ``builds.yaml`` files. This can be a useful reference, but no modification needed. -- ``scripts/``: helper scripts for common tasks. No modification needed. diff --git a/docs/src/reference/orientation-workflow.rst b/docs/src/reference/orientation-workflow.rst deleted file mode 100644 index 10184aa19..000000000 --- a/docs/src/reference/orientation-workflow.rst +++ /dev/null @@ -1,47 +0,0 @@ -Orientation: so, what does Nextstrain *do*? -=========================================== - -| Nextstrain has two main parts: \* **Augur performs the bioinformatic analyses** required to produce a tree, map, and other inferences from your input data. -| \* The outputs of augur form the input for **Auspice, which provides the visualizations** you see on Nextstrain.org - -You can find more information about how these tools fit together `here `__. We’ll come back to Auspice when we get to the `visualization <../visualization/sharing.md>`__ section. - -First, let’s take a look at how augur works. 
- -How bioinformatic analyses are managed --------------------------------------- - -At its core, augur is a collection of Python scripts, each of which handles one step in the bioinformatic analyses necessary for visualization with auspice. - -As you might imagine, keeping track of the input and output files from each step individually can get very confusing, very quickly. So, **to manage all of these steps, we use a workflow manager called snakemake**. - - *Note: there are many other workflow managers out there, such as nextflow. While we fully encourage you to use whichever workflow tools you prefer, we only provide support and maintenance for snakemake.* - -Snakemake is an incredibly powerful workflow manager with many complex features. For our purposes, though, we only need to understand a few things: - -- **Each step in a workflow is called a “rule.”** The inputs, outputs, and shell commands for each step/rule are defined in a ``.smk`` file. -- Each rule has a number of **parameters, which are specified in a ``.yaml`` file**. -- Each rule produces **output (called a “dependency”) which may be used as input to other rules**. - -Overview of a Nextstrain “build” (analysis workflow) ----------------------------------------------------- - -Below is an illustration of each step in a standard Nextstrain analysis workflow. Dependencies (output files from one step that act as input to the next) are indicated by grey arrows. Input files which must be provided are indicated with red outlines. As you can see in yellow, the final output is a JSON file for visualization in auspice. - -Required input files (e.g. the sequence data generated in the `data preparation section <../guides/data-prep.md>`__, or other files which are part of this repo) are indicated with red outlines. We’ll walk through each of these in detail in the next section. - -.. 
figure:: ../images/basic_snakemake_build.png - :alt: snakemake_workflow - - snakemake_workflow - -We encourage you to take a look at ```main_workflow.smk`` `__ to see what each rule is doing in more detail. - - Note: Not all of the rules included are essential, or may even be desirable for your analysis. Your build may be able to be made a lot simpler, depending on your goals. - -What’s a “build?” -~~~~~~~~~~~~~~~~~ - -The components in this diagram **constitute a Nextstrain “build” – i.e., a set of commands, parameters and input files which work together to reproducibly execute bioinformatic analyses and generate a JSON for visualization with auspice.** You can learn more about builds `here `__. - -Builds are particularly important if you frequently want to run several different analysis workflows or datasets. For example, if you wanted to run one analysis on just your data and another analysis that incorporates background / contextual sequences, you could configure two different *builds* (one for each of these workflows). This is covered in more detail in the `genomic surveillance tutorial <../tutorial/genomic-surveillance.html>`__. diff --git a/docs/src/reference/remote_inputs.rst b/docs/src/reference/remote_inputs.rst index c392b67aa..ca4a3ab3b 100644 --- a/docs/src/reference/remote_inputs.rst +++ b/docs/src/reference/remote_inputs.rst @@ -1,9 +1,9 @@ -Overview of remote nCoV files (intermediate build assets) -========================================================= +Remote inputs +============= -This page provides an overview of intermediate files which Nextstrain produces. Where appropriate, these files can be starting points for the `ncov pipeline `__ (discussed below). +This page provides an overview of intermediate files which Nextstrain produces via daily workflow runs. Where appropriate, these files can be starting points for the `ncov workflow `__ (discussed below). 
-We have two GitHub repositories which routinely upload files to `S3 buckets `__: `ncov-ingest `__ and `ncov `__. Each of those runs separate pipelines for GISAID and GenBank (aka “open”) data sources; these pipelines start with data curation and QC steps and end with the phylogenetic analyses you can see on `nextstrain.org `__ +We have two GitHub repositories which routinely upload files to `S3 buckets `__: `ncov-ingest `__ and `ncov `__. Each of those runs separate pipelines for GISAID and GenBank (aka "open") data sources; these pipelines start with data curation and QC steps and end with the phylogenetic analyses you can see on `nextstrain.org `__ The GISAID data is stored at ``s3://nextstrain-ncov-private`` and is not publicly available, in line with the GISAID Terms of Use (this is used internally by Nextstrain). @@ -13,9 +13,9 @@ The open (GenBank) data is publicly available at three endpoints: - ``s3://nextstrain-data/files/ncov/open/`` - ``gs://nextstrain-data/files/ncov/open/`` (mirrored daily from S3 by the Broad Institute) -**Our intention is to make GenBank intermediate files open and available for everyone to use, and to keep these files up-to-date.** The paths for specific files are the same under each endpoint, e.g. ``https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz``, ``s3://nextstrain-data/files/ncov/open/metadata.tsv.gz``, and ``gs://nextstrain-data/files/ncov/open/metadata.tsv.gz`` all exist. See below for a list of files that exist. If you’re running workflows on AWS or GCP compute that fetch this data, please use the S3 or GS URLs, respectively, for cheaper (for us) and faster (for you) data transfers. Otherwise, please use the https://data.nextstrain.org URLs. +**Our intention is to make GenBank intermediate files open and available for everyone to use, and to keep these files up-to-date.** The paths for specific files are the same under each endpoint, e.g. 
``https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz``, ``s3://nextstrain-data/files/ncov/open/metadata.tsv.gz``, and ``gs://nextstrain-data/files/ncov/open/metadata.tsv.gz`` all exist. See below for a list of files that exist. If you're running workflows on AWS or GCP compute that fetch this data, please use the S3 or GS URLs, respectively, for cheaper (for us) and faster (for you) data transfers. Otherwise, please use the https://data.nextstrain.org URLs. -Note that even though the ``s3://nextstrain-data/`` and ``gs://nextstrain-data/`` buckets are public, the defaults for most S3 and GS clients require *some* user to be authenticated, though the specific user/account doesn’t matter. In the rare case you need to access the S3 or GS buckets anonymously, the easiest way is to configure your inputs using ``https://nextstrain-data.s3.amazonaws.com/files/ncov/open/`` or ``https://storage.googleapis.com/nextstrain-data/files/ncov/open/`` URLs instead. +Note that even though the ``s3://nextstrain-data/`` and ``gs://nextstrain-data/`` buckets are public, the defaults for most S3 and GS clients require *some* user to be authenticated, though the specific user/account doesn't matter. In the rare case you need to access the S3 or GS buckets anonymously, the easiest way is to configure your inputs using ``https://nextstrain-data.s3.amazonaws.com/files/ncov/open/`` or ``https://storage.googleapis.com/nextstrain-data/files/ncov/open/`` URLs instead. Depending on your execution environment, you may need to install additional Python dependencies for specific support of the different URL schemes (``https``, ``s3``, ``gs``). The workflow will produce an error at the start if additional dependencies are needed to fetch your configured inputs. Both ``https`` and ``s3`` should work out of the box in the standard Nextstrain Conda and Docker execution environments. 
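As a sketch of how these endpoints plug into a workflow, an input in the workflow config file can point its metadata at one of the URLs above. This is illustrative only: the input name is arbitrary, ``metadata.tsv.gz`` is the file named above, and the ``aligned`` intermediate shown here is an assumption — confirm it against the file list below.

```yaml
inputs:
  # Illustrative input definition using the open (GenBank) intermediates.
  - name: open-data
    # On AWS compute, prefer the S3 URL for cheaper/faster transfers:
    metadata: s3://nextstrain-data/files/ncov/open/metadata.tsv.gz
    # Elsewhere, use the equivalent https://data.nextstrain.org URL instead:
    # metadata: https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz
    aligned: s3://nextstrain-data/files/ncov/open/aligned.fasta.xz  # assumed filename
```

Remember that fetching ``s3://`` or ``gs://`` inputs may require the additional Python dependencies mentioned above, while ``https`` works out of the box in the standard execution environments.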
@@ -127,7 +127,7 @@ Each regional build (``global``, ``africa``, ``asia``, ``europe``, ``north-ameri Starting your build from these intermediates -------------------------------------------- -Each workflow defines one or more inputs in the ``builds.yaml`` file. +Each workflow defines one or more inputs in the workflow config file. In the simplest form, an input specifies a local path to some metadata and sequences, like so: diff --git a/docs/src/reference/troubleshoot.rst b/docs/src/reference/troubleshoot.rst new file mode 100644 index 000000000..085654647 --- /dev/null +++ b/docs/src/reference/troubleshoot.rst @@ -0,0 +1,63 @@ +Troubleshoot common issues +========================== + +If you have a question that is not addressed here, please don't hesitate to `ask for help `__. + +My country / division does not show up on the map +------------------------------------------------- + +This is most often a result of the country / division not being present in `the file defining the latitude & longitude of each deme `__. Adding it to that file (and rerunning the Snakemake rules downstream of this) should fix this. + +My trait (e.g. division) is grey instead of colored +--------------------------------------------------- + +We generate the colors from the ``colors`` rule in the Snakefile, which uses the `ordering TSV `__ to generate these. See :doc:`../guides/workflow-config-file` for more info. + +*A note about locations and colors:* Unless you want to specifically override the colors generated, it's usually easier to *add* information to the default ``ncov`` files, so that you can benefit from all the information already in those files.
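To make the two fixes above concrete, here is a sketch of the entries you might add. The column layouts mirror the tab-separated default ``ncov`` files, but the deme name and coordinates are made up for illustration — check the linked defaults files for the exact columns.

```text
# defaults/lat_longs.tsv (tab-separated: geographic scale, deme, latitude, longitude)
division	Example Division	45.00	9.00

# defaults/color_ordering.tsv (tab-separated: trait, value; row order drives color assignment)
division	Example Division
```

After editing either file, rerun the workflow so the downstream ``colors`` and export rules pick up the new entries.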
+ +My genomes aren't included in the analysis +------------------------------------------ + +There are a few steps where sequences can be removed: + +- During the ``filter`` step: + + - Samples that are included in `the exclude file `__ are removed + - Samples that fail the current filtering criteria, as defined in the ``parameters.yaml`` file, are removed. You can modify the snakefile as desired, but currently these are: + + - Minimum sequence length of 25kb + - No ambiguity in (sample collection) date + + - Samples may be randomly removed during subsampling; see :doc:`../guides/workflow-config-file` for more info. + - During the ``refine`` step, where samples that deviate more than 4 interquartile ranges from the root-to-tip vs. time regression are removed + +Sequencing and alignment errors +------------------------------- + +Genome sequencing, bioinformatic processing of the raw data, and alignment of the sequences are all steps where errors can slip in. Such errors can distort the phylogenetic analysis. To prevent sequences with known problems from skewing the analysis, we keep a list of problematic sequences in ``config/exclude.txt`` and filter them out. To facilitate spotting such problematic sequences, we added an additional quality control step that produces the files + +- ``results/sequence-diagnostics.tsv`` +- ``results/flagged-sequences.tsv`` +- ``results/to-exclude.txt`` + +These files are the output of ``scripts/diagnostics.py`` and are produced by rule ``diagnostic``. The first file contains statistics for every sequence in the alignment, sorted by divergence from highest to lowest. The second file contains only those sequences with diagnostics exceeding thresholds, each with their specific reason for flagging; these are sorted by submission date (newest to oldest). The third file contains only the names of the flagged sequences and mirrors the format of ``config/exclude.txt``. These names could be added to ``config/exclude.txt`` for permanent exclusion.
Note, however, that some sequences might look problematic due to alignment issues rather than intrinsic problems with the sequence. The flagged sequences will be excluded from the current run. + +To run only the sequence diagnostic, you can specify any of the three files above as a target, or use the ``diagnostic`` target: + +.. code:: bash + + nextstrain build ... diagnostic + +In addition, we provide rules to re-examine the sequences in ``config/exclude.txt``. By running + +.. code:: bash + + nextstrain build ... diagnose_excluded + +the workflow will produce + +- ``results/excluded-sequence-diagnostics.tsv`` +- ``results/excluded-flagged-sequences.tsv`` +- ``results/check-exclusion.txt`` + +These files are meant to facilitate checking whether sequences in ``config/exclude.txt`` are excluded for valid reasons. diff --git a/docs/src/reference/configuration.rst b/docs/src/reference/workflow-config-file.rst similarity index 95% rename from docs/src/reference/configuration.rst rename to docs/src/reference/workflow-config-file.rst index 5d2a85704..f11338d67 100644 --- a/docs/src/reference/configuration.rst +++ b/docs/src/reference/workflow-config-file.rst @@ -1,7 +1,9 @@ -.. cssclass:: configurationparameters +.. cssclass:: configuration-reference -Configuration parameters for Nextstrain SARS-CoV-2 workflow -=========================================================== +Workflow config file reference +============================== + +This is the detailed reference for sections in a :term:`workflow config file `. For example use cases, see :doc:`../guides/workflow-config-file`. .. contents:: Table of Contents :local: @@ -232,7 +234,7 @@ sampling_scheme - type: string - description: A flag to pass to ``augur filter`` that specifies whether to enable probabilistic sampling or not. Probabilistic sampling is useful when there are more groups than requested sequences.
-- default: ``--probabilistic-sampling`` (Augur’s default) +- default: ``--probabilistic-sampling`` (Augur's default) - examples: - ``--probabilistic-sampling`` @@ -392,7 +394,7 @@ auspice_json_prefix ------------------- - type: string -- description: Prefix to use for Auspice JSON outputs. Change this value to produce JSONs named like ``auspice/_global.json`` for a build named ``global``, for example. If you are using `Nextstrain’s Community Sharing `__ to view your builds, set this value to your GitHub repository name and the ``ncov`` default. For example, if your repository is named ``evolution``, set ``auspice_json_prefix: evolution_ncov`` to get JSONs you can view your ``global`` build at https://nextstrain.org/community/*your_github_organization*/evolution/ncov/global. +- description: Prefix to use for Auspice JSON outputs. Change this value to produce JSONs named like ``auspice/_global.json`` for a build named ``global``, for example. If you are using :doc:`Nextstrain's Community Sharing ` to view your builds, set this value to your GitHub repository name and the ``ncov`` default. For example, if your repository is named ``evolution``, set ``auspice_json_prefix: evolution_ncov`` to get JSONs you can view your ``global`` build at https://nextstrain.org/community/*your_github_organization*/evolution/ncov/global. - default: ``ncov`` @@ -436,7 +438,7 @@ conda_environment ----------------- - type: string -- description: Path to a Conda environment file to use for the workflow when the workflow is run with `Snakemake’s ``--use-conda`` flag `__. +- description: Path to a Conda environment file to use for the workflow when the workflow is run with `Snakemake's ``--use-conda`` flag `__. - default: ``workflow/envs/nextstrain.yaml`` custom_rules @@ -609,7 +611,7 @@ database_id_columns ~~~~~~~~~~~~~~~~~~~ - type: object -- description: A list of columns representing external database ids for metadata records. 
These unique ids represent a snapshot of data at a specific time for a given strain name. The sanitize metadata script resolves duplicate metadata records for the same strain name by selecting the record with the latest database id. Multiple database id columns allow the script to resolve duplicates when one or more columns has ambiguous values (e.g., “?”). Deduplication occurs before renaming of columns, so the default values include GISAID’s own “Accession ID” as well as Nextstrain-style database ids. +- description: A list of columns representing external database ids for metadata records. These unique ids represent a snapshot of data at a specific time for a given strain name. The sanitize metadata script resolves duplicate metadata records for the same strain name by selecting the record with the latest database id. Multiple database id columns allow the script to resolve duplicates when one or more columns has ambiguous values (e.g., “?”). Deduplication occurs before renaming of columns, so the default values include GISAID's own “Accession ID” as well as Nextstrain-style database ids. - default: .. code:: yaml @@ -714,7 +716,7 @@ run_pangolin ------------ - type: boolean -- description: Enable annotation of Pangolin lineages for a given build’s subsampled sequences. +- description: Enable annotation of Pangolin lineages for a given build's subsampled sequences. - default: ``false`` @@ -946,7 +948,7 @@ sampling_bias_correction ~~~~~~~~~~~~~~~~~~~~~~~~ - type: float -- description: A rough estimate of how many more events would have been observed if sequences represented an even sample. `See the documentation for ``augur traits`` for more details `__. +- description: A rough estimate of how many more events would have been observed if sequences represented an even sample. :doc:`See the documentation for augur traits for more details `. - default: ``2.5`` columns @@ -988,13 +990,13 @@ max_date 2. a date in ISO 8601 date format (i.e. YYYY-MM-DD) (e.g. 
'2020-06-04') or 3. a backwards-looking relative date in ISO 8601 duration format with optional P prefix (e.g. '1W', 'P1W') -- default: without value supplied, defaults to today’s date minus ``recent_days_to_censor`` parameter +- default: without value supplied, defaults to today's date minus ``recent_days_to_censor`` parameter recent_days_to_censor ~~~~~~~~~~~~~~~~~~~~~ - type: integer -- description: How many days back from today’s date should samples be hidden from frequencies calculations? This is in place to help with sampling bias where some regions have faster sequencing turnarounds than other regions. +- description: How many days back from today's date should samples be hidden from frequencies calculations? This is in place to help with sampling bias where some regions have faster sequencing turnarounds than other regions. - default: without value supplied, defaults to ``0`` pivot_interval diff --git a/docs/src/tutorial/custom-data.rst b/docs/src/tutorial/custom-data.rst index ddaac3c84..216efea8e 100644 --- a/docs/src/tutorial/custom-data.rst +++ b/docs/src/tutorial/custom-data.rst @@ -10,7 +10,7 @@ Prerequisites ------------- 1. :doc:`example-data`. This tutorial sets up the command line environment used in the following tutorial. -2. You have a GISAID account. `Register `__ if you do not have an account yet. However, registration may take a few days. Follow :doc:`alternative data preparation methods <../reference/data-prep/index>` in place of **Curate data from GISAID** if you wish to continue the following tutorial in the meantime. +2. You have a GISAID account. `Register `__ if you do not have an account yet. However, registration may take a few days. Follow :doc:`alternative data preparation methods <../guides/data-prep/index>` in place of **Curate data from GISAID** if you wish to continue the following tutorial in the meantime. Setup ----- @@ -82,7 +82,7 @@ We will retrieve 10 sequences from GISAID's EpiCoV database. .. 
hint:: - Read :doc:`the full data prep guide <../reference/data-prep/index>` for other ways to curate custom data. + Read :doc:`the full data prep guide <../guides/data-prep/index>` for other ways to curate custom data. Run the workflow ---------------- diff --git a/docs/src/tutorial/example-data.rst b/docs/src/tutorial/example-data.rst index f5dbff60a..dcddcc203 100644 --- a/docs/src/tutorial/example-data.rst +++ b/docs/src/tutorial/example-data.rst @@ -70,7 +70,7 @@ The workflow can take several minutes to run. While it is running, you can learn The ``refine`` entry specifies the root sequence for the example GenBank data. - For more information, see :doc:`../reference/configuration-reference`. + For more information, see :doc:`../reference/workflow-config-file`. The workflow output produces a new directory ``auspice/`` containing a file ``ncov_default-build.json``, which will be visualized in the following section. The workflow also produces intermediate files in a new ``results/`` directory. diff --git a/docs/src/tutorial/genomic-surveillance.rst b/docs/src/tutorial/genomic-surveillance.rst index 35701cb04..39c70653e 100644 --- a/docs/src/tutorial/genomic-surveillance.rst +++ b/docs/src/tutorial/genomic-surveillance.rst @@ -10,7 +10,7 @@ Prerequisites ------------- 1. :doc:`custom-data`. This tutorial introduces concepts expanded by the following tutorial. -2. You have a GISAID account. `Register `__ if you do not have an account yet. However, registration may take a few days. Follow :doc:`alternative data preparation methods <../reference/data-prep/index>` in place of **Curate data from GISAID** if you wish to continue the following tutorial in the meantime. +2. You have a GISAID account. `Register `__ if you do not have an account yet. However, registration may take a few days. Follow :doc:`alternative data preparation methods <../guides/data-prep/index>` in place of **Curate data from GISAID** if you wish to continue the following tutorial in the meantime. 
Setup ----- diff --git a/docs/src/tutorial/index.rst b/docs/src/tutorial/index.rst deleted file mode 100644 index f195fe0a5..000000000 --- a/docs/src/tutorial/index.rst +++ /dev/null @@ -1,15 +0,0 @@ -******** -Tutorial -******** - -.. toctree:: - :maxdepth: 1 - :titlesonly: - :caption: Table of contents - - intro - setup - example-data - custom-data - genomic-surveillance - running diff --git a/docs/src/tutorial/next-steps.rst b/docs/src/tutorial/next-steps.rst index e4ad3f597..66d64c1aa 100644 --- a/docs/src/tutorial/next-steps.rst +++ b/docs/src/tutorial/next-steps.rst @@ -7,6 +7,8 @@ Congratulations! You have completed all of the tutorials for the ncov workflow. .. contents:: Table of Contents :local: +.. _create-analysis-directory: + Create your own analysis directory ================================== @@ -72,5 +74,5 @@ Additional resources - `Genomic Epidemiology Seminar Series `__ by Chan Zuckerberg Initiative Genomic Epidemiology (CZ GEN EPI) - `COVID-19 Genomic Epidemiology Toolkit `__ by Centers for Disease Control and Prevention (CDC) -- :doc:`Review all possible options to configure your SARS-CoV-2 analyses with Nextstrain <../reference/configuration-reference>`. +- :doc:`Review all possible options to configure your SARS-CoV-2 analyses with Nextstrain <../reference/workflow-config-file>`. - Watch `this 1-hour video overview `__ by Heather Blankenship on how to deploy Nextstrain for a Public Health lab. diff --git a/docs/src/tutorial/running.rst b/docs/src/tutorial/running.rst deleted file mode 100644 index fc0e27c17..000000000 --- a/docs/src/tutorial/running.rst +++ /dev/null @@ -1,138 +0,0 @@ -Running the analysis -==================== - - This section focuses on how to running the basic example build to give you a chance to practice and get a sense of how things work. The next section covers customizing and configuring your own build. - -**To run our analyses, we need to:** 1. 
Ensure our **sequence data and metadata is**\ `properly formatted <../guides/data-prep.md>`__ 2. **Specify which builds you want** to generate using a ``builds.yaml`` file 3. **Execute the workflow** 4. [Hopefully you don’t have to] **troubleshoot** - -Step 1. Gather and format your data ------------------------------------ - -If you haven’t done this step yet, check out our `data prep <../guides/data-prep.md>`__ guide and come back when you’re ready. - -Step 2. Specify which builds to run ------------------------------------ - -In the orientation section, we learned that - `Nextstrain analyses are run using a workflow manager called Snakemake <../reference/orientation-workflow.md>`__ - `A “build” `__ is a bundle of input files, parameters, and commands - `Each build is primarily configured by your ``builds.yaml`` file <../reference/orientation-files.md>`__: ``builds.yaml`` and ``config.yaml`` - -Let’s start with defining a build in ``./my_profiles/example/builds.yaml``. **We use the ``builds.yaml`` file to define what geographic areas of the world we want to focus on. Each block in this file will produce a separate output JSON for visualization**. - -The first block of the provided ``./my_profiles/example/builds.yaml`` file looks like this: - -:: - - builds: - # Focus on King County (location) in Washington State (division) in the USA (country) - # with a build name that will produce the following URL fragment on Nextstrain/auspice: - # /ncov/north-america/usa/washington/king-county - north-america_usa_washington_king-county: # name of the build; this can be anything - subsampling_scheme: location # what subsampling method to use (see parameters.yaml) - region: North America - country: USA - division: Washington - location: King County - # Whatever your lowest geographic area is (here, 'location' since we are doing a county in the USA) - # list 'up' from here the geographic area that location is in. 
- # Here, King County is in Washington state, is in USA, is in North America. - -Looking at this example, we can see that each build has a: - -- ``build_name``, which is used for naming output files -- ``subsampling_scheme``, which specifies how sequences are selected. Default schemes exist for ``region``, ``country``, and ``division``. Custom schemes `can be defined <../reference/customizing-analysis.md>`__. -- ``region``, ``country``, ``division``, ``location``: specify geographic attributes of the sample used for subsampling - -The rest of the builds defined in this file serve as examples for division-, country- or region-focused analyses. To adapt this for your own analyses: - -1. copy ``my_profiles/example`` to ``my_profiles/`` -2. open and modify the ``builds.yaml`` file in this directory to: - - - include your geographic area(s) of interest - - remove any builds that are not relevant to your work - - include the path to your own sequences and metadata instead of the example data - -3. open and modify the ``config.yaml`` file in this directory such that it references: - - - the path to your new custom ``builds.yaml`` instead of the example builds file - -Step 3: Run the workflow ------------------------- - -To actually execute the workflow, run: - -.. code:: bash - - ncov$ snakemake --profile my_profiles/example -p - -``--profile`` tells snakemake where to find your ``builds.yaml`` and ``config.yaml`` files. ``-p`` tells snakemake to print each command it runs to help you understand what it’s doing. - -If you’d like to run a dryrun, try running with the ``-np`` flag, which will execute a dryrun. This prints out each command, but doesn’t execute it. - -Note that the example profile runs the workflow with at most two cores at once, as defined by the ``cores`` parameter in ``my_profiles/example/config.yaml``. Snakemake requires you to specify how many cores to use at once. To define the number of cores to use from the command line, run Snakemake as follows. 
- -.. code:: bash - - ncov$ snakemake --cores 1 --profile my_profiles/example -p - -Step 4: Troubleshoot common issues ----------------------------------- - -If you have a question that is not addressed here, please don’t hestitate to `ask for help `__ - -My country / division does not show up on the map -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -This is most often a result of the country / division not being present in `the file defining the latitude & longitude of each deme `__. Adding it to that file (and rerunning the Snakemake rules downstream of this) should fix this. - -My trait (e.g. division) is grey instead of colored -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -We generate the colors from the ``colors`` rule in the Snakefile, which uses the `ordering TSV `__ to generate these. See `‘customizing your analysis’ <../reference/customizing-analysis.md>`__ for more info. - -*A note about locations and colors:* Unless you want to specifically override the colors generated, it’s usually easier to *add* information to the default ``ncov`` files, so that you can benefit from all the information already in those files. - -My genomes aren’t included in the analysis -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -There are a few steps where sequences can be removed: - -- During the ``filter`` step: - - - Samples that are included in `the exclude file `__ are removed - - Samples that fail the current filtering criteria, as defined in the ``parameters.yaml`` file, are removed. You can modify the snakefile as desired, but currently these are: - - - Minimum sequence length of 25kb - - No ambiguity in (sample collection) date - - - Samples may be randomly removed during subsampling; see `‘customizing your analysis’ <../reference/customizing-analysis.md>`__ for more info. 
- - During the ``refine`` step, where samples that deviate more than 4 interquartile ranges from the root-to-tip vs time are removed - -Sequencing and alignment errors -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Genome sequencing, bioinformatic processing of the raw data, and alignment of the sequences are all steps were errors can slip in. Such errors can distort the phylogenetic analysis. To avoid sequences with known problems to mess up the analysis, we keep a list of problematic sequences in ``config/exclude.txt`` and filter them out. To facilitate spotting such problematic sequences, we added an additional quality control step that produces the files - -- ``results/sequence-diagnostics.tsv`` -- ``results/flagged-sequences.tsv`` -- ``results/to-exclude.txt`` - -These files are the output of ``scripts/diagnostics.py`` and are produced by rule ``diagnostic``. The first file contains statistics for every sequence in the aligment, sorted by divergence worst highest to lowest. The second file contains only those sequences with diagnostics exceeding thresholds each with their specific reason for flagging – these are sorted by submission date (newest to oldest). The third file contains only the names of the flagged sequences and mirrors the format of ``config/exclude.txt``. These names could be added to ``config/exclude.txt`` for permanent exclusion. Note, however, that some sequences might look problematic due to alignment issues rather than intrinsic problems with the sequence. The flagged sequences will be excluded from the current run. - -To only run the sequence diagnostic, you can specify any of the three above files as target or run: - -.. code:: bash - - snakemake --profile my_profiles/ diagnostic - -In addition, we provide rules to re-examine the sequences in ``config/exclude.txt``. By running - -.. 
code:: bash - - snakemake --profile my_profiles/ diagnose_excluded - -the pipeline will produce - -- ``results/excluded-sequence-diagnostics.tsv`` -- ``results/excluded-flagged-sequences.tsv`` -- ``results/check-exclusion.txt`` - -These files are meant to facilitate checking whether sequences in ``config/exclude.txt`` are excluded for valid reasons. diff --git a/docs/src/tutorial/setup.rst b/docs/src/tutorial/setup.rst index fa0cea704..38c31d54c 100644 --- a/docs/src/tutorial/setup.rst +++ b/docs/src/tutorial/setup.rst @@ -3,6 +3,9 @@ Setup and installation The following steps will prepare you to run complete analyses of SARS-CoV-2 data by installing required software and running a simple example workflow. +.. contents:: Table of Contents + :local: + 1. Install Nextstrain components -------------------------------- @@ -27,66 +30,3 @@ Use Git to download a copy of the ncov repository containing the workflow and th .. code:: bash git clone https://github.com/nextstrain/ncov.git - cd ncov - -Alternately, `download a compressed copy of the ncov repository `__ called ``ncov-master.zip``. Open this file to decompress it and create a directory called ``ncov-master/`` with the contents of the workflow in it. Navigate to this directory from the command line. - -Update the workflow -~~~~~~~~~~~~~~~~~~~ - -We update the official workflow regularly with: - -- `curated metadata including latitudes/longitudes, clade annotations, and low quality sequences `__ -- bug fixes -- `new features <../reference/change_log>`__ - -Update your local copy of the workflow, to benefit from these changes. - -.. code:: bash - - # Download and apply changes from the Nextstrain team. - # This only works if there is no conflict with your local repository. - git pull --ff-only origin master - - # OR: - - # Alternately, download and apply changes from the Nextstrain team - # and then replay your local changes on top of those incoming changes. 
-   git pull --rebase origin master
-
-Alternately, download a specific version of the workflow that you know works for you. We create new `releases of the workflow `__ any time we introduce breaking changes, so you can choose when to update based on `what has changed <../reference/change_log>`__.
-
-.. code:: bash
-
-   # Download version 7 (v7) of the workflow.
-   curl -OL https://github.com/nextstrain/ncov/archive/refs/tags/v7.zip
-
-   # Uncompress the workflow.
-   unzip v7.zip
-
-   # Change into the workflow's directory.
-   cd ncov-7/
-
-3. Run a basic analysis with example data
------------------------------------------
-
-Run a basic workflow with example data, to confirm that your :term:`Nextstrain runtime ` is properly configured.
-
-.. code:: bash
-
-   nextstrain build . --cores 4 \
-     --configfile ./my_profiles/getting_started/builds.yaml
-
-The ``nextstrain build`` command runs a :term:`pathogen workflow ` defined using Snakemake. Since our ``Snakefile`` is in the current directory, we specify the directory as ``.``. All other arguments pass through to Snakemake.
-
-The ``getting_started`` build produces a minimal global phylogeny for visualization in :term:`docs.nextstrain.org:Auspice`. This workflow should complete in about 5 minutes on a MacBook Pro (2.7 GHz Intel Core i5) with four cores.
-
-4. Visualize the phylogeny for example data
--------------------------------------------
-
-`Open auspice.us `__ in your browser. Drag and drop the :term:`JSON file ` ``auspice/ncov_global.json`` anywhere on the landing page, to visualize the resulting phylogeny. The resulting phylogeny should look something like this:
-
-.. figure:: ../images/getting-started-tree.png
-   :alt: Phylogenetic tree from the “getting started” build as visualized in Auspice
-
-   Phylogenetic tree from the “getting started” build as visualized in Auspice
diff --git a/docs/src/visualization/index.rst b/docs/src/visualization/index.rst
deleted file mode 100644
index 033489afb..000000000
--- a/docs/src/visualization/index.rst
+++ /dev/null
@@ -1,14 +0,0 @@
-***************************************************
-Getting started with visualization & interpretation
-***************************************************
-
-The starting point for this section is a JSON file. You can alternately use our examples to start.
-
-.. toctree::
-   :maxdepth: 1
-   :titlesonly:
-   :caption: Table of contents
-
-   sharing
-   interpretation
-   narratives
diff --git a/docs/src/visualization/interpretation.rst b/docs/src/visualization/interpretation.rst
index cf16eacba..3eea237ae 100644
--- a/docs/src/visualization/interpretation.rst
+++ b/docs/src/visualization/interpretation.rst
@@ -8,7 +8,7 @@ Introductory resources
 
 - Introduction to interpreting phylogenetic trees: https://nextstrain.org/narratives/trees-background/
 
-- How to interact with auspice (the engine for viewing trees): https://neherlab.org/201901_krisp_auspice.html
+- How to interact with Auspice (the engine for viewing trees): https://neherlab.org/201901_krisp_auspice.html
 
 - Overview of genomic epidemiology (older, but still relevant and clear): http://evolve.zoo.ox.ac.uk/Evolve/Oliver_Pybus_files/EvolAnalysisOfDynamicsOfViruses.pdf
@@ -19,7 +19,7 @@ Case Studies
 
 - UK analysis of hospital-acquired infections: https://www.medrxiv.org/content/10.1101/2020.05.08.20095687v1
 
-- UK’s analysis of coronavirus introductions: https://virological.org/t/preliminary-analysis-of-sars-cov-2-importation-establishment-of-uk-transmission-lineages/507
+- UK's analysis of coronavirus introductions: https://virological.org/t/preliminary-analysis-of-sars-cov-2-importation-establishment-of-uk-transmission-lineages/507
 
 - Australia cluster detection: https://www.medrxiv.org/content/10.1101/2020.05.12.20099929v1
diff --git a/docs/src/visualization/narratives.rst b/docs/src/visualization/narratives.rst
index 33956aee9..5e65618bd 100644
--- a/docs/src/visualization/narratives.rst
+++ b/docs/src/visualization/narratives.rst
@@ -5,7 +5,8 @@ Nextstrain Narratives allow you to pair a specific view of a dataset with text a
 
 For examples, `see our weekly Situation Reports `__ from the first several months of the pandemic.
 
-| You can `read more about Narratives here `__. We’ve also `provided a template narrative file `__ for you to edit.
-| You can preview the template narrative by navigating to https://nextstrain.org/community/narratives/nextstrain/ncov/template_narrative.
+You can `read more about Narratives here `__. We've also `provided a template narrative file `__ for you to edit.
 
-We’ll add more to this page soon; in the meantime, if you get stuck, don’t hesitate to `ask for help `__! :)
+You can preview the template narrative by navigating to https://nextstrain.org/community/narratives/nextstrain/ncov/template_narrative.
+
+We'll add more to this page soon; in the meantime, if you get stuck, don't hesitate to `ask for help `__! :)
diff --git a/docs/src/visualization/sharing.rst b/docs/src/visualization/sharing.rst
index b24db13ab..29bd1856c 100644
--- a/docs/src/visualization/sharing.rst
+++ b/docs/src/visualization/sharing.rst
@@ -3,7 +3,7 @@
 
 `Nextstrain.org `__ uses Auspice to visualize JSON files that are created by Augur. While this is the most visible example, there are many other ways to use Auspice to visualize your results.
 
-We’ll walk through each of these in detail, roughly in order from the simplest and most private to the slightly more complex and publicly shareable.
+We'll walk through each of these in detail, roughly in order from the simplest and most private to the slightly more complex and publicly shareable.
 
 If none of these options meet your needs, please `get in touch `__!
 
@@ -25,7 +25,7 @@ How to view visualize private metadata
 2. Add your sensitive metadata to the remaining columns
 3. On your computer, drag and drop the file onto the browser window where Auspice is visualizing your JSON
 
-*For more help formatting this metadata file, including how to do so using Excel,*\ `see this page <../guides/data-prep.md>`__
+*For more help formatting this metadata file, including how to do so using Excel,* :doc:`see this page <../guides/data-prep/index>`.
 
 --------------
 
@@ -34,7 +34,7 @@ Option 1: Drag-and-drop web-based visualization
 
 - **Quickstart**: Drag-and-drop the file from ``./auspice/sarscov2_global.json`` onto the page at https://auspice.us.
 - **Advantages:** Quick, no-setup viewing of results, including sensitive data.
-- **Limitations:** Requires separate management of JSON file sharing and version control. Sharing a specific view via a URL isn’t possible with this method.
+- **Limitations:** Requires separate management of JSON file sharing and version control. Sharing a specific view via a URL isn't possible with this method.
 
 How to view
 ^^^^^^^^^^^
@@ -56,16 +56,18 @@ When your browser connects to auspice.us, it downloads from the server a version
 Option 2: Nextstrain community pages
 ------------------------------------
 
-- **Example:** `CZBiohub’s California COVID Tracker `__
+- **Example:** `CZBiohub's California COVID Tracker `__
 - **Advantages:** Fully featured, plug-and-play visualization of any JSON file hosted on Github.
 - **Limitations:** Only available for publicly viewable JSON files in public repositories.
 
 How to get started
 ^^^^^^^^^^^^^^^^^^
 
-| Quickstart:
-| 1. Put your JSON in a github repository like so: ``myGithubOrganization/myRepository/auspice/.json``
-| 2. Navigate to ``https://nextstrain.org/community/myGithubOrganization/myRepository/myBuildName`` 3. [Optional] Drag and drop a TSV with additional or private metadata onto the page (see above)
+Quickstart:
+
+1. Put your JSON in a github repository like so: ``myGithubOrganization/myRepository/auspice/.json``
+2. Navigate to ``https://nextstrain.org/community/myGithubOrganization/myRepository/myBuildName``
+3. [Optional] Drag and drop a TSV with additional or private metadata onto the page (see above)
 
 Check out our `full guide to community pages here `__.
@@ -81,7 +83,7 @@ Option 3: Local viewing on your computer with Auspice
 
 - **Quickstart**: ``ncov$ auspice view``
 - **Advantages:** Offline, entirely local viewing of results, including sensitive data.
-- **Limitations:** Requires collaborators to install Auspice locally. Requires separate management of JSON file sharing and version control. Sharing a specific view via a URL isn’t possible with this method.
+- **Limitations:** Requires collaborators to install Auspice locally. Requires separate management of JSON file sharing and version control. Sharing a specific view via a URL isn't possible with this method.
 
 .. _how-to-view-1:
 
@@ -89,7 +91,7 @@ How to view
 ^^^^^^^^^^^
 
 1. Follow the instructions `here `__ to install Auspice on your computer.
-2. Make sure the JSON you’d like to visualize is in ``./auspice/.json``; alternatively, pass the ``--datasetDir`` flag to specify another directory.
+2. Make sure the JSON you'd like to visualize is in ``./auspice/.json``; alternatively, pass the ``--datasetDir`` flag to specify another directory.
 3. Run ``auspice view`` and select the build of interest.
 4. [Optional] drag and drop a TSV with additional or private metadata onto the page (see above)
@@ -112,14 +114,14 @@ Option 4: Sharing with Nextstrain Groups
 
 - **Example:** https://nextstrain.org/groups/blab/
 - **Advantages:** Web-based viewing of results with full authentication / login controls; accommodates both public and private datasets. Sharing a specific view via URL is possible with this method.
-- **Limitations:** Setup is slightly more involved, but we’re ready to help!
+- **Limitations:** Setup is slightly more involved, but we're ready to help!
 
 .. _how-to-get-started-1:
 
 How to get started
 ^^^^^^^^^^^^^^^^^^
 
-Nextstrain Groups are a new feature; if you’d like to use this option, please `get in touch `__ and we’ll help you get started right away!
+Nextstrain Groups are a new feature; if you'd like to use this option, please `get in touch `__ and we'll help you get started right away!
 
 .. _privacy-and-security-3:
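The community-page convention that this patch documents (dataset JSON committed at ``auspice/<build>.json``, served at ``https://nextstrain.org/community/<org>/<repo>/<build>``) can be sketched in shell. This is a minimal illustration, not part of the patch; the organization, repository, and build names are placeholders, not real accounts:

```shell
# Placeholders for illustration only -- substitute your own names.
org="myGithubOrganization"
repo="myRepository"
build="myBuildName"

# The dataset JSON must be committed at auspice/<build>.json in the repository.
json_path="auspice/${build}.json"

# nextstrain.org then serves the dataset at /community/<org>/<repo>/<build>.
url="https://nextstrain.org/community/${org}/${repo}/${build}"

echo "${json_path}"
echo "${url}"
```

This mirrors the quickstart steps in the sharing page: commit the JSON at that path, push to GitHub, then open the constructed URL.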