diff --git a/src/index.rst b/src/index.rst index a2e5d160..fb708f02 100644 --- a/src/index.rst +++ b/src/index.rst @@ -58,6 +58,8 @@ team and other Nextstrain users provide assistance. For private inquiries, tutorials/creating-a-bacterial-phylogenetic-workflow tutorials/narratives-how-to-write Analyzing genomes with Nextclade + tutorials/using-a-pathogen-repo/index + tutorials/creating-a-pathogen-repo/index .. toctree:: :maxdepth: 1 diff --git a/src/tutorials/creating-a-pathogen-repo/creating-an-ingest-workflow.rst b/src/tutorials/creating-a-pathogen-repo/creating-an-ingest-workflow.rst new file mode 100644 index 00000000..8b6ca5c8 --- /dev/null +++ b/src/tutorials/creating-a-pathogen-repo/creating-an-ingest-workflow.rst @@ -0,0 +1,470 @@ +=========================== +Creating an ingest workflow +=========================== + +This tutorial dissects the `ingest workflow of the pathogen-repo-guide `_ +and the decisions needed to create an ingest workflow for a new pathogen. + +.. note:: + + You only need to create an ingest workflow if you do **not** want to use an existing pathogen ingest workflow maintained by Nextstrain. + +.. contents:: Table of Contents + :local: + :depth: 2 + +Prerequisites +============= + +1. :doc:`Install Nextstrain `. +2. Run through the :doc:`../using-a-pathogen-repo/running-an-ingest-workflow` tutorial. This will verify your installation and ensure that you are able to run an ingest workflow. + +Additionally, to follow this tutorial, you will need: + +* An understanding of `Snakemake `_ workflows. +* Pathogen-specific knowledge (e.g. WHO naming scheme) to help with decisions on how to set up the ingest workflow + +Setup +===== + +The Nextstrain `pathogen-repo-guide `_ can be used for setting up a +pathogen repository to hold the files necessary to run and maintain pathogen workflows. +This tutorial will only focus on using the guide to set up the ingest workflow. + +1. Go to the Nextstrain `pathogen-repo-guide repository `_ +2. 
Follow the `GitHub guide for creating a repository from a template `_. +3. Follow the `GitHub guide to download the new repository `_. +4. Change directory to your new pathogen repository + +.. code-block:: + + cd + +Decide on data source +===================== + +The first step for creating an ingest workflow is to decide on the data source for your pathogen's data. +The pathogen-repo-guide only focuses on downloading public data from `NCBI `_, +using the rules defined in ``ingest/rules/fetch_from_ncbi.smk``. + +.. note:: + + If your pathogen does not have sequences on NCBI, then you will need to explore other data sources that are not + covered in this tutorial. + +NCBI Datasets +------------- + +By default, the pathogen-repo-guide is set to use the `NCBI Datasets CLI `_ +tool to download viral sequences using a provided `NCBI taxonomy `_ ID. +This is the simplest route for setting up an ingest workflow, but it is limited to a +`standard set of fields `_ +that is parsed by NCBI Datasets. + +You can decide whether NCBI Datasets include sufficient data for your pathogen by inspecting the uncurated data from NCBI Datasets CLI. + +1. Add your pathogen's NCBI taxonomy ID to the ``ncbi_taxon_id`` parameter in the ``ingest/defaults/config.yaml`` config file. +2. Dump the uncurated metadata by running + +.. code-block:: + + nextstrain build ingest dump_ncbi_dataset_report + +3. Inspect the generated file ``ingest/data/ncbi_dataset_report_raw.tsv`` +4. If there are other fields in the raw file that you would like to include in the workflow, + you can add them to the ``ncbi_datasets_fields`` parameter + +If the data looks sufficient for your pathogen, then skip to the :ref:`curation-steps`. + +NCBI Entrez +----------- + +If your pathogen requires data from other fields not parsed by NCBI Datasets, then you will need to use +the `NCBI Entrez `_ tool to download all available data in a GenBank file. + +1. 
Add an Entrez search term to the ``entrez_search_term`` parameter in the ``ingest/defaults/config.yaml`` config file.

2. Create a custom script to parse the GenBank file into a flat `JSON Lines/NDJSON format `_.
   (We may provide an example script in the future, but this is currently not available.)

3. Edit the ``parse_genbank_to_ndjson`` rule in ``ingest/rules/fetch_from_ncbi.smk`` to use the custom script.

4. Switch the `Snakemake ruleorder `_
   within the ``ingest/rules/fetch_from_ncbi.smk`` file so that the workflow prefers the ``parse_genbank_to_ndjson`` rule.

.. code-block::

    ruleorder: format_ncbi_datasets_ndjson < parse_genbank_to_ndjson

5. Make sure the ``field_map`` parameters in the config file use the field names of your custom NDJSON output.

.. _curation-steps:

Curation steps
==============

After the public data is downloaded, the next part of the workflow runs a pipeline of data curation commands and scripts
to format the metadata and sequences.

The long-term goal is to build out the :doc:`augur curate `
suite of commands to include all of the custom curation steps.
For now, we've bundled custom scripts into the `ingest repository `_, which is then
vendored into the pathogen-repo-guide using `git-subrepo `_.
Please do not edit the vendored scripts in ``ingest/vendored`` directly.
If you run into issues or encounter bugs with the vendored scripts, please `make an issue in the ingest repository `_.
Once the bug has been fixed in the original source code, you can follow the `instructions to update the vendored scripts `_.

We highly encourage you to read through the commands and custom scripts used in the ``curate`` rule within ``ingest/rules/curate.smk``
to gain a deeper understanding of how they work.
We will give a brief overview of each step and its relevant config parameters defined in ``ingest/defaults/config.yaml`` to help you get started.
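Every step in the curation pipeline follows the same contract: read NDJSON records from stdin, modify each record, and write NDJSON to stdout, so the steps can be chained together with shell pipes. As a rough sketch of that contract (the ``location`` field used here is only an example, not a required field):

```python
import json
import sys

def curate_record(record):
    """One curation step's transformation. As an example only, strip
    surrounding whitespace from a hypothetical 'location' field."""
    if "location" in record:
        record["location"] = record["location"].strip()
    return record

def run_step(stdin=sys.stdin, stdout=sys.stdout):
    """Read NDJSON records from stdin, curate each one, and write NDJSON
    to stdout so this step can be piped into the next curate command."""
    for line in stdin:
        json.dump(curate_record(json.loads(line)), stdout)
        stdout.write("\n")
```

Because every command and script in the pipeline shares this shape, the order of the steps below can be rearranged or extended without changing the surrounding workflow.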
Transform field names
---------------------

The ``ingest/vendored/transform-field-names`` script will rename the fields in the NDJSON records.

.. note::

   This is the first step of the pipeline, so any subsequent references to field names should use the new field names.

Config parameters
~~~~~~~~~~~~~~~~~

* ``curate.field_map``

  * A dictionary where the key is the original field name and the value is the new field name

  * The default dictionary maps the original field names from NCBI Datasets to the standard Nextstrain metadata fields.

Normalize strings
-----------------

The :doc:`augur curate normalize-strings ` command will normalize string
values in the NDJSON records for predictable string comparisons.
Currently, there are no config parameters for this command.

Transform strain names
----------------------

The ``ingest/vendored/transform-strain-names`` script will verify that the ``strain`` field values match an expected pattern.

Config parameters
~~~~~~~~~~~~~~~~~

* ``curate.strain_regex``

  * `Python regular expression `_ pattern the strain names must match

  * The default pattern (``^.+$``) accepts any non-empty string because we do not have a clear standard for strain names across pathogens

* ``curate.strain_backup_fields``

  * List of other NDJSON fields to use as the strain name if the ``strain`` value fails to match the expected pattern

  * The default list uses the GenBank ``accession`` field as a stable backup field for messy strain fields.

Format dates
------------

The :doc:`augur curate format-dates ` command will format date fields to
`ISO 8601 dates `_ (YYYY-MM-DD), where incomplete dates are masked with 'XX' (e.g. 2023 -> 2023-XX-XX).
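To make the masking behavior concrete, here is a minimal sketch of the padding logic. This is an illustration only, not augur's actual implementation, which also parses the configured input date formats before masking:

```python
def mask_incomplete_date(date_string):
    """Pad an incomplete date out to the ISO 8601 YYYY-MM-DD shape,
    masking missing parts with 'XX' (e.g. '2023' -> '2023-XX-XX').

    A sketch of the masking behavior only; augur curate format-dates
    also handles the configured expected date formats.
    """
    parts = date_string.split("-")
    # Pad the missing month and/or day components with the 'XX' mask.
    while len(parts) < 3:
        parts.append("XX")
    return "-".join(parts)
```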
Config parameters
~~~~~~~~~~~~~~~~~

* ``curate.date_fields``

  * List of NDJSON date fields to be formatted

  * The default list includes the standard date fields that are expected from NCBI records

* ``curate.expected_date_formats``

  * List of expected date formats in the provided date fields

  * The default list includes the date formats that are expected from NCBI records

Transform GenBank location
--------------------------

The ``ingest/vendored/transform-genbank-location`` script will try to parse locations in NDJSON records according to
the `GenBank country qualifier `_.
It parses the ``location`` field into three fields:

* ``country``
* ``division``
* ``location``

Currently, there are no config parameters for this script.

Titlecase
---------

The :doc:`augur curate titlecase ` command will make the first letter of every word
uppercase in the provided string fields.

Config parameters
~~~~~~~~~~~~~~~~~

* ``curate.titlecase.fields``

  * List of NDJSON fields to titlecase

  * The default list includes all of the geolocation fields from NCBI records (after running ``transform-genbank-location``)

* ``curate.titlecase.abbreviations``

  * List of strings to keep as all uppercase

  * The default list includes the country ``USA`` as an example

* ``curate.titlecase.articles``

  * List of strings to keep as all lowercase

  * The default list includes articles (e.g., 'and', 'the', 'of', etc.) that we've encountered in past ingest pipelines

Transform authors
-----------------

The ``ingest/vendored/transform-authors`` script will abbreviate the authors list in the NDJSON records to `` et al.``.
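The abbreviation can be sketched as follows. This is an approximation of the behavior, not the vendored script itself, and the comma-and-space author separator is an assumption based on the NCBI author list format:

```python
import re

def abbreviate_authors(authors, default_value="?"):
    """Keep the first author and replace the rest with 'et al.'.

    A sketch of the behavior, not the vendored transform-authors script.
    Assumes authors are separated by ', ' or ';' (NCBI-style lists like
    'Smith,J.B., Jones,A.').
    """
    if not authors or not authors.strip():
        # Mirrors the curate.authors_default_value config parameter.
        return default_value
    names = re.split(r",\s+|;\s*", authors.strip())
    if len(names) == 1:
        return names[0]
    return f"{names[0]} et al."
```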
Config parameters
~~~~~~~~~~~~~~~~~

* ``curate.authors_field``

  * The NDJSON field that contains the authors list

  * The default value uses the field expected from NCBI records

* ``curate.authors_default_value``

  * The default string to use if the authors list is empty

  * The default value ``?`` will allow you to easily filter for records without authors.

* ``curate.abbr_authors_field``

  * The field name to use for the new abbreviated authors field.

  * If none is provided, the original authors field will be replaced with the abbreviated authors.

  * The default field is ``abbr_authors`` so you can compare the original and abbreviated author values.

Apply geolocation rules
-----------------------

The ``ingest/vendored/apply-geolocation-rules`` script will apply geolocation standardizations across all records.

Config parameters
~~~~~~~~~~~~~~~~~

* ``curate.geolocation_rules_url``

  * The URL for a public set of geolocation rules

  * The default URL points to the `Nextstrain ncov-ingest geolocation rules `_, which is currently the most complete set of geolocation rules.

* ``curate.local_geolocation_rules``

  * A path to a local set of geolocation rules used to override the general rules

  * The default points to the empty file ``ingest/defaults/geolocation_rules.tsv`` where you can add your pathogen-specific rules

Geolocation rules
~~~~~~~~~~~~~~~~~

Geolocation rules are defined in a TSV file with the format:

.. code-block::

    region/country/division/location<\t>region/country/division/location

The first set of locations are the expected geolocations that are in the metadata, and the second set of geolocations
after the tab are the standard geolocations that will be applied to the metadata.
Each geo resolution (region, country, division, location) is expected to be a field in the NDJSON.
By using the region/country/division/location hierarchy, we ensure that locations with the same name
(e.g., two cities with the same name but in different countries) are treated differently based on their full hierarchy.
If there are rules that can be applied across multiple locations, then a wildcard (``*``) can be used instead of a specific value.

Let's say you have the following locations in your NDJSON:

.. code-block::

    {"region": "North America", "country": "United States", "division": "New York", "location": "Buffalo"}
    {"region": "North America", "country": "United States", "division": "New York", "location": "New York"}

And you provide these geolocation rules:

.. code-block::

    North America/United States/New York/New York    North America/United States/New York/New York City
    North America/United States/New York/*    North America/United States/New York State/*
    North America/United States/*/*    North America/USA/*/*

The first rule looks for the specific hierarchy to correct the location from "New York" to "New York City".
The second rule has a wildcard as the location, so it will correct all applicable divisions from "New York" to "New York State".
The third rule has wildcards for both division and location, so it will correct all applicable countries from "United States" to "USA".

Running through the ``ingest/vendored/apply-geolocation-rules`` script should produce the following:

.. code-block::

    {"region": "North America", "country": "USA", "division": "New York State", "location": "Buffalo"}
    {"region": "North America", "country": "USA", "division": "New York State", "location": "New York City"}

Merge user metadata
-------------------

The ``ingest/vendored/merge-user-metadata`` script merges user-curated annotations with the NDJSON records,
with the user curations overwriting the existing fields.
Config parameters
~~~~~~~~~~~~~~~~~

* ``curate.annotations``

  * A path to a file of user annotations
  * The default points to the empty file ``ingest/defaults/annotations.tsv`` where you can add your pathogen-specific annotations

* ``curate.annotations_id``

  * The NDJSON field that has the ID used to match records to annotations
  * The default value uses the GenBank ``accession`` since they are guaranteed to be unique

User annotations
~~~~~~~~~~~~~~~~

The user annotations are defined in a TSV file with the format:

.. code-block::

    id<\t>field<\t>value

The ``id`` is used to match the NDJSON records.
The ``field`` is the field you are trying to overwrite or add to the NDJSON record.
The ``value`` is the value you are trying to add to the NDJSON record.

Let's say you have the following NDJSON records:

.. code-block::

    {"accession": "AAAAA", "country": "United States", "division": "New York", "location": "Buffalo"}
    {"accession": "BBBBB", "country": "United States", "division": "New York", "location": "Buffalo"}

And you provide these user annotations:

.. code-block::

    AAAAA    age    10
    BBBBB    age    12
    BBBBB    location    Niagara Falls

The first two annotations add the ``age`` field to the records and the
third annotation overwrites the existing ``location`` field for the record ``BBBBB``.

Running through the ``ingest/vendored/merge-user-metadata`` script should produce the following:

.. code-block::

    {"accession": "AAAAA", "country": "United States", "division": "New York", "location": "Buffalo", "age": 10}
    {"accession": "BBBBB", "country": "United States", "division": "New York", "location": "Niagara Falls", "age": 12}

Passthru
--------

The :doc:`augur curate passthru ` command is used to split the NDJSON records into the
metadata TSV and sequences FASTA files.
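Conceptually, the split looks like the following sketch. This is an illustration of what the step produces, not the augur implementation; ``accession`` and ``sequence`` are the default ID and sequence field names from the config:

```python
import json

def split_records(ndjson_lines, id_field="accession", sequence_field="sequence"):
    """Split NDJSON records into metadata rows (without the sequence)
    and FASTA entries. A sketch of the behavior, not augur's code."""
    metadata_rows = []
    fasta_entries = []
    for line in ndjson_lines:
        record = json.loads(line)
        # Pull the sequence out of the record so the metadata stays flat.
        sequence = record.pop(sequence_field, "")
        metadata_rows.append(record)
        fasta_entries.append(f">{record[id_field]}\n{sequence}")
    return metadata_rows, fasta_entries
```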
Config parameters
~~~~~~~~~~~~~~~~~

* ``curate.output_id_field``

  * The NDJSON field to use as the sequence identifiers in the FASTA file
  * The default value uses the GenBank ``accession`` since they are guaranteed to be unique

* ``curate.output_sequence_field``

  * The NDJSON field that contains the genomic sequence
  * The default value uses ``sequence``, which is the field name we use for NCBI Datasets.

Subset metadata
---------------

Finally, we use the `tsv-select `_ command
to subset the metadata to a list of metadata columns.

Config parameters
~~~~~~~~~~~~~~~~~

* ``curate.metadata_columns``

  * A list of metadata columns to include in the final output metadata TSV
  * The columns will be output in the order specified

Advanced usage
==============

The default ingest workflow of the pathogen-repo-guide is generalized to work with any pathogen,
which means you will need to tailor the ingest workflow for pathogen-specific steps.

Add custom curation steps
-------------------------

The curation pipeline is designed to be highly customizable, with each curation step reading NDJSON records
from stdin and outputting modified NDJSON records to stdout.
If you write a custom script that follows the same pattern, you can add your script as another step anywhere in the
curation pipeline before the final ``augur curate passthru`` command.

A typical pathogen-specific curation step is the standardization of strain names, since pathogens usually have different naming conventions
(e.g. `influenza `_ vs `measles `_).
For example, we've added a step in the curation pipeline to normalize the strain names for the `Zika ingest workflow `_.

1. We added a `custom Python script `_
   to the Zika repository which reads NDJSON records from stdin, edits the ``strain`` field per record, then outputs the modified records to stdout.

2.
The script was `added to the curation pipeline `_ +before the ``ingest/vendored/merge-user-metadata`` step to still allow user annotations to override the modified strain names if necessary. + +Nextclade as part of ingest +--------------------------- + +Nextstrain is pushing to standardize our core ingest workflows to include :doc:`Nextclade ` runs, +which allows us to merge clade/lineage designations and QC metrics with the metadata in our publicly hosted data. +However, this is not possible until you have already created a :doc:`Nextclade dataset ` for your pathogen. + +Here's our typical process for adding Nextclade to ingest workflows for new pathogens + +1. Create an ingest workflow without Nextclade. +2. Run the ingest workflow to generate a set of curated metadata and sequences. +3. Use the curated metadata and sequences as input to generate a :doc:`reference tree `. +4. Create a Nextclade dataset by following the `Nextclade dataset creation guide `_. +5. Update the ingest workflow to run Nextclade using the new Nextclade dataset. + +If your pathogen already has a Nextclade dataset, you can use the pathogen-repo-guide's ``ingest/defaults/nextclade_config.yaml`` +config file to include the Nextclade rules from ``ingest/rules/nextclade.smk`` as part of the ingest workflow. + +1. Add your Nextclade dataset name to the ``nextclade.dataset_name`` parameter +2. Run the ingest workflow with the additional config file + +.. code-block:: + + nextstrain build ingest --configfile defaults/nextclade_config.yaml + +Example ingest workflows +======================== + +Although we strive to keep Nextstrain core ingest workflows standardized, we cannot guarantee that every pathogen +ingest workflow will be kept up-to-date. + +We recommend using the `zika ingest workflow `_ and the +`mpox ingest workflow `_ as example workflows that +demonstrate our latest developments. 
Next steps
==========

* Learn more about :doc:`augur curate commands `
* We are planning to write another detailed tutorial for creating a phylogenetic workflow,
  but until that is ready you can follow the :doc:`simple phylogenetic workflow tutorial <../creating-a-phylogenetic-workflow>`.

diff --git a/src/tutorials/creating-a-pathogen-repo/index.rst b/src/tutorials/creating-a-pathogen-repo/index.rst
new file mode 100644
index 00000000..f522e00a
--- /dev/null
+++ b/src/tutorials/creating-a-pathogen-repo/index.rst
@@ -0,0 +1,24 @@
==============================
Creating a pathogen repository
==============================

This series of tutorials explains how to create a Nextstrain :term:`pathogen repository`.
These tutorials explore creating workflows in Nextstrain pathogen repositories
that are more complex than the `example Zika workflow `_
covered in the :doc:`earlier tutorial <../creating-a-phylogenetic-workflow>`.

.. note::

   You only need to create a new pathogen repository if one does not already exist or
   you do not want to use an existing Nextstrain pathogen repository.

The tutorials use the Nextstrain `pathogen-repo-guide `_
as the starting point for creating a pathogen repository. Each one covers an individual :term:`workflow`
within the repository.

.. toctree::
   :maxdepth: 2
   :titlesonly:
   :caption: Table of contents

   creating-an-ingest-workflow

diff --git a/src/tutorials/using-a-pathogen-repo/index.rst b/src/tutorials/using-a-pathogen-repo/index.rst
new file mode 100644
index 00000000..c632109b
--- /dev/null
+++ b/src/tutorials/using-a-pathogen-repo/index.rst
@@ -0,0 +1,20 @@
===========================
Using a pathogen repository
===========================

This series of tutorials explains how to use existing Nextstrain
:term:`pathogen repositories `.
These tutorials explore
running workflows in Nextstrain pathogen repositories that are more complex than
the `example Zika workflow `_
covered in the :doc:`earlier tutorial <../running-a-phylogenetic-workflow>`.

The tutorials use the Nextstrain `Zika repository `_ as
an example pathogen repository, and each one covers an individual :term:`workflow`
within the repository.

.. toctree::
   :maxdepth: 2
   :titlesonly:
   :caption: Table of contents

   running-an-ingest-workflow

diff --git a/src/tutorials/using-a-pathogen-repo/running-an-ingest-workflow.rst b/src/tutorials/using-a-pathogen-repo/running-an-ingest-workflow.rst
new file mode 100644
index 00000000..1e6196d2
--- /dev/null
+++ b/src/tutorials/using-a-pathogen-repo/running-an-ingest-workflow.rst
@@ -0,0 +1,272 @@
==========================
Running an ingest workflow
==========================

This tutorial uses the :term:`Nextstrain CLI` to help you get started running :term:`ingest workflows`.
Ingest workflows download public data from NCBI and output :term:`ingest datasets `, which include curated
metadata and sequences that can be used as input for :term:`phylogenetic ` or :term:`Nextclade workflows `.

.. note::

   You only need to run an ingest workflow if you do not want to use the data files already publicly hosted by Nextstrain.
   Individual pathogen repositories include documentation that links to their data files.

In this tutorial, you will run the ingest workflow of our `Zika repository `_ and view its outputs on your computer.
By the end, you will have a basic understanding of how to run ingest workflows for other pathogens and a foundation for understanding how to customize ingest workflows.

.. contents:: Table of Contents
   :local:

Prerequisites
=============

1. :doc:`Install Nextstrain `. These instructions will install all of the software you need to complete this tutorial.
Download the Zika repository
============================

All pathogen ingest workflows are stored in :term:`pathogen repositories` (version-controlled folders) to track changes over time.
Download the `Zika repository `_.

.. code-block::

    $ git clone https://github.com/nextstrain/zika
    Cloning into 'zika'...
    [...more output...]

When it's done, you'll have a new directory called ``zika/``.

Run the default workflow
========================

The Zika :term:`ingest workflow` uses the `NCBI Datasets CLI tools `_
to download public data and a combination of :doc:`augur curate ` and other data manipulation tools to curate
the downloaded data into a format suitable for :term:`phylogenetic workflows `.

1. Change directory to the Zika pathogen repository downloaded in the previous step.

.. code-block::

    $ cd zika

2. Run the default ingest workflow with the :term:`Nextstrain CLI`.

.. code-block::

    $ nextstrain build ingest
    Using profile profiles/default and workflow specific profile profiles/default for setting default command line arguments.
    Building DAG of jobs…
    [...a lot of output...]

This should take just a few minutes to complete.
There should be two final output files:

* ``ingest/results/metadata.tsv``
* ``ingest/results/sequences.fasta``

The output files should have the same data formats as the public data files hosted by Nextstrain, available at:

* https://data.nextstrain.org/files/workflows/zika/metadata.tsv.zst
* https://data.nextstrain.org/files/workflows/zika/sequences.fasta.zst

Your results may have additional records depending on whether new data has been released since the public data files were last uploaded.


Configuring the ingest workflow
===============================

Now that you've seen the default outputs of the ingest workflow, you can try configuring it to change the outputs.
+ +Inspecting the uncurated metadata +--------------------------------- + +If you want to see the uncurated NCBI Datasets data to decide what changes you would like to make to the workflow, +you can download the uncurated NCBI data. + +.. hint:: + + These commands are very similar to the commands run by the ingest workflow with some minor differences. + The ingest workflow restricts the columns to those defined in ``config["ncbi_datasets_fields"]`` + and keeps the header names as the more computer friendly "Mnemonic" of the + `NCBI Datasets' available fields `_. + +1. Enter an interactive Nextstrain shell to be able to run the NCBI Datasets CLI commands without installing them separately. + +.. code-block:: + + $ nextstrain shell . + +2. Create the ``ingest/data`` directory if it doesn't already exist. + +.. code-block:: + + $ mkdir -p ingest/data + +3. Download the dataset with the pathogen NCBI taxonomy ID. + +.. code-block:: + + $ datasets download virus genome taxon \ + --filename ingest/data/ncbi_dataset.zip + +4. Extract and format the metadata as a TSV file for easy inspection + +.. code-block:: + + $ dataformat tsv virus-genome \ + --package ingest/data/ncbi_dataset.zip \ + > ingest/data/raw_metadata.tsv + +5. Exit the Nextstrain shell to return to your usual shell environment. + +.. code-block:: + + $ exit + +The produced ``ingest/data/raw_metadata.tsv`` will contain all of the fields available from NCBI Datasets. + +Updating the workflow config +---------------------------- + +We'll walk through an example custom config to include an additional column in the curated output. +For example, examining the raw NCBI metadata shows us that ``virus-name`` is a NCBI Datasets field that is not currently downloaded by the default Zika ingest workflow. +If you wanted this field to be included in your outputs, you could perform the following steps. + +1. Create a new build config directory ``ingest/build-configs/tutorial/`` + +.. 
code-block:: + + $ mkdir ingest/build-configs/tutorial + +2. Copy the default config to ``ingest/build-configs/tutorial/config.yaml`` + +.. code-block:: + + $ cp ingest/defaults/config.yaml ingest/build-configs/tutorial/config.yaml + +3. Modify the config parameters within your new custom config ``ingest/build-configs/tutorial/config.yaml``. + +* Add ``virus-name`` to the ``ncbi_datasets_fields`` to make the workflow parse the column from the downloaded NCBI data. +* Update the ``curate.field_map`` with an entry for the new field to match the underscore naming scheme of column names. + + .. code-block:: yaml + + curate: + field_map: + virus-name: virus_name + +* Add ``virus_name`` to the ``curate.metadata_columns`` to configure the workflow to include the new column in the final output file. +* (Optional) Remove any other config parameters that you are not modifying + +.. note:: + + Config parameters that are dictionaries will merge with the parameters defined in ``ingest/defaults/config.yaml`` + while all other types will overwrite the default. + See `Snakemake documentation `_ for more details on how configuration files work. + +All config parameters available are listed in the ``ingest/defaults/config.yaml`` file. +Any of the config parameters can be overridden in a custom config file. + +4. Run the ingest workflow again with the custom config file. + +.. code-block:: + + $ nextstrain build ingest --configfile build-configs/tutorial/config.yaml --forceall + Using profile profiles/default and workflow specific profile profiles/default for setting default command line arguments. + Config file defaults/config.yaml is extended by additional config specified via the command line. + Building DAG of jobs… + [...a lot of output...] + +5. Inspect the new ``ingest/results/metadata.tsv`` to see that it now includes the additional ``virus_name`` column. + +Advanced usage: Customizing the ingest workflow +=============================================== + +.. 
note:: + + This section of the tutorial requires an understanding of `Snakemake `_ workflows. + +In addition to configuring the ingest workflow, it is also possible to extend the ingest workflow with your own custom steps. +We'll walk through an example customization that joins additional metadata to the public data that you've curated in the previous steps. + +1. Create an additional metadata file ``ingest/build-configs/tutorial/additional-metadata.tsv`` + +.. code-block:: + + genbank_accession column_A column_B column_C + AF013415 AAAAA BBBBB CCCCC + AF372422 AAAAA BBBBB CCCCC + AY326412 AAAAA BBBBB CCCCC + AY632535 AAAAA BBBBB CCCCC + EU303241 AAAAA BBBBB CCCCC + EU074027 AAAAA BBBBB CCCCC + EU545988 AAAAA BBBBB CCCCC + NC_012532 AAAAA BBBBB CCCCC + DQ859059 AAAAA BBBBB CCCCC + JN860885 AAAAA BBBBB CCCCC + + +2. Create a new rules file ``ingest/build-configs/tutorial/merge-metadata.smk`` + +.. code-block:: + + rule merge_metadata: + input: + metadata="results/metadata.tsv", + additional_metadata="build-configs/tutorial/additional-metadata.tsv", + output: + merged_metadata="results/merged-metadata.tsv" + shell: + """ + tsv-join -H \ + --filter-file {input.additional_metadata} \ + --key-fields "genbank_accession" \ + --append-fields "*" \ + --write-all "?" \ + {input.metadata} > {output.merged_metadata} + """ + +This rule uses `tsv-join `_ to merge the +additional metadata with the metadata output from the ingest workflow. +The records will be merged using the ``genbank_accession`` column and all fields from the ``additional-metadata.tsv`` +file will be appended to the metadata. +Any record in the ``metadata.tsv`` that does not have a matching record in the ``additional-metadata.tsv`` will have a +default ``?`` value in the new columns. + +3. Add the following to the custom config file ``ingest/build-configs/tutorial/config.yaml`` + +.. 
code-block::

    custom_rules:
      - build-configs/tutorial/merge-metadata.smk

The ``custom_rules`` config tells the ingest workflow to include your custom rules so that you can run them as part of the workflow.

4. Run the ingest workflow again with the customized rule.

.. code-block::

    $ nextstrain build ingest merge_metadata --configfile build-configs/tutorial/config.yaml
    Using profile profiles/default and workflow specific profile profiles/default for setting default command line arguments.
    Config file defaults/config.yaml is extended by additional config specified via the command line.
    Building DAG of jobs...
    [...a lot of output...]

5. Inspect the ``ingest/results/merged-metadata.tsv`` file to see that it includes the additional columns ``column_A``, ``column_B``, and ``column_C``.
   The records with the ``genbank_accession`` listed in the ``additional-metadata.tsv`` file should have the placeholder data in the new columns,
   while other records should have the default ``?`` value.

Next steps
==========

* Run the `Zika phylogenetic workflow `_ with the newly ingested data as input
  by running

  .. code-block::

      $ mv ingest/results/* phylogenetic/data/
      $ nextstrain build phylogenetic

  If you've customized the ingest workflow, then you may need to modify the phylogenetic workflow to use the newly ingested data files.
  We are planning to write another tutorial to cover other modifications to your phylogenetic workflow.

* :doc:`Learn how to create an ingest workflow `