Merge pull request #10 from nextstrain/add-ingest
Add ingest
kimandrews authored Feb 14, 2024
2 parents 17dbd67 + e766152 commit 75df4da
Showing 35 changed files with 1,729 additions and 0 deletions.
1 change: 1 addition & 0 deletions CHANGES.md
@@ -1,2 +1,3 @@
# CHANGELOG
* 11 January 2024: Use a config file to define hardcoded parameters and file paths, add a change log. [PR #9](https://github.com/nextstrain/measles/pull/9)
* 14 February 2024: Add ingest directory from pathogen-repo-guide and make measles-specific modifications. [PR #10](https://github.com/nextstrain/measles/pull/10)
43 changes: 43 additions & 0 deletions ingest/README.md
@@ -0,0 +1,43 @@
# Ingest

This workflow ingests public data from NCBI and outputs curated metadata and
sequences that can be used as input for the phylogenetic workflow.

If you have another data source or private data that needs to be formatted for
the phylogenetic workflow, then you can use a similar workflow to curate your
own data.

## Run

From within the `ingest` directory, run the workflow with:

```
nextstrain build .
```

This produces a `results` directory with the following outputs:
- `sequences.fasta`
- `metadata.tsv`

## Defaults

The defaults directory contains all of the default configurations for the ingest workflow.

[defaults/config.yaml](defaults/config.yaml) contains all of the default configuration parameters
used for the ingest workflow. Use Snakemake's `--configfile`/`--config`
options to override these default values.
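
For example, to override a single default parameter or supply a whole custom config (the `my-overrides.yaml` file name below is illustrative):

```
# Override a single parameter on the command line
nextstrain build . --config ncbi_taxon_id="11234"

# Point at a custom config file that overrides multiple defaults
nextstrain build . --configfile my-overrides.yaml
```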

## Snakefile and rules

The rules directory contains separate Snakefiles (`*.smk`) as modules of the core ingest workflow.
The modules of the workflow are in separate files to keep the main ingest [Snakefile](Snakefile) succinct and organized.
Modules are all [included](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes)
in the main Snakefile in the order that they are expected to run.

## Vendored

This repository uses [`git subrepo`](https://github.com/ingydotnet/git-subrepo)
to manage copies of ingest scripts in [vendored](vendored), which originate from [nextstrain/ingest](https://github.com/nextstrain/ingest).

See [vendored/README.md](vendored/README.md#vendoring) for instructions on how to update
the vendored scripts.
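
With `git subrepo` installed, the vendored scripts can typically be refreshed from the top level of this repository with something like:

```
git subrepo pull ingest/vendored
```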
44 changes: 44 additions & 0 deletions ingest/Snakefile
@@ -0,0 +1,44 @@
"""
This is the main ingest Snakefile that orchestrates the full ingest workflow
and defines its default outputs.
"""

# Use default configuration values. Override with Snakemake's --configfile/--config options.
configfile: "defaults/config.yaml"

# This is the default rule that Snakemake will run when there are no specified targets.
# The default output of the ingest workflow is usually the curated metadata and sequences.
# Nextstrain-maintained ingest workflows will produce metadata files with the
# standard Nextstrain fields and additional fields that are pathogen specific.
# We recommend using these standard fields in custom ingests as well to minimize
# the customizations you will need for the downstream phylogenetic workflow.
# TODO: Add link to centralized docs on standard Nextstrain metadata fields
rule all:
    input:
        "results/sequences.fasta",
        "results/metadata.tsv",


# Note that only PATHOGEN-level customizations should be added to these
# core steps, meaning they are custom rules necessary for all builds of the pathogen.
# If there are build-specific customizations, they should be added with the
# custom_rules imported below to ensure that the core workflow is not complicated
# by build-specific rules.
include: "rules/fetch_from_ncbi.smk"
include: "rules/curate.smk"


# Allow users to import custom rules provided via the config.
# Custom rules can extend or override the core workflow.
# A concrete example of using custom rules is the extension of the workflow with
# rules to support the Nextstrain automation that uploads files and sends internal
# Slack notifications.
# For extensions, the user will have to specify the custom rule targets when
# running the workflow.
# For overrides, the custom Snakefile will have to use the `ruleorder` directive
# to allow Snakemake to handle ambiguous rules; see
# https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#handling-ambiguous-rules
if "custom_rules" in config:
for rule_file in config["custom_rules"]:

include: rule_file
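
# As a hypothetical sketch, a build config (e.g. build-configs/custom/config.yaml)
# could extend this workflow with
#
#   custom_rules:
#     - build-configs/custom/upload.smk
#
# and would then be run with its custom targets named explicitly, e.g.
#
#   nextstrain build . upload_all --configfile build-configs/custom/config.yaml
#
# (the file names and the upload_all target are illustrative, not part of this repo)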
6 changes: 6 additions & 0 deletions ingest/defaults/annotations.tsv
@@ -0,0 +1,6 @@
# Manually curated annotations TSV file
# The TSV should not have a header and should have exactly three columns:
# id to match existing metadata, field name, and field value
# If there are multiple annotations for the same id and field, then the last value is used
# Lines starting with '#' are treated as comments
# Any '#' after the field value is treated as a comment.
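# An illustrative entry (the accession and value are hypothetical):
# 'OR123456<tab>location<tab>Los Angeles County'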
119 changes: 119 additions & 0 deletions ingest/defaults/config.yaml
@@ -0,0 +1,119 @@
# This configuration file should contain all required configuration parameters
# for the ingest workflow to run to completion.
#
# Define optional config parameters with their default values here so that users
# do not have to dig through the workflows to figure out the default values


# Required to fetch from NCBI Datasets
ncbi_taxon_id: "11234"

# The list of NCBI Datasets fields to include from NCBI Datasets output
# These need to be the mnemonics of the NCBI Datasets fields; see the docs for the full list of fields
# https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields
# Note: the "accession" field MUST be provided to match with the sequences
ncbi_datasets_fields:
  - accession
  - sourcedb
  - sra-accs
  - isolate-lineage
  - geo-region
  - geo-location
  - isolate-collection-date
  - release-date
  - update-date
  - length
  - host-name
  - isolate-lineage-source
  - biosample-acc
  - submitter-names
  - submitter-affiliation
  - submitter-country
  - virus-name
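
# As a rough sketch, the fetch rules (rules/fetch_from_ncbi.smk, not shown in this
# diff) hand this list to the NCBI Datasets CLI along the lines of
#   dataformat tsv virus-genome --fields accession,sourcedb,...
# (the exact invocation lives in the fetch rules)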

# Config parameters related to the curate pipeline
curate:
  # URL pointing to the public generalized geolocation rules
  # For the Nextstrain team, this is currently
  # 'https://raw.githubusercontent.com/nextstrain/ncov-ingest/master/source-data/gisaid_geoLocationRules.tsv'
  geolocation_rules_url: "https://raw.githubusercontent.com/nextstrain/ncov-ingest/master/source-data/gisaid_geoLocationRules.tsv"
  # The path to the local geolocation rules within the pathogen repo
  # The path should be relative to the ingest directory.
  local_geolocation_rules: "defaults/geolocation_rules.tsv"
  # List of field names to change, where the key is the original field name and the value is the new field name
  # The original field names should match the ncbi_datasets_fields provided above.
  # This is the first step in the pipeline, so any references to field names in the configs below should use the new field names
  field_map:
    accession: accession
    sourcedb: database
    sra-accs: sra_accessions
    isolate-lineage: strain
    geo-region: region
    geo-location: location
    isolate-collection-date: date
    release-date: date_released
    update-date: date_updated
    length: length
    host-name: host
    isolate-lineage-source: sample_type
    biosample-acc: biosample_accessions
    submitter-names: authors
    submitter-affiliation: institution
    submitter-country: submitter_country
    virus-name: virus_name
  # Standardized strain name regex
  # Currently accepts any characters because we do not have a clear standard for strain names across pathogens
  strain_regex: '^.+$'
  # Backup strain name fields to use if 'strain' doesn't match the regex above
  strain_backup_fields: []
  # List of date fields to standardize to ISO format YYYY-MM-DD
  date_fields: ['date', 'date_released', 'date_updated']
  # List of expected date formats that are present in the date fields provided above
  # These date formats should use directives expected by datetime
  # See https://docs.python.org/3.9/library/datetime.html#strftime-and-strptime-format-codes
  expected_date_formats: ['%Y', '%Y-%m', '%Y-%m-%d', '%Y-%m-%dT%H:%M:%SZ']
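  # e.g. augur curate format-dates standardizes a value like '2024-02' (which
  # matches '%Y-%m') to '2024-02-XX', masking the unknown day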
  titlecase:
    # List of string fields to titlecase
    fields: ['region', 'country', 'division', 'location']
    # List of abbreviations not cast to titlecase; these are kept uppercase
    abbreviations: ['USA']
    # Articles that should not be cast to titlecase
    articles: [
      'and', 'd', 'de', 'del', 'des', 'di', 'do', 'en', 'l', 'la', 'las', 'le',
      'los', 'nad', 'of', 'op', 'sur', 'the', 'y'
    ]
  # Metadata field that contains the list of authors associated with the sequence
  authors_field: "authors"
  # Default value to use if the authors field is empty
  authors_default_value: "?"
  # Name to use for the generated abbreviated authors field
  abbr_authors_field: "abbr_authors"
  # Path to the manual annotations file
  # The path should be relative to the ingest directory
  annotations: "defaults/annotations.tsv"
  # The ID field in the metadata to use to merge the manual annotations
  annotations_id: "accession"
  # The ID field in the metadata to use as the sequence id in the output FASTA file
  output_id_field: "accession"
  # The field in the NDJSON record that contains the actual genomic sequence
  output_sequence_field: "sequence"
  # The list of metadata columns to keep in the final output of the curation pipeline.
  metadata_columns: [
    'accession',
    'strain',
    'virus_name',
    'date',
    'region',
    'country',
    'division',
    'location',
    'length',
    'host',
    'date_released',
    'date_updated',
    'sra_accessions',
    'authors',
    'abbr_authors',
    'institution',
  ]

6 changes: 6 additions & 0 deletions ingest/defaults/geolocation_rules.tsv
@@ -0,0 +1,6 @@
# TSV file of geolocation rules with the format:
# '<raw_geolocation><tab><annotated_geolocation>' where the raw and annotated geolocations
# are formatted as '<region>/<country>/<division>/<location>'
# If creating a general rule, then the raw field value can be substituted with '*'
# Lines starting with '#' will be ignored as comments.
# Trailing '#' will be ignored as comments.
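# An illustrative general rule (the values are hypothetical):
# '*/Viet Nam/*/*<tab>*/Vietnam/*/*' rewrites the raw country name to its
# canonical form for any region, division, and location.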
130 changes: 130 additions & 0 deletions ingest/rules/curate.smk
@@ -0,0 +1,130 @@
"""
This part of the workflow handles the curation of data from NCBI
REQUIRED INPUTS:
ndjson = data/ncbi.ndjson
OUTPUTS:
metadata = results/subset_metadata.tsv
sequences = results/sequences.fasta
"""


# The following two rules can be ignored if you choose not to use the
# generalized geolocation rules that are shared across pathogens.
# The Nextstrain team will try to maintain a generalized set of geolocation
# rules that can then be overridden by local geolocation rules per pathogen repo.
rule fetch_general_geolocation_rules:
    output:
        general_geolocation_rules="data/general-geolocation-rules.tsv",
    params:
        geolocation_rules_url=config["curate"]["geolocation_rules_url"],
    shell:
        """
        curl {params.geolocation_rules_url} > {output.general_geolocation_rules}
        """


rule concat_geolocation_rules:
    input:
        general_geolocation_rules="data/general-geolocation-rules.tsv",
        local_geolocation_rules=config["curate"]["local_geolocation_rules"],
    output:
        all_geolocation_rules="data/all-geolocation-rules.tsv",
    shell:
        """
        cat {input.general_geolocation_rules} {input.local_geolocation_rules} > {output.all_geolocation_rules}
        """


def format_field_map(field_map: dict[str, str]) -> str:
    """
    Format dict to `"key1"="value1" "key2"="value2"...` for use in shell commands.
    """
    return " ".join([f'"{key}"="{value}"' for key, value in field_map.items()])


# This curate pipeline is based on existing pipelines for pathogen repos using NCBI data.
# You may want to add and/or remove steps from the pipeline for custom metadata
# curation for your pathogen. Note that the curate pipeline is streaming NDJSON
# records between scripts, so any custom scripts added to the pipeline should expect
# the input as NDJSON records from stdin and output NDJSON records to stdout.
# The final step of the pipeline should convert the NDJSON records to two
# separate files: a metadata TSV and a sequences FASTA.
rule curate:
    input:
        sequences_ndjson="data/ncbi.ndjson",
        # Change the geolocation_rules input path if you are removing the above two rules
        all_geolocation_rules="data/all-geolocation-rules.tsv",
        annotations=config["curate"]["annotations"],
    output:
        metadata="results/all_metadata.tsv",
        sequences="results/sequences.fasta",
    log:
        "logs/curate.txt",
    benchmark:
        "benchmarks/curate.txt"
    params:
        field_map=format_field_map(config["curate"]["field_map"]),
        strain_regex=config["curate"]["strain_regex"],
        strain_backup_fields=config["curate"]["strain_backup_fields"],
        date_fields=config["curate"]["date_fields"],
        expected_date_formats=config["curate"]["expected_date_formats"],
        articles=config["curate"]["titlecase"]["articles"],
        abbreviations=config["curate"]["titlecase"]["abbreviations"],
        titlecase_fields=config["curate"]["titlecase"]["fields"],
        authors_field=config["curate"]["authors_field"],
        authors_default_value=config["curate"]["authors_default_value"],
        abbr_authors_field=config["curate"]["abbr_authors_field"],
        annotations_id=config["curate"]["annotations_id"],
        id_field=config["curate"]["output_id_field"],
        sequence_field=config["curate"]["output_sequence_field"],
    shell:
        """
        (cat {input.sequences_ndjson} \
            | ./vendored/transform-field-names \
                --field-map {params.field_map} \
            | augur curate normalize-strings \
            | ./vendored/transform-strain-names \
                --strain-regex {params.strain_regex} \
                --backup-fields {params.strain_backup_fields} \
            | augur curate format-dates \
                --date-fields {params.date_fields} \
                --expected-date-formats {params.expected_date_formats} \
            | ./vendored/transform-genbank-location \
            | augur curate titlecase \
                --titlecase-fields {params.titlecase_fields} \
                --articles {params.articles} \
                --abbreviations {params.abbreviations} \
            | ./vendored/transform-authors \
                --authors-field {params.authors_field} \
                --default-value {params.authors_default_value} \
                --abbr-authors-field {params.abbr_authors_field} \
            | ./vendored/apply-geolocation-rules \
                --geolocation-rules {input.all_geolocation_rules} \
            | ./vendored/merge-user-metadata \
                --annotations {input.annotations} \
                --id-field {params.annotations_id} \
            | augur curate passthru \
                --output-metadata {output.metadata} \
                --output-fasta {output.sequences} \
                --output-id-field {params.id_field} \
                --output-seq-field {params.sequence_field} ) 2>> {log}
        """


rule subset_metadata:
    input:
        metadata="results/all_metadata.tsv",
    output:
        subset_metadata="results/metadata.tsv",
    params:
        metadata_fields=",".join(config["curate"]["metadata_columns"]),
    shell:
        """
        tsv-select -H -f {params.metadata_fields} \
            {input.metadata} > {output.subset_metadata}
        """