Add ingest #10

Merged
17 commits merged on Feb 14, 2024
1 change: 1 addition & 0 deletions CHANGES.md
@@ -1,2 +1,3 @@
# CHANGELOG
* 11 January 2024: Use a config file to define hardcoded parameters and file paths, add a change log. [PR #9](https://github.com/nextstrain/measles/pull/9)
* 14 February 2024: Add ingest directory from pathogen-repo-guide and make measles-specific modifications. [PR #10](https://github.com/nextstrain/measles/pull/10)
43 changes: 43 additions & 0 deletions ingest/README.md
@@ -0,0 +1,43 @@
# Ingest
(jameshadfield marked this conversation as resolved.)

This workflow ingests public data from NCBI and outputs curated metadata and
sequences that can be used as input for the phylogenetic workflow.

If you have another data source or private data that needs to be formatted for
the phylogenetic workflow, then you can use a similar workflow to curate your
own data.

## Run

From within the `ingest` directory, run the workflow with:

```
nextstrain build .
```

(joverlee521 marked this conversation as resolved.)

This produces a `results` directory with the following outputs:
- sequences.fasta
- metadata.tsv
Member comment:
It also includes all_metadata.tsv.

In phylo workflows we have {data,results,auspice} and it's understood that the files in results can be numerous and change frequently with workflow updates. For ingest we only have {data,results}. My understanding is that more complex ingest workflows will populate results with many files. So maybe we could change the wording here to indicate that of the files in results these two are the ones that should be used for downstream analysis.

Contributor reply:

Ah, the all_metadata.tsv looks like an intermediate data file, not the final result metadata.tsv.

@kimandrews, you can change the path from results/all_metadata.tsv to data/all_metadata.tsv so there's less confusion about what the final output files should be (e.g. results/sequences.fasta and results/metadata.tsv).

metadata="results/all_metadata.tsv",

rule subset_metadata:
input:
metadata="results/all_metadata.tsv",
output:
subset_metadata="results/metadata.tsv",

Author reply:

Done in f3fe0c1


## Defaults

The defaults directory contains all of the default configurations for the ingest workflow.

[defaults/config.yaml](defaults/config.yaml) contains all of the default configuration parameters
used for the ingest workflow. Use Snakemake's `--configfile`/`--config`
options to override these default values.
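For example, a minimal overrides file (file name and values illustrative) could be passed along to Snakemake as `nextstrain build . --configfile my-overrides.yaml`:

```yaml
# my-overrides.yaml — hypothetical example; any key from
# defaults/config.yaml can be overridden here.
ncbi_taxon_id: "11234"
```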

## Snakefile and rules

The rules directory contains separate Snakefiles (`*.smk`) as modules of the core ingest workflow.
The modules of the workflow are in separate files to keep the main ingest [Snakefile](Snakefile) succinct and organized.
Modules are all [included](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes)
in the main Snakefile in the order that they are expected to run.

## Vendored

This repository uses [`git subrepo`](https://github.com/ingydotnet/git-subrepo)
to manage copies of ingest scripts in [vendored](vendored), from [nextstrain/ingest](https://github.com/nextstrain/ingest).

See [vendored/README.md](vendored/README.md#vendoring) for instructions on how to update
the vendored scripts.
44 changes: 44 additions & 0 deletions ingest/Snakefile
@@ -0,0 +1,44 @@
"""
This is the main ingest Snakefile that orchestrates the full ingest workflow
and defines its default outputs.
"""

# Use default configuration values. Override with Snakemake's --configfile/--config options.
configfile: "defaults/config.yaml"

# This is the default rule that Snakemake will run when there are no specified targets.
# The default output of the ingest workflow is usually the curated metadata and sequences.
# Nextstrain maintained ingest workflows will produce metadata files with the
# standard Nextstrain fields and additional fields that are pathogen specific.
# We recommend using these standard fields in custom ingests as well to minimize
# the customizations you will need for the downstream phylogenetic workflow.
# TODO: Add link to centralized docs on standard Nextstrain metadata fields
rule all:
input:
"results/sequences.fasta",
"results/metadata.tsv",


# Note that only PATHOGEN level customizations should be added to these
# core steps, meaning they are custom rules necessary for all builds of the pathogen.
# If there are build specific customizations, they should be added with the
# custom_rules imported below to ensure that the core workflow is not complicated
# by build specific rules.
include: "rules/fetch_from_ncbi.smk"
include: "rules/curate.smk"


# Allow users to import custom rules provided via the config.
# This allows users to run custom rules that can extend or override the workflow.
# A concrete example of using custom rules is the extension of the workflow with
# rules to support the Nextstrain automation that upload files and send internal
# Slack notifications.
# For extensions, the user will have to specify the custom rule targets when
# running the workflow.
# For overrides, the custom Snakefile will have to use the `ruleorder` directive
# to allow Snakemake to handle ambiguous rules:
# https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#handling-ambiguous-rules
if "custom_rules" in config:
for rule_file in config["custom_rules"]:

include: rule_file
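The `custom_rules` hook above is driven entirely by config; a hypothetical override file (the `.smk` path is illustrative) might contain:

```yaml
custom_rules:
  - "build-configs/nextstrain-automation/upload.smk"
```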
6 changes: 6 additions & 0 deletions ingest/defaults/annotations.tsv
@@ -0,0 +1,6 @@
# Manually curated annotations TSV file
# The TSV should not have a header and should have exactly three columns:
# id to match existing metadata, field name, and field value
# If there are multiple annotations for the same id and field, then the last value is used
# Lines starting with '#' are treated as comments
# Any '#' after the field value is treated as a comment.
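An annotation following this format might look like (accession and value hypothetical; columns are tab-separated):

```
MK012345	host	Homo sapiens # corrected from submitter metadata
```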
119 changes: 119 additions & 0 deletions ingest/defaults/config.yaml
@@ -0,0 +1,119 @@
# This configuration file should contain all required configuration parameters
# for the ingest workflow to run to completion.
#
# Define optional config parameters with their default values here so that users
# do not have to dig through the workflows to figure out the default values


# Required to fetch from NCBI Datasets
ncbi_taxon_id: "11234"

# The list of NCBI Datasets fields to include from NCBI Datasets output
# These need to be the mnemonics of the NCBI Datasets fields; see docs for the full list of fields
# https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields
# Note: the "accession" field MUST be provided to match with the sequences
ncbi_datasets_fields:
- accession
- sourcedb
- sra-accs
- isolate-lineage
- geo-region
- geo-location
- isolate-collection-date
- release-date
- update-date
- length
- host-name
- isolate-lineage-source
- biosample-acc
- submitter-names
- submitter-affiliation
- submitter-country
- virus-name

# Config parameters related to the curate pipeline
curate:
# URL pointing to the public generalized geolocation rules
# For the Nextstrain team, this is currently
# 'https://raw.githubusercontent.com/nextstrain/ncov-ingest/master/source-data/gisaid_geoLocationRules.tsv'
geolocation_rules_url: "https://raw.githubusercontent.com/nextstrain/ncov-ingest/master/source-data/gisaid_geoLocationRules.tsv"
# The path to the local geolocation rules within the pathogen repo
# The path should be relative to the ingest directory.
local_geolocation_rules: "defaults/geolocation_rules.tsv"
# List of field names to change where the key is the original field name and the value is the new field name
# The original field names should match the ncbi_datasets_fields provided above.
# This is the first step in the pipeline, so any references to field names in the configs below should use the new field names
field_map:
accession: accession
sourcedb: database
sra-accs: sra_accessions
isolate-lineage: strain
geo-region: region
geo-location: location
isolate-collection-date: date
release-date: date_released
update-date: date_updated
length: length
host-name: host
isolate-lineage-source: sample_type
biosample-acc: biosample_accessions
submitter-names: authors
submitter-affiliation: institution
submitter-country: submitter_country
virus-name: virus_name
# Standardized strain name regex
# Currently accepts any characters because we do not have a clear standard for strain names across pathogens
strain_regex: '^.+$'
# Back up strain name field to use if 'strain' doesn't match regex above
strain_backup_fields: []
# List of date fields to standardize to ISO format YYYY-MM-DD
date_fields: ['date', 'date_released', 'date_updated']
# List of expected date formats that are present in the date fields provided above
# These date formats should use directives expected by datetime
# See https://docs.python.org/3.9/library/datetime.html#strftime-and-strptime-format-codes
expected_date_formats: ['%Y', '%Y-%m', '%Y-%m-%d', '%Y-%m-%dT%H:%M:%SZ']
titlecase:
# List of string fields to titlecase
fields: ['region', 'country', 'division', 'location']
# List of abbreviations that should not be cast to titlecase; they are kept uppercase
abbreviations: ['USA']
# Articles that should not be cast to titlecase
articles: [
'and', 'd', 'de', 'del', 'des', 'di', 'do', 'en', 'l', 'la', 'las', 'le',
'los', 'nad', 'of', 'op', 'sur', 'the', 'y'
]
# Metadata field that contains the list of authors associated with the sequence
authors_field: "authors"
# Default value to use if the authors field is empty
authors_default_value: "?"
# Name to use for the generated abbreviated authors field
abbr_authors_field: "abbr_authors"
# Path to the manual annotations file
# The path should be relative to the ingest directory
annotations: "defaults/annotations.tsv"
# The ID field in the metadata to use to merge the manual annotations
annotations_id: "accession"
# The ID field in the metadata to use as the sequence id in the output FASTA file
output_id_field: "accession"
# The field in the NDJSON record that contains the actual genomic sequence
output_sequence_field: "sequence"
# The list of metadata columns to keep in the final output of the curation pipeline.
metadata_columns: [
'accession',
'strain',
'virus_name',
'date',
'region',
'country',
'division',
'location',
'length',
'host',
'date_released',
'date_updated',
'sra_accessions',
'authors',
'abbr_authors',
'institution',
]

6 changes: 6 additions & 0 deletions ingest/defaults/geolocation_rules.tsv
@@ -0,0 +1,6 @@
# TSV file of geolocation rules with the format:
# '<raw_geolocation><tab><annotated_geolocation>' where the raw and annotated geolocations
# are formatted as '<region>/<country>/<division>/<location>'
# If creating a general rule, then the raw field value can be substituted with '*'
# Lines starting with '#' will be ignored as comments.
# Trailing '#' will be ignored as comments.
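A rule following this format might look like (entries hypothetical; columns are tab-separated):

```
# Normalize a raw country spelling to its canonical form
Europe/Uk/*/*	Europe/United Kingdom/*/*
```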
130 changes: 130 additions & 0 deletions ingest/rules/curate.smk
@@ -0,0 +1,130 @@
"""
This part of the workflow handles the curation of data from NCBI

REQUIRED INPUTS:

ndjson = data/ncbi.ndjson

OUTPUTS:

metadata = results/subset_metadata.tsv
sequences = results/sequences.fasta

"""


# The following two rules can be ignored if you choose not to use the
# generalized geolocation rules that are shared across pathogens.
# The Nextstrain team will try to maintain a generalized set of geolocation
# rules that can then be overridden by local geolocation rules per pathogen repo.
rule fetch_general_geolocation_rules:
output:
general_geolocation_rules="data/general-geolocation-rules.tsv",
params:
geolocation_rules_url=config["curate"]["geolocation_rules_url"],
shell:
"""
curl {params.geolocation_rules_url} > {output.general_geolocation_rules}
"""


rule concat_geolocation_rules:
input:
general_geolocation_rules="data/general-geolocation-rules.tsv",
local_geolocation_rules=config["curate"]["local_geolocation_rules"],
output:
all_geolocation_rules="data/all-geolocation-rules.tsv",
shell:
"""
cat {input.general_geolocation_rules} {input.local_geolocation_rules} >> {output.all_geolocation_rules}
"""


def format_field_map(field_map: dict[str, str]) -> str:
"""
Format dict to `"key1"="value1" "key2"="value2"...` for use in shell commands.
"""
return " ".join([f'"{key}"="{value}"' for key, value in field_map.items()])
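For illustration, here is a self-contained copy of the helper above with a usage example showing the shell-ready string it produces:

```python
def format_field_map(field_map: dict[str, str]) -> str:
    """Format dict to `"key1"="value1" "key2"="value2"...` for use in shell commands."""
    return " ".join([f'"{key}"="{value}"' for key, value in field_map.items()])

# e.g. with a slice of the field_map from defaults/config.yaml:
print(format_field_map({"geo-region": "region", "isolate-collection-date": "date"}))
# "geo-region"="region" "isolate-collection-date"="date"
```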


# This curate pipeline is based on existing pipelines for pathogen repos using NCBI data.
# You may want to add and/or remove steps from the pipeline for custom metadata
# curation for your pathogen. Note that the curate pipeline is streaming NDJSON
# records between scripts, so any custom scripts added to the pipeline should expect
# the input as NDJSON records from stdin and output NDJSON records to stdout.
# The final step of the pipeline should convert the NDJSON records to two
# separate files: a metadata TSV and a sequences FASTA.
rule curate:
jameshadfield (Member) comment:

Consider the following an idea to implement later, it doesn't have to be part of this PR

The virus name column contains information the phylo build will certainly want to show. It may even be the default colouring in auspice (or may be used as the basis for nextclade clade assignment, which will then become the default colouring). The structure of values here is diverse, but the majority follow a straightforward pattern. Using `cat results/metadata.tsv | cut -f 3 | sort | uniq -c | sort -r`, the top 10 are:

6586 Measles morbillivirus
6557 Measles virus genotype D8
4883 Measles virus genotype B3
1546 Measles virus genotype D4
1341 Measles virus genotype H1
 668 Measles virus genotype D9
 158 Measles virus genotype H1a
 137 Measles virus genotype A
 131 Measles virus genotype D5
  72 Measles virus genotype B2

I don't think it'd be too hard to write a small script to extract as many genotypes as possible from this.

Contributor comment:

[also non-blocking comment]

I concur with @jameshadfield regarding the inclusion of genotype as a user-desired feature to display, and a first step for creating a nextclade dataset. (exciting! :D ) You can open an issue titled "Parse genotype from NCBI data." Once initiated, you can proceed by creating a distinct branch and subsequent pull request to maintain a focused scope (and focused review process :) ).

To address this task, you have the option to either develop a new script from the ground up or to adapt existing scripts such as fix-zika-strain-names.py or vendored/transform-authors. These scripts already read NDJSON records, but you'd have to modify them to parse virus-name (or virus_name) and infer a new "genotype" field, and invoke the script somewhere in the curate rule. I'm happy to discuss this when you get to this task or if you have any questions.
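Neither script exists yet for this purpose; purely as a sketch of the suggested approach (the helper name and regex are hypothetical, not part of this PR), the genotype could be pulled from the virus name like so:

```python
import re

# Hypothetical helper: infer a "genotype" field from NCBI's virus name,
# e.g. "Measles virus genotype D8" -> "D8".
GENOTYPE_PATTERN = re.compile(r"genotype\s+([A-Za-z]\d*[a-z]?)")

def infer_genotype(virus_name: str) -> str:
    """Return the genotype embedded in `virus_name`, or '' if none is found."""
    match = GENOTYPE_PATTERN.search(virus_name or "")
    return match.group(1) if match else ""

print(infer_genotype("Measles virus genotype D8"))  # D8
print(infer_genotype("Measles morbillivirus"))      # (empty string)
```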

Author reply:

I opened an issue for this here

input:
sequences_ndjson="data/ncbi.ndjson",
# Change the geolocation_rules input path if you are removing the above two rules
all_geolocation_rules="data/all-geolocation-rules.tsv",
annotations=config["curate"]["annotations"],
output:
metadata="results/all_metadata.tsv",
sequences="results/sequences.fasta",
log:
"logs/curate.txt",
benchmark:
"benchmarks/curate.txt"
params:
field_map=format_field_map(config["curate"]["field_map"]),
strain_regex=config["curate"]["strain_regex"],
strain_backup_fields=config["curate"]["strain_backup_fields"],
date_fields=config["curate"]["date_fields"],
expected_date_formats=config["curate"]["expected_date_formats"],
articles=config["curate"]["titlecase"]["articles"],
abbreviations=config["curate"]["titlecase"]["abbreviations"],
titlecase_fields=config["curate"]["titlecase"]["fields"],
authors_field=config["curate"]["authors_field"],
authors_default_value=config["curate"]["authors_default_value"],
abbr_authors_field=config["curate"]["abbr_authors_field"],
annotations_id=config["curate"]["annotations_id"],
id_field=config["curate"]["output_id_field"],
sequence_field=config["curate"]["output_sequence_field"],
shell:
"""
(cat {input.sequences_ndjson} \
| ./vendored/transform-field-names \
--field-map {params.field_map} \
| augur curate normalize-strings \
| ./vendored/transform-strain-names \
--strain-regex {params.strain_regex} \
--backup-fields {params.strain_backup_fields} \
| augur curate format-dates \
--date-fields {params.date_fields} \
--expected-date-formats {params.expected_date_formats} \
| ./vendored/transform-genbank-location \
| augur curate titlecase \
--titlecase-fields {params.titlecase_fields} \
--articles {params.articles} \
--abbreviations {params.abbreviations} \
| ./vendored/transform-authors \
--authors-field {params.authors_field} \
--default-value {params.authors_default_value} \
--abbr-authors-field {params.abbr_authors_field} \
| ./vendored/apply-geolocation-rules \
--geolocation-rules {input.all_geolocation_rules} \
| ./vendored/merge-user-metadata \
--annotations {input.annotations} \
--id-field {params.annotations_id} \
| augur curate passthru \
--output-metadata {output.metadata} \
--output-fasta {output.sequences} \
--output-id-field {params.id_field} \
--output-seq-field {params.sequence_field} ) 2>> {log}
"""


rule subset_metadata:
input:
metadata="results/all_metadata.tsv",
output:
subset_metadata="results/metadata.tsv",
params:
metadata_fields=",".join(config["curate"]["metadata_columns"]),
shell:
"""
tsv-select -H -f {params.metadata_fields} \
{input.metadata} > {output.subset_metadata}
"""