Skip to content

Commit

Permalink
WIP add docs to README
Browse files Browse the repository at this point in the history
WIP as
- I removed the quickstart section. Do we want to maintain this?
- I need to check the "other build customisations" section and ensure
  they still work with the current setup
- this is a prototype, and so the invocation syntax is all subject to
  change!
  • Loading branch information
jameshadfield committed Dec 12, 2024
1 parent 585218f commit c60554a
Showing 1 changed file with 126 additions and 10 deletions.
136 changes: 126 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,10 +6,20 @@ Please see [nextstrain.org/docs](https://nextstrain.org/docs) for details about

The Snakemake pipeline is parameterised by two config files, one for the A/H5N1, A/H5NX, A/H7N9, and A/H9N2 builds and one for the 2024 A/H5N1 cattle-flu outbreak.

### Table of Contents

## Segment-level GISAID builds
* [Running segment-level GISAID builds](#running-segment-level-gisaid-builds)
* [Running H5N1 Cattle Outbreak (2024) builds](#running-h5n1-cattle-outbreak-2024-builds)
* [Creating a custom build via config overlays](#creating-a-custom-build-via-config-overlays)
* [Running builds in a separate working directory](#running-builds-in-a-separate-working-directory)
* [Cleavage Site Annotations](#cleavage-side-annotations)
* [H5 Clade Labeling](#h5-clade-labeling)
* [Other build customisations](#other-build-customisations)

The `config/gisaid.yaml` config builds 32 Auspice datasets (8 segments x 4 subtypes (A/H5N1, A/H5NX, A/H7N9, A/H9N2)) using GISAID data and can be run via

## Running segment-level GISAID builds

The `config/gisaid.yaml` config builds 48 Auspice datasets (8 segments x 4 subtypes (A/H5N1, A/H5NX, A/H7N9, A/H9N2) x 1-2 time resolutions (all-time, 2y)) using GISAID data and can be run via

```bash
snakemake --cores 1 -pf --configfile config/gisaid.yaml
Expand All @@ -34,7 +44,7 @@ to nextstrain.org by running:
nextstrain build . --configfile config/gisaid.yaml -f deploy_all
```

## H5N1 Cattle Outbreak (2024)
## Running H5N1 Cattle Outbreak (2024) builds

We produce per-segment and whole-genome (concatenated segments) builds for the ongoing H5N1 cattle-flu outbreak.
These use NCBI data including consensus genomes and SRA data assembled via the Andersen lab's [avian-influenza repo](https://github.com/andersen-lab/avian-influenza).
Expand Down Expand Up @@ -64,15 +74,119 @@ This should allow any reassortments to be highlighted and will also include outb
> Note that generating any segment-level build here will necessarily build the genome tree, as it's needed to identify the clade of interest in each segment.

## Creating a custom build
The easiest way to generate your own, custom avian-flu build is to use the quickstart-build as a starting template. Simply clone the quickstart-build, run with the example data, and edit the Snakefile to customize. This build includes example data and a simplified, heavily annotated Snakefile that goes over the structure of Snakefiles and annotates rules and inputs/outputs that can be modified. This build, with it's own readme, is available [here](https://github.com/nextstrain/avian-flu/tree/master/quickstart-build).
## Creating a custom build via config overlays

The two config files introduced above are designed to be extended (customised) by overlaying your own YAML configurations.
The aim is to allow easy customisation of the workflow outputs (e.g. which subtypes / segments / time-frames to run) and the stopping points (Auspice JSONs or particular intermediate files) via such overlays.
Additionally, modification of parameters (e.g. clock rates, minimum lengths, subsampling criteria, DTA metadata) is possible through overlays without needing to modify the underlying base config or Snakemake pipeline itself.

Config overlays allow you to essentially maintain one or more modifications of the workflow for your own purposes.
For instance you may want a way to easily run only H5N1 HA+NA 2y builds, and a config overlay can achieve this.
Using an overlay keeps this change separate from the config used for Nextstrain automation and also avoids `git` conflicts emerging over time.
When combined with running in a separate working directory (explained below) this becomes even more powerful.

As an example we'll create a custom config `config_local.yaml` which you can then add to the build command described above, e.g. `nextstrain build . --configfile config/gisaid.yaml config_local.yaml`.
We'll use the GISAID config as the example we're extending, however the concepts are the similar for the H5N1 cattle outbreak config.


### Restrict which builds we want to produce

By default we produce 48 Auspice JSONs (4 subtypes * 8 segments * 1-2 time resolutions). We can restrict these by redefining the builds in our config overlay. For instance the following will produce only 2 datasets, `h5n1/ha/2y` and `h5n1/na/2y`:
```yaml
builds:
- subtype: h5n1
segment:
- ha
- na
time: 2y
```
Here `builds` is an array of sub-configs, each of which define a combination of subtype, segment and time parameters. Each subtype, segment and time can be a single string (as subtype and time are, above) or an array (as segment is, above).

### Only produce certain intermediate files, not Auspice datasets

By default the "target" for each build is the Auspice JSON (`auspice/avian-flu_h5n1_ha_2y.json` etc) however we can change this if we just want certain intermediate files. Adding the following to the config overlay will stop once we've filtered to the metadata & sequences we would use for each tree

```yaml
target_patterns:
- "results/{subtype}/{segment}/{time}/metadata.tsv"
- "results/{subtype}/{segment}/{time}/sequences.tsv"
```

(This works in combination with the custom `builds` definition, above).

### Change the target number of sequences per build

In the `config/gisaid.yaml` this parameter is defined as

```yaml
filter:
target_sequences_per_tree:
"*/*/*": 3000
```

Where the `"*/*/*"` syntax is slash-separated matchers for subtype, segment and time-resolution, and the `*` character means "match everything".
So here we're saying "for every subtype, for every segment, for every time-resolution target 3000 sequences".

If we want to change our h5n1 builds to instead have 5000 sequences (whilst keeping the rest at 3000) we could add the following to our config overlay:
```yaml
filter:
target_sequences_per_tree:
"h5n1/*/*": 5000
```
And since `"h5n1/*/*"` is more specific than `"*/*/*"` it'll take precedence when `subtype="h5n1"`.
(Internally, Snakemake merges these configs together resulting in `'target_sequences_per_tree': {'h5n1/*/*': 5000, '*/*/*': 3000}`. For each combination of subtype/segment/time values within the workflow we consult these dictionaries and pick the most specific match.)


This syntax is concise but powerful, for instance we can parameterise the builds like so:
```yaml
filter:
target_sequences_per_tree:
"*/*/all-time": 5000 # target 5k sequences for the all-time builds
"h5n1/*/all-time": 10000 # but for h5n1 all-time builds target 10k sequences (this is more specific as it only includes one '*' character)
"*/*/*": 1000 # target 1k sequences for any other builds (i.e. the 2y builds)
```



### Change other parameters

The pattern introduced in the preceeding section generally applies for all parameters in the workflow.
By reading through the base config YAML (e.g. `config/gisaid.yaml`) you should be able to learn the config structure and add then modify that within your overlay config as needed.

If the parameter is not exposed via the config YAML and you find yourself modifying the underlying Snakefile consider exposing it via the config so that this customisation then becomes available to config overlays.


## Running builds in a separate working directory

> Note: This section doesn't yet work using AWS / docker runtimes but support is forthcoming.

## Features unique to avian flu builds
We can run the worfklow from an analysis directory separate to this repo. Imagine the following directory structure:

```
├── avian-flu
│ ├── README.md
│ ├── Snakefile
│ └── ... (lots more files here)
└── analysis_dir
└── config.yaml
```

Where `config.yaml` is the your config overlay, introduced in the previous section as `config_local.yaml`.

> TODO XXX following uses a different invocation - this is all in flux

From within the `analysis_dir` we can run `snakemake --snakefile ../avian-flu/gisaid/Snakefile --cores 1` and we'll automatically run the workflow using your `config.yaml` config overlay.
This keeps the workflow outputs separate and isolated from the workflow itself.
Depending on how you run builds this can be very liberating; for instance if you are switching between versions of the workflow you can maintain different analysis directories for each version, or if you are running the workflow at different times you can separate the analyses for easy before/after comparisons.

> You don't have to name it `config.yaml`, but if you use a different filename you'll have to specify it via `--configfile <filename>`.


## Cleavage Site Annotations

#### cleavage site annotations
Influenza virus HA is translated as a single peptide (HA0) that is cleaved to form the mature, functional form (HA1 and HA2). In all human strains and many avian strains, the cleavage site is composed of a single, basic amino acid residue. However, some avian influenza subtypes, particularly H5s, have acquired additional basic residues immediately preceding the HA cleavage site. In some cases, this results in addition of a furin cleavage motif, allowing HA to be cleaved by furin, which is ubiquitously expressed, and allows for viral replication across a range of tissues. The addition of this "polybasic cleavage site" is one of the prime determinants of avian influenza virulence. In these builds, we have annotated whether strains contain a furin cleavage motif, defined here as the sequence `R-X-K/R-R` immediately preceding the start of HA2, where `X` can be any amino acid. We have also added a color by for the cleavage site sequence, which we define here as the 4 bases preceding HA2.

#### clade labeling
## H5 Clade Labeling
H5 viruses are classified into clades, which are currently viewable as a color by on [nextstrain.org](https://nextstrain.org/avian-flu/h5n1/ha?c=h5_label_clade). Because clade annotations are not available in all public databases, we annotate sequences with their most likely clade using a tool developed by Samuel S. Shepard at CDC called [LABEL](https://wonder.cdc.gov/amd/flu/label/). The assigned clade for each H5N1 or H5Nx sequence are available as public tsvs [here](https://github.com/nextstrain/avian-flu/tree/master/clade-labeling).

To update the `clades.tsv` file with clade annotations for new sequences, run:
Expand All @@ -87,7 +201,9 @@ To string these together and update the `clades.tsv` file for new sequences and

`snakemake -s Snakefile.clades -p --cores 1 && snakemake -p --cores 1`

#### Using the same strains for all segments
## Other build customisations

### Using the same strains for all segments

By default we subsample data for each segment independently.
Alternatively, you can ask the pipeline to use the same strains for each segment.
Expand All @@ -106,7 +222,7 @@ If you are using `nextstrain build` then add that to the end of the command (i.e

Note that you may need to remove any existing data in `results/` in order for snakemake to correctly regenerate the intermediate files.

#### Using locally ingested data (instead of downloading from S3)
### Using locally ingested data (instead of downloading from S3)

Run the pipeline with `--config 'local_ingest=True'` to use the locally available files produced by the ingest pipeline (see `./ingest/README.md` for details on how to run).
Specifically, the files needed are `ingest/results/metadata.tsv` and `ingest/results/sequences_{SEGMENT}.fasta`.
Expand Down

0 comments on commit c60554a

Please sign in to comment.