Merge branch 'main' into smw-flye-dev
fraser-combe authored Dec 17, 2024
2 parents fcd39b9 + 4fbb73e commit 47155c0
Showing 128 changed files with 2,834 additions and 1,047 deletions.
12 changes: 11 additions & 1 deletion .dockstore.yml
@@ -195,6 +195,11 @@ workflows:
primaryDescriptorPath: /workflows/utilities/data_import/wf_terra_2_bq.wdl
testParameterFiles:
- /tests/inputs/empty.json
- name: Fetch_SRR_Accession_PHB
subclass: WDL
primaryDescriptorPath: /workflows/utilities/data_import/wf_fetch_srr_accession.wdl
testParameterFiles:
- /tests/inputs/empty.json
- name: Concatenate_Column_Content_PHB
subclass: WDL
primaryDescriptorPath: /workflows/utilities/file_handling/wf_concatenate_column.wdl
@@ -282,4 +287,9 @@ workflows:
subclass: WDL
primaryDescriptorPath: /workflows/phylogenetics/wf_snippy_streamline_fasta.wdl
testParameterFiles:
- /tests/inputs/empty.json
- /tests/inputs/empty.json
- name: Concatenate_Illumina_Lanes_PHB
subclass: WDL
primaryDescriptorPath: /workflows/utilities/file_handling/wf_concatenate_illumina_lanes.wdl
testParameterFiles:
- /tests/inputs/empty.json
3 changes: 2 additions & 1 deletion .github/PULL_REQUEST_TEMPLATE.md
@@ -45,7 +45,8 @@ This PR uses an element that could cause duplicate runs to have different result
- [ ] The workflow/task has been tested and results, including file contents, are as anticipated
- [ ] The CI/CD has been adjusted and tests are passing (Theiagen developers)
- [ ] Code changes follow the [style guide](https://theiagen.notion.site/Style-Guide-WDL-Workflow-Development-51b66a47dde54c798f35d673fff80249)
- [ ] Documentation and/or workflow diagrams have been updated if applicable (Theiagen developers only)
- [ ] Documentation and/or workflow diagrams have been updated if applicable
- [ ] You have updated the latest version for any affected workflows in the respective workflow documentation page and for every entry in the three `workflows_overview` tables.

## 🎯 Reviewer Checklist
<!-- Indicate NA when not applicable -->
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,3 +1,4 @@
cromwell*
_LAST
2024*
site/
10 changes: 6 additions & 4 deletions README.md
@@ -42,30 +42,32 @@ You can expect a careful review of every PR and feedback as needed before mergin

### Authorship

(Ordered by contribution [# of lines changed] as of 2024-08-01)
(Ordered by contribution [# of lines changed] as of 2024-12-04)

* **Sage Wright** ([@sage-wright](https://github.com/sage-wright)) - Conceptualization, Software, Validation, Supervision
* **Inês Mendes** ([@cimendes](https://github.com/cimendes)) - Software, Validation
* **Curtis Kapsak** ([@kapsakcj](https://github.com/kapsakcj)) - Conceptualization, Software, Validation
* **James Otieno** ([@jrotieno](https://github.com/jrotieno)) - Software, Validation
* **Frank Ambrosio** ([@frankambrosio3](https://github.com/frankambrosio3)) - Conceptualization, Software, Validation
* **Michelle Scribner** ([@michellescribner](https://github.com/michellescribner)) - Software, Validation
* **Kevin Libuit** ([@kevinlibuit](https://github.com/kevinlibuit)) - Conceptualization, Project Administration, Software, Validation, Supervision
* **Emma Doughty** ([@emmadoughty](https://github.com/emmadoughty)) - Software, Validation
* **Fraser Combe** ([@fraser-combe](https://github.com/fraser-combe)) - Software, Validation
* **Andrew Page** ([@andrewjpage](https://github.com/andrewjpage)) - Project Administration, Software, Supervision
* **Michal Babinski** ([@Michal-Babins](https://github.com/Michal-Babins)) - Software, Validation
* **Andrew Lang** ([@AndrewLangVt](https://github.com/AndrewLangVt)) - Software, Supervision
* **Kelsey Kropp** ([@kelseykropp](https://github.com/kelseykropp)) - Validation
* **Emily Smith** ([@emily-smith1](https://github.com/emily-smith1)) - Validation
* **Joel Sevinsky** ([@sevinsky](https://github.com/sevinsky)) - Conceptualization, Project Administration, Supervision

### External Contributors

We would like to gratefully acknowledge the following individuals from the public health community for their contributions to the PHB repository:

* **James Otieno** ([@jrotieno](https://github.com/jrotieno))
* **Robert Petit** ([@rpetit3](https://github.com/rpetit3))
* **Emma Doughty** ([@emmadoughty](https://github.com/emmadoughty))
* **Ash O'Farrell** ([@aofarrel](https://github.com/aofarrel))
* **Sam Baird** ([@sam-baird](https://github.com/sam-baird))
* **Holly Halstead** ([@HNHalstead](https://github.com/HNHalstead))
* **Emily Smith** ([@emily-smith1](https://github.com/emily-smith1))

### Maintaining PHB Pipelines

Binary file modified docs/assets/figures/Freyja_FASTQ.png
Binary file modified docs/assets/figures/TheiaProk.png
10 changes: 6 additions & 4 deletions docs/assets/new_workflow_template.md
@@ -4,14 +4,16 @@

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Workflow Type](../../workflows_overview/workflows_type.md/#link-to-workflow-type) | [Applicable Kingdom](../../workflows_overview/workflows_kingdom.md/#link-to-applicable-kingdom) | PHB <version with last changes> | <command-line compatibility> | <workflow level on terra> |
| [Link to Workflow Type](../../workflows_overview/workflows_type.md/#link-to-workflow-type) | [Link to Applicable Kingdom](../../workflows_overview/workflows_kingdom.md/#link-to-applicable-kingdom) | PHB <version with last changes\> | <command-line compatibility\> | <workflow level on terra (set or sample)\> |

## Workflow_Name_On_Terra

Description of the workflow.

### Inputs

Inputs should be ordered as they appear on Terra

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| task_name | **variable_name** | Type | Description | Default Value | Required/Optional |
@@ -24,12 +26,12 @@ Description of the workflow tasks
Description of the task

!!! techdetails "Tool Name Technical Details"
| | Links |
| --- | --- |
| | Links |
| --- | --- |
| Task | [link to task on GitHub] |
| Software Source Code | [link to tool's source code] |
| Software Documentation | [link to tool's documentation] |
| Original Publication | [link to tool's publication] |
| Original Publication(s) | [link to tool's publication] |

### Outputs

400 changes: 243 additions & 157 deletions docs/contributing/code_contribution.md

Large diffs are not rendered by default.

114 changes: 67 additions & 47 deletions docs/contributing/doc_contribution.md
@@ -14,7 +14,7 @@ To test your documentation changes, you will need to have the following packages
pip install mkdocs-material mkdocs-material-extensions mkdocs-git-revision-date-localized-plugin mike mkdocs-glightbox
```

The live preview server can be activated by running the following command:
Once installed, navigate to the top directory in PHB. The live preview server can be activated by running the following command:

```bash
mkdocs serve
@@ -34,49 +34,7 @@ Here are some VSCode Extensions that can help you write and edit your markdown files

- [Excel to Markdown Table](https://tableconvert.com/excel-to-markdown) - This website will convert an Excel table into markdown format, which can be copied and pasted into your markdown file.
- [Material for MkDocs Reference](https://squidfunk.github.io/mkdocs-material/reference/) - This is the official reference for the Material for MkDocs theme, which will help you understand how to use the theme's features.
- [Broken Link Check](https://www.brokenlinkcheck.com/) - This website will scan your website to ensure that all links are working correctly. This will only work on the deployed version of the documentation, not the local version.

## Documentation Structure

A brief description of the documentation structure is as follows:

- `docs/` - Contains the Markdown files for the documentation.
- `assets/` - Contains images and other files used in the documentation.
- `figures/` - Contains images, figures, and workflow diagrams used in the documentation. For workflows that contain many images (such as BaseSpace_Fetch), it is recommended to create a subdirectory for the workflow.
- `files/` - Contains files that are used in the documentation. This may include example outputs or templates. For workflows that contain many files (such as TheiaValidate), it is recommended to create a subdirectory for the workflow.
- `logos/` - Contains Theiagen logos and symbols used int he documentation.
- `metadata_formatters/` - Contains the most up-to-date metadata formatters for our submission workflows.
- `new_workflow_template.md` - A template for adding a new workflow page to the documentation.
- `contributing/` - Contains the Markdown files for our contribution guides, such as this file
- `javascripts/` - Contains JavaScript files used in the documentation.
- `tablesort.js` - A JavaScript file used to enable table sorting in the documentation.
- `overrides/` - Contains HTMLs used to override theme defaults
- `main.html` - Contains the HTML used to display a warning when the latest version is not selected
- `stylesheets/` - Contains CSS files used in the documentation.
- `extra.css` - A custom CSS file used to style the documentation; contains all custom theme elements (scrollable tables, resizable columns, Theiagen colors), and custom admonitions.
- `workflows/` - Contains the Markdown files for each workflow, organized into subdirectories by workflow category
- `workflows_overview/` - Contains the Markdown files for the overview tables for each display type: alphabetically, by applicable kingdom, and by workflow type.
- `index.md` - The home/landing page for our documentation.

### Adding a Page for a New Workflow {#new-page}

If you are adding a new workflow, there are a number of things to do in order to include the page in the documentation:

1. Add a page with the title of the workflow to the appropriate subdirectory in `docs/workflows/`. Feel free to use the template found in the `assets/` folder.
2. Collect the following information for your new workflow:
- Workflow Name - Link the name with a relative path to the workflow page in appropriate `docs/workflows/` subdirectory
- Workflow Description - Brief description of the workflow
- Applicable Kingdom - Options: "Any taxa", "Bacteria", "Mycotics", "Viral"
- Workflow Level (_on Terra_) - Options: "Sample-level", "Set-level", or neither
- Command-line compatibility - Options: "Yes", "No", and/or "Some optional features incompatible"
- The version where the last known changes occurred (likely the upcoming version if it is a new workflow)
- Link to the workflow on Dockstore (if applicable) - Workflow name linked to the information tab on Dockstore.
3. Format this information in a table.
4. Copy the previously gathered information to ==**ALL THREE**== overview tables in `docs/workflows_overview/`:
- `workflows_alphabetically.md` - Add the workflow in the appropriate spot based on the workflow name.
- `workflows_kingdom.md` - Add the workflow in the appropriate spot(s) based on the kingdom(s) the workflow is applicable to. Make sure it is added alphabetically within the appropriate subsection(s).
- `workflows_type.md` - Add the workflow in the appropriate spot based on the workflow type. Make sure it is added alphabetically within the appropriate subsection.
5. Copy the path to the workflow to ==**ALL**== of the appropriate locations in the `mkdocs.yml` file (under the `nav:` section) in the main directory of this repository. These should be the exact same spots as in the overview tables but without additional information. This ensures the workflow can be accessed from the navigation sidebar.
- [Dead Link Check](https://www.deadlinkchecker.com/) - This website will scan your website to ensure that all links are working correctly. This will only work on the deployed version of the documentation, not the local version.

## Standard Language & Formatting Conventions

@@ -98,10 +56,11 @@ The following language conventions should be followed when writing documentation
- **Bold Text** - Use `**bold text**` to indicate text that should be bolded.
- _Italicized Text_ - Use `_italicized text_` to indicate text that should be italicized.
- ==Highlighted Text== - Use `==highlighted text==` to indicate text that should be highlighted.
- `Code` - Use \`code\` to indicate text that should be formatted as code.
- `Code` - Use `` `code` `` (backticks) to indicate text that should be formatted as code.
- ^^Underlined Text^^ - Use `^^underlined text^^` to indicate text that should be underlined (works with our theme; not all Markdown renderers support this).
- > Citations
- Use a `>` to activate quote formatting for a citation. Make sure to separate multiple citations with a comment line (`<!-- -->`) to prevent the citations from running together.
- Use a reputable citation style (e.g., Vancouver, Nature, etc.) for all citations.
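    For example, two adjacent citations separated by a comment line might look like this (citation text is illustrative):

    ```markdown
    > Author A, et al. First example citation. Example Journal. 2020.

    <!-- -->

    > Author B, et al. Second example citation. Example Journal. 2021.
    ```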
- Callouts/Admonitions - These features are called "call-outs" in Notion, but are "Admonitions" in MkDocs. [I highly recommend referring to the Material for MkDocs documentation page on Admonitions to learn how best to use this feature](https://squidfunk.github.io/mkdocs-material/reference/admonitions/). Use the following syntax to create a callout:

```markdown
@@ -116,26 +75,45 @@ The following language conventions should be followed when writing documentation
!!! dna
This is a DNA admonition. Admire the cute green DNA emoji. You can create this with the `!!! dna` syntax.

Use this admonition when wanting to convey general information or highlight specific facts.

???+ toggle
This is a toggle-able section. The emoji is an arrow pointing to the right downward. You can create this with the `??? toggle` syntax. I have added a `+` at the end of the question marks to make it open by default.

Use this admonition when wanting to provide additional _optional_ information or details that are not strictly necessary, or take up a lot of space.

???+ task
This is a toggle-able section **for a workflow task**. The emoji is a gear. Use the `??? task` syntax to create this admonition. Use `!!! task` if you want to have it be permanently expanded. I have added a `+` at the end of the question marks to make this admonition open by default and still enable its collapse.

Use this admonition when providing details on a workflow, task, or tool.

!!! caption
This is a caption. The emoji is a painting. You can create this with the `!!! caption` syntax. This is used to enclose an image in a box and looks nice. A caption can be added beneath the picture and will also look nice.
This is a caption. The emoji is a painting. You can create this with the `!!! caption` syntax. A caption can be added beneath the picture and will also look nice.

Use this admonition when including images or diagrams in the documentation.

!!! techdetails
This is where you will put technical details for a workflow task. You can create this with the `!!! techdetails` syntax.

Use this admonition when providing technical details for a workflow task or tool. These admonitions should include the following table:

| | Links |
| --- | --- |
| Task | [link to the task file in the PHB repository on GitHub] |
| Software Source Code | [link to tool's source code] |
| Software Documentation | [link to tool's documentation] |
| Original Publication(s) | [link to tool's publication] |

If any of these items are unfillable, delete the row.

- Images - Use the following syntax to insert an image:

```markdown
!!! caption "Image Title"
![Alt Text](/path/to/image.png)
```

- Indentation - **_FOUR_** spaces are required instead of the typical two. This is a side effect of using this theme. If you use two spaces, the list and/or indentations will not render correctly. This will make your linter sad :(
- Indentation - **_FOUR_** spaces are required instead of the typical two. This is a side effect of using this theme. If you use two spaces, the list and/or indentations will not render correctly. This will make your linter sad :(

```markdown
- first item
@@ -160,3 +138,45 @@ The following language conventions should be followed when writing documentation
```

- End all pages with an empty line

## Documentation Structure

A brief description of the documentation structure is as follows:

- `docs/` - Contains the Markdown files for the documentation.
- `assets/` - Contains images and other files used in the documentation.
- `figures/` - Contains images, figures, and workflow diagrams used in the documentation. For workflows that contain many images (such as BaseSpace_Fetch), it is recommended to create a subdirectory for the workflow.
- `files/` - Contains files that are used in the documentation. This may include example outputs or templates. For workflows that contain many files (such as TheiaValidate), it is recommended to create a subdirectory for the workflow.
- `logos/` - Contains Theiagen logos and symbols used in the documentation.
- `metadata_formatters/` - Contains the most up-to-date metadata formatters for our submission workflows.
- `new_workflow_template.md` - A template for adding a new workflow page to the documentation. You can see this template [here](../assets/new_workflow_template.md).
- `contributing/` - Contains the Markdown files for our contribution guides, such as this file
- `javascripts/` - Contains JavaScript files used in the documentation.
- `tablesort.js` - A JavaScript file used to enable table sorting in the documentation.
- `overrides/` - Contains HTMLs used to override theme defaults
- `main.html` - Contains the HTML used to display a warning when the latest version is not selected
- `stylesheets/` - Contains CSS files used in the documentation.
- `extra.css` - A custom CSS file used to style the documentation; contains all custom theme elements (scrollable tables, resizable columns, Theiagen colors), and custom admonitions.
- `workflows/` - Contains the Markdown files for each workflow, organized into subdirectories by workflow category
- `workflows_overview/` - Contains the Markdown files for the overview tables for each display type: alphabetically, by applicable kingdom, and by workflow type.
- `index.md` - The home/landing page for our documentation.

### Adding a Page for a New Workflow {#new-page}

If you are adding a new workflow, there are a number of things to do in order to include the page in the documentation:

1. Add a page with the title of the workflow to the appropriate subdirectory in `docs/workflows/`. Feel free to use the template found in the `assets/` folder.
2. Collect the following information for your new workflow:
- Workflow Name - Link the name with a relative path to the workflow page in appropriate `docs/workflows/` subdirectory
- Workflow Description - Brief description of the workflow
- Applicable Kingdom - Options: "Any taxa", "Bacteria", "Mycotics", "Viral"
- Workflow Level (_on Terra_) - Options: "Sample-level", "Set-level", or neither
- Command-line compatibility - Options: "Yes", "No", and/or "Some optional features incompatible"
- The version where the last known changes occurred (likely the upcoming version if it is a new workflow)
- Link to the workflow on Dockstore (if applicable) - Workflow name linked to the information tab on Dockstore.
3. Format this information in a table.
4. Copy the previously gathered information to ==**ALL THREE**== overview tables in `docs/workflows_overview/`:
- `workflows_alphabetically.md` - Add the workflow in the appropriate spot based on the workflow name.
- `workflows_kingdom.md` - Add the workflow in the appropriate spot(s) based on the kingdom(s) the workflow is applicable to. Make sure it is added alphabetically within the appropriate subsection(s).
- `workflows_type.md` - Add the workflow in the appropriate spot based on the workflow type. Make sure it is added alphabetically within the appropriate subsection.
5. Copy the path to the workflow to ==**ALL**== of the appropriate locations in the `mkdocs.yml` file (under the `nav:` section) in the main directory of this repository. These should be the exact same spots as in the overview tables but without additional information. This ensures the workflow can be accessed from the navigation sidebar.
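For illustration, a minimal sketch of what a `nav:` entry in `mkdocs.yml` might look like (the workflow name, grouping, and path are placeholders; match the nesting already present in the file):

```yaml
nav:
  - Workflows:
      - Genomic Characterization:
          - TheiaProk: workflows/genomic_characterization/theiaprok.md
```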
12 changes: 7 additions & 5 deletions docs/index.md
@@ -46,7 +46,7 @@ When undertaking genomic analysis using the command-line, via Terra, or other da
We continuously work to improve our codebase and usability of our workflows by the public health community, so changes from version to version are expected. This documentation page reflects the state of the workflow at the version stated in the title.

!!! dna "What's new?"
You can see the changes since PHB v2.2.0 [**here**](https://theiagen.notion.site/Public-Health-Bioinformatics-v2-2-1-Patch-Release-Notes-104cb013bc9380bcbd70dab04bf671a8?pvs=74)!
You can see the changes since PHB v2.2.1 [**here**](https://theiagen.notion.site/public-health-bioinformatics-v2-3-0-minor-release-notes?pvs=4)!

## Contributing to the PHB Repository

@@ -60,30 +60,32 @@ You can expect a careful review of every PR and feedback as needed before mergin

### Authorship

(Ordered by contribution [# of lines changed] as of 2024-08-01)
(Ordered by contribution [# of lines changed] as of 2024-12-04)

- **Sage Wright** ([@sage-wright](https://github.com/sage-wright)) - Conceptualization, Software, Validation, Supervision
- **Inês Mendes** ([@cimendes](https://github.com/cimendes)) - Software, Validation
- **Curtis Kapsak** ([@kapsakcj](https://github.com/kapsakcj)) - Conceptualization, Software, Validation
- **James Otieno** ([@jrotieno](https://github.com/jrotieno)) - Software, Validation
- **Frank Ambrosio** ([@frankambrosio3](https://github.com/frankambrosio3)) - Conceptualization, Software, Validation
- **Michelle Scribner** ([@michellescribner](https://github.com/michellescribner)) - Software, Validation
- **Kevin Libuit** ([@kevinlibuit](https://github.com/kevinlibuit)) - Conceptualization, Project Administration, Software, Validation, Supervision
- **Emma Doughty** ([@emmadoughty](https://github.com/emmadoughty)) - Software, Validation
- **Fraser Combe** ([@fraser-combe](https://github.com/fraser-combe)) - Software, Validation
- **Andrew Page** ([@andrewjpage](https://github.com/andrewjpage)) - Project Administration, Software, Supervision
- **Michal Babinski** ([@Michal-Babins](https://github.com/Michal-Babins)) - Software, Validation
- **Andrew Lang** ([@AndrewLangVt](https://github.com/AndrewLangVt)) - Software, Supervision
- **Kelsey Kropp** ([@kelseykropp](https://github.com/kelseykropp)) - Validation
- **Emily Smith** ([@emily-smith1](https://github.com/emily-smith1)) - Validation
- **Joel Sevinsky** ([@sevinsky](https://github.com/sevinsky)) - Conceptualization, Project Administration, Supervision

### External Contributors

We would like to gratefully acknowledge the following individuals from the public health community for their contributions to the PHB repository:

- **James Otieno** ([@jrotieno](https://github.com/jrotieno))
- **Robert Petit** ([@rpetit3](https://github.com/rpetit3))
- **Emma Doughty** ([@emmadoughty](https://github.com/emmadoughty))
- **Ash O'Farrell** ([@aofarrel](https://github.com/aofarrel))
- **Sam Baird** ([@sam-baird](https://github.com/sam-baird))
- **Holly Halstead** ([@HNHalstead](https://github.com/HNHalstead))
- **Emily Smith** ([@emily-smith1](https://github.com/emily-smith1))

### On the Shoulder of Giants

64 changes: 64 additions & 0 deletions docs/javascripts/table-search.js
@@ -0,0 +1,64 @@
function addTableSearch() {
// Select all containers with the class 'searchable-table'
const containers = document.querySelectorAll('.searchable-table');

containers.forEach((container) => {
// Find the table within this container
const table = container.querySelector('table');

if (table) {
// Ensure we don't add multiple search boxes
if (!container.querySelector('input[type="search"]')) {
// Create the search input element
const searchInput = document.createElement("input");
searchInput.setAttribute("type", "search");
searchInput.setAttribute("placeholder", "Search table...");
searchInput.classList.add('table-search-input');
searchInput.style.marginBottom = "10px";
searchInput.style.display = "block";

// Insert the search input before the table
container.insertBefore(searchInput, container.firstChild);

// Add event listener for table search
searchInput.addEventListener("input", function () {
const filter = searchInput.value.toUpperCase();
const rows = table.getElementsByTagName("tr");

for (let i = 1; i < rows.length; i++) { // Skip header row
const cells = rows[i].getElementsByTagName("td");
let match = false;

for (let j = 0; j < cells.length; j++) {
if (cells[j].innerText.toUpperCase().includes(filter)) {
match = true;
break;
}
}

rows[i].style.display = match ? "" : "none";
}
});
}
} else {
console.log('Table not found within container.');
}
});
}

// Run on page load
addTableSearch();

// Reapply search bar on page change
function observeDOMChanges() {
const targetNode = document.querySelector('body');
const config = { childList: true, subtree: true };

const observer = new MutationObserver(() => {
addTableSearch();
});

observer.observe(targetNode, config);
}

observeDOMChanges();
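This script pairs with the `searchable-table` wrapper class added to the workflow documentation pages in this commit. A minimal sketch of the markdown side (the table row is illustrative):

```markdown
<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** |
|---|---|---|
| example_task | **example_variable** | String |

</div>
```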
5 changes: 0 additions & 5 deletions docs/overrides/main.html
@@ -6,8 +6,3 @@
<strong>Click here to go to the latest version release.</strong>
</a>
{% endblock %}


{% block announce %}
<center>🏗️ I'm under construction! Pardon the dust while we remodel! 👷</center>
{% endblock %}
29 changes: 29 additions & 0 deletions docs/stylesheets/extra.css
@@ -184,5 +184,34 @@ th {
td {
word-break: break-all;
}
/* Base styles for the search box */
div.searchable-table input.table-search-input {
width: 25%;
padding: 10px;
margin-bottom: 12px;
font-size: 12px;
box-sizing: border-box;
border-radius: 2px;
}

/* Light mode styles */
[data-md-color-scheme="light"] div.searchable-table input.table-search-input {
background-color: #fff;
color: #000;
border: 1px solid #E0E1E1;
}
[data-md-color-scheme="light"] div.searchable-table input.table-search-input::placeholder {
color: #888;
font-style: italic;
}

/* Dark mode styles */
[data-md-color-scheme="slate"] div.searchable-table input.table-search-input {
background-color: #1d2125;
color: #fff;
border: 1px solid #373B40;
}
[data-md-color-scheme="slate"] div.searchable-table input.table-search-input::placeholder {
color: #bbb;
font-style: italic;
}
4 changes: 4 additions & 0 deletions docs/workflows/data_export/concatenate_column_content.md
@@ -16,6 +16,8 @@ This set-level workflow will create a file containing all of the items from a gi

This workflow runs on the set level.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| concatenate_column_content | **concatenated_file_name** | String | The name of the output file. ***Include the extension***, such as ".fasta" or ".txt". | | Required |
@@ -28,6 +30,8 @@ This workflow runs on the set level.
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Outputs

!!! info
4 changes: 4 additions & 0 deletions docs/workflows/data_export/transfer_column_content.md
@@ -25,6 +25,8 @@ This set-level workflow will transfer all of the items from a given column in a

This workflow runs on the set level.

<div class="searchable-table" markdown="1">

| **Terra Task name** | **input_variable** | **Type** | **Description** | **Default attribute** | **Status** |
|---|---|---|---|---|---|
| transfer_column_content | **files_to_transfer** | Array[File] | The column that has the files you want to concatenate. | | Required |
@@ -36,6 +38,8 @@ This workflow runs on the set level.
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Outputs

!!! info
4 changes: 4 additions & 0 deletions docs/workflows/data_export/zip_column_content.md
@@ -16,6 +16,8 @@ This workflow will create a zip file that contains all of the items in a column

This workflow runs on the set level.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| zip_column_content | **files_to_zip** | Array[File] | The column that has the files you want to zip. | | Required |
@@ -27,6 +29,8 @@ This workflow runs on the set level.
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Outputs

!!! info
10 changes: 9 additions & 1 deletion docs/workflows/data_import/assembly_fetch.md
@@ -23,6 +23,8 @@ Assembly_Fetch requires the input samplename, and either the accession for a ref

This workflow runs on the sample level.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| reference_fetch | **samplename** | String | Your sample's name | | Required |
@@ -44,6 +46,8 @@ This workflow runs on the sample level.
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Analysis Tasks

??? task "ReferenceSeeker (optional) Details"
@@ -90,6 +94,8 @@ This workflow runs on the sample level.

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| assembly_fetch_analysis_date | String | Date of assembly download |
@@ -101,11 +107,13 @@ This workflow runs on the sample level.
| assembly_fetch_ncbi_datasets_version | String | NCBI datasets version used |
| assembly_fetch_referenceseeker_database | String | ReferenceSeeker database used |
| assembly_fetch_referenceseeker_docker | String | Docker file used for ReferenceSeeker |
| assembly_fetch_referenceseeker_top_hit_ncbi_accession | String | NCBI Accession for the top it identified by Assembly_Fetch |
| assembly_fetch_referenceseeker_top_hit_ncbi_accession | String | NCBI Accession for the top hit identified by Assembly_Fetch |
| assembly_fetch_referenceseeker_tsv | File | TSV file of the top hits between the query genome and the Reference Seeker database |
| assembly_fetch_referenceseeker_version | String | ReferenceSeeker version used |
| assembly_fetch_version | String | The version of the repository the Assembly Fetch workflow is in |

</div>

## References

> **ReferenceSeeker:** Schwengers O, Hain T, Chakraborty T, Goesmann A. ReferenceSeeker: rapid determination of appropriate reference genomes. J Open Source Softw. 2020 Feb 4;5(46):1994.
4 changes: 4 additions & 0 deletions docs/workflows/data_import/basespace_fetch.md
@@ -153,6 +153,8 @@ This process must be performed on a command-line (ideally on a Linux or MacOS co

This workflow runs on the sample level.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| basespace_fetch | **access_token** | String | The access token is used in place of a username and password to allow the workflow to access the user account in BaseSpace from which the data is to be transferred. It is an alphanumeric string that is 32 characters in length. Example: 9e08a96471df44579b72abf277e113b7 | | Required |
@@ -168,6 +170,8 @@ This workflow runs on the sample level.
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### **Outputs**

The outputs of this workflow will be the fastq files imported from BaseSpace into the data table where the sample ID information had originally been uploaded.
4 changes: 4 additions & 0 deletions docs/workflows/data_import/create_terra_table.md
@@ -19,6 +19,8 @@ The manual creation of Terra tables can be tedious and error-prone. This workflo

**_This can be changed_** by providing information in the `file_ending` optional input parameter. See below for more information.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| create_terra_table | **assembly_data** | Boolean | Set to true if your data is in FASTA format; set to false if your data is FASTQ format | | Required |
@@ -33,6 +35,8 @@ The manual creation of Terra tables can be tedious and error-prone. This workflo
| make_table | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-06-21" | Optional |
| make_table | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |

</div>

### Finding the `data_location_path`

#### Using the Terra data uploader
8 changes: 8 additions & 0 deletions docs/workflows/data_import/sra_fetch.md
@@ -16,6 +16,8 @@ Read files associated with the SRA run accession provided as input are copied to

This workflow runs on the sample level.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| fetch_sra_to_fastq | **sra_accession** | String | SRA, ENA, or DRA accession number | | Required |
@@ -25,6 +27,8 @@ This workflow runs on the sample level.
| fetch_sra_to_fastq | **fastq_dl_options** | String | Additional parameters to pass to fastq_dl from [here](https://github.com/rpetit3/fastq-dl?tab=readme-ov-file#usage) | "--provider sra" | Optional |
| fetch_sra_to_fastq | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |

</div>

The only required input for the SRA_Fetch workflow is an SRA run accession beginning with "SRR", an ENA run accession beginning with "ERR", or a DRA run accession beginning with "DRR".

Please see the [NCBI Metadata and Submission Overview](https://www.ncbi.nlm.nih.gov/sra/docs/submitmeta/) for assistance with identifying accessions. Briefly, NCBI-accessioned objects have the following naming scheme:
@@ -41,6 +45,8 @@ Read data are available either with full base quality scores (**SRA Normalized F

Given the limited usefulness of SRA Lite-formatted FASTQ files, we try to avoid them by selecting SRA directly as the download provider (the SRA-Lite copy is more likely to be the file synced to other repositories), but sometimes downloading these files is unavoidable. To make the user aware of this, a warning column is populated whenever an SRA-Lite file is detected.

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** | **Production Status** |
|---|---|---|---|
| read1 | File | File containing the forward reads | Always produced |
@@ -51,6 +57,8 @@ Given the lack of usefulness of SRA Lite formatted FASTQ files, we try to avoid
| fastq_dl_version | String | Fastq_dl version used | Always produced |
| fastq_dl_warning | String | This warning field is populated if SRA-Lite files are detected. These files contain all quality encoding as Phred-30 or Phred-3. | Depends on internal workflow logic |

</div>

## References

> This workflow relies on [fastq-dl](https://github.com/rpetit3/fastq-dl), a very handy bioinformatics tool by Robert A. Petit III
76 changes: 55 additions & 21 deletions docs/workflows/genomic_characterization/freyja.md

Large diffs are not rendered by default.

12 changes: 12 additions & 0 deletions docs/workflows/genomic_characterization/pangolin_update.md
@@ -14,6 +14,8 @@ The Pangolin_Update workflow re-runs Pangolin updating prior lineage calls from

This workflow runs on the sample level.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| pangolin_update | **assembly_fasta** | File | SARS-CoV-2 assembly file in FASTA format | | Required |
@@ -42,8 +44,12 @@ This workflow runs on the sample level.
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| **pango_lineage** | String | Pango lineage as determined by Pangolin |
@@ -58,3 +64,9 @@ This workflow runs on the sample level.
| **pangolin_update_version** | String | Version of the Public Health Bioinformatics (PHB) repository used |
| **pangolin_updates** | String | Result of Pangolin Update (lineage changed versus unchanged) with lineage assignment and date of analysis |
| **pangolin_versions** | String | All Pangolin software and database versions |

</div>

## References

> **Pangolin**: Rambaut A, Holmes EC, O'Toole Á, Hill V, McCrone JT, Ruis C, du Plessis L, Pybus OG. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020 Nov;5(11):1403-1407. doi: 10.1038/s41564-020-0770-5. Epub 2020 Jul 15. PMID: 32669681; PMCID: PMC7610519.
174 changes: 132 additions & 42 deletions docs/workflows/genomic_characterization/theiacov.md

Large diffs are not rendered by default.

289 changes: 220 additions & 69 deletions docs/workflows/genomic_characterization/theiaeuk.md

Large diffs are not rendered by default.

80 changes: 66 additions & 14 deletions docs/workflows/genomic_characterization/theiameta.md

Large diffs are not rendered by default.

193 changes: 148 additions & 45 deletions docs/workflows/genomic_characterization/theiaprok.md

Large diffs are not rendered by default.

6 changes: 5 additions & 1 deletion docs/workflows/genomic_characterization/vadr_update.md
@@ -5,7 +5,7 @@

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Genomic Characterization](../../workflows_overview/workflows_type.md/#genomic-characterization) | [Viral](../../workflows_overview/workflows_kingdom.md/#viral) | PHB v1.2.1 | Yes | Sample-level |
| [Genomic Characterization](../../workflows_overview/workflows_type.md/#genomic-characterization) | [Viral](../../workflows_overview/workflows_kingdom.md/#viral) | PHB v2.2.0 | Yes | Sample-level |

## Vadr_Update_PHB

@@ -29,6 +29,8 @@ Please note the default values are for SARS-CoV-2.

This workflow runs on the sample level.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| vadr_update | **assembly_length_unambiguous** | Int | Number of unambiguous basecalls within the consensus assembly | | Required |
@@ -44,6 +46,8 @@ This workflow runs on the sample level.
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Outputs

| **Variable** | **Type** | **Description** |
36 changes: 27 additions & 9 deletions docs/workflows/phylogenetic_construction/augur.md
@@ -4,7 +4,7 @@

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Phylogenetic Construction](../../workflows_overview/workflows_type.md/#phylogenetic-construction) | [Viral](../../workflows_overview/workflows_kingdom.md/#viral) | PHB v2.1.0 | Yes | Sample-level, Set-level |
| [Phylogenetic Construction](../../workflows_overview/workflows_type.md/#phylogenetic-construction) | [Viral](../../workflows_overview/workflows_kingdom.md/#viral) | PHB v2.3.0 | Yes | Sample-level, Set-level |

## Augur Workflows

@@ -14,10 +14,10 @@ Two workflows are offered: **Augur_Prep_PHB** and **Augur_PHB**. These must be r

!!! dna "**Helpful resources for epidemiological interpretation**"

- [introduction to Nextstrain](https://www.cdc.gov/amd/training/covid-toolkit/module3-1.html) (which includes Auspice)
- guide to Nextstrain [interactive trees](https://www.cdc.gov/amd/training/covid-toolkit/module3-4.html)
- an [introduction to UShER](https://www.cdc.gov/amd/training/covid-toolkit/module3-3.html)
- a video about [how to read trees](https://www.cdc.gov/amd/training/covid-toolkit/module1-3.html) if this is new to you
- [introduction to Nextstrain](https://www.cdc.gov/advanced-molecular-detection/php/training/module-3-1.html) (which includes Auspice)
- guide to Nextstrain [interactive trees](https://www.cdc.gov/advanced-molecular-detection/php/training/module-3-4.html)
- an [introduction to UShER](https://www.cdc.gov/advanced-molecular-detection/php/training/module-3-3.html)
- a video about [how to read trees](https://www.cdc.gov/advanced-molecular-detection/php/training/module-1-3.html) if this is new to you
- documentation on [how to identify SARS-CoV-2 recombinants](https://github.com/pha4ge/pipeline-resources/blob/main/docs/sc2-recombinants.md)

### Augur_Prep_PHB
@@ -30,6 +30,8 @@ The Augur_Prep_PHB workflow takes assembly FASTA files and associated metadata f

This workflow runs on the sample level.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| augur_prep | **assembly** | File | Assembly/consensus file (single FASTA file per sample) | | Required |
@@ -48,6 +50,8 @@ This workflow runs on the sample level.
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

#### Augur_Prep Outputs

| **Variable** | **Type** | **Description** |
@@ -70,7 +74,7 @@ The Augur_PHB workflow takes in a ***set*** of SARS-CoV-2 (or any other viral
!!! dna "Optional Inputs"
There are **many** optional user inputs. For SARS-CoV-2, Flu, rsv-a, rsv-b, and mpxv, default values that mimic the NextStrain builds have been preselected. To use these defaults, you must write either `"sars-cov-2"`,`"flu"`, `"rsv-a"`, `"rsv-b"`, or `"mpxv"` for the `organism` variable.

For Flu - it is **required** to set `flu_segment` to either `"HA"` or `"NA"` & `flu_subtype` to either `"H1N1"` or `"H3N2"` or `"Victoria"` or `"Yamagata"` depending on your set of samples.
For Flu - it is **required** to set `flu_segment` to either `"HA"` or `"NA"` & `flu_subtype` to either `"H1N1"` or `"H3N2"` or `"Victoria"` or `"Yamagata"` or `"H5N1"` (`"H5N1"` will only work with `"HA"`) depending on your set of samples.

???+ toggle "A Note on Optional Inputs"
??? toggle "Default values for SARS-CoV-2"
@@ -121,6 +125,11 @@ The Augur_PHB workflow takes in a ***set*** of SARS-CoV-2 (or any other viral
- clades_tsv = `"gs://theiagen-public-files-rp/terra/flu-references/clades_yam_ha.tsv"`
- NA
- reference_fasta = `"gs://theiagen-public-files-rp/terra/flu-references/reference_yam_na.gb"`
??? toggle "H5N1"
- auspice_config = `"gs://theiagen-public-files-rp/terra/flu-references/auspice_config_h5n1.json"`
- HA
- reference_fasta = `"gs://theiagen-public-files-rp/terra/flu-references/reference_h5n1_ha.gb"`
- clades_tsv = `"gs://theiagen-public-files-rp/terra/flu-references/h5nx-clades.tsv"`

??? toggle "Default values for MPXV"
- min_num_unambig = 150000
@@ -165,6 +174,8 @@ The Augur_PHB workflow takes in a ***set*** of SARS-CoV-2 (or any other viral

This workflow runs on the set level. Please note that for every task, runtime parameters are modifiable (cpu, disk_size, docker, and memory); most of these values have been excluded from the table below for convenience.

<div class="searchable-table" markdown="1" width=100vw>

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| augur | **assembly_fastas** | Array[File] | An array of the assembly files to use; use either the HA or NA segment for flu samples | | Required |
@@ -173,7 +184,7 @@ This workflow runs on the set level. Please note that for every task, runtime pa
| augur | **clades_tsv** | File | TSV file containing clade mutation positions in four columns | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: <https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl>. For an organism without set defaults, an empty clades file is provided to prevent workflow failure, "gs://theiagen-public-files-rp/terra/augur-defaults/minimal-clades.tsv", but will not be as useful as an organism specific clades file. | Optional, Required |
| augur | **distance_tree_only** | Boolean | Create only a distance tree (skips all Augur steps after augur_tree) | TRUE | Optional |
| augur | **flu_segment** | String | Required if organism = "flu". The name of the segment to be analyzed; options: "HA" or "NA" | "HA" (only used if organism = "flu") | Optional, Required |
| augur | **flu_subtype** | String | Required if organism = "flu". The subtype of the flu samples being analyzed; options: "H1N1", "H3N2", "Victoria", "Yamagata" | | Optional, Required |
| augur | **flu_subtype** | String | Required if organism = "flu". The subtype of the flu samples being analyzed; options: "H1N1", "H3N2", "Victoria", "Yamagata", "H5N1" | | Optional, Required |
| augur | **lat_longs_tsv** | File | Tab-delimited file of geographic location names with corresponding latitude and longitude values | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: <https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl>. For an organism without set defaults, a minimal lat-long file is provided to prevent workflow failure, "gs://theiagen-public-files-rp/terra/augur-defaults/minimal-lat-longs.tsv", but will not be as useful as a detailed lat-longs file covering all the locations for the samples to be visualized. | Optional |
| augur | **min_date** | Float | Minimum date to begin filtering or frequencies calculations | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: <https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl>. For an organism without set defaults, the default value is 0.0 | Optional |
| augur | **min_num_unambig** | Int | Minimum number of called bases in genome to pass prefilter | Defaults are organism-specific. Please find default values for all organisms (and for Flu - their respective genome segments and subtypes) here: <https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl>. For an organism without set defaults, the default value is 0 | Optional |
@@ -187,7 +198,7 @@ This workflow runs on the set level. Please note that for every task, runtime pa
| augur_ancestral | **inference** | String | Calculate joint or marginal maximum likelihood ancestral sequence states; options: "joint", "marginal" | joint | Optional |
| augur_ancestral | **keep_ambiguous** | Boolean | If true, do not infer nucleotides at ambiguous (N) sides | FALSE | Optional |
| augur_ancestral | **keep_overhangs** | Boolean | If true, do not infer nucleotides for gaps on either side of the alignment | FALSE | Optional |
| augur_export | **colors_tsv** | File | Custom color definitions, one per line in the format TRAIT_TYPE \| TRAIT_VALUE\tHEX_CODE | | Optional |
| augur_export | **colors_tsv** | File | Custom color definitions, one per line in TSV format with the following fields: TRAIT_TYPE TRAIT_VALUE HEX_CODE | | Optional |
| augur_export | **description_md** | File | Markdown file with description of build and/or acknowledgements | | Optional |
| augur_export | **include_root_sequence** | Boolean | Export an additional JSON containing the root sequence used to identify mutations | FALSE | Optional |
| augur_export | **title** | String | Title to be displayed by Auspice | | Optional |
@@ -209,7 +220,7 @@ This workflow runs on the set level. Please note that for every task, runtime pa
| augur_tree | **exclude_sites** | File | File of one-based sites to exclude for raw tree building (BED format in .bed files, DRM format in tab-delimited files, or one position per line) | | Optional |
| augur_tree | **method** | String | Which method to use to build the tree; options: "fasttree", "raxml", "iqtree" | iqtree | Optional |
| augur_tree | **override_default_args** | Boolean | If true, override default tree builder arguments instead of augmenting them | FALSE | Optional |
| augur_tree | **substitution_model** | String | The substitution model to use; only available for iqtree. Specify "auto" to run ModelTest; options: "GTR" | GTR | Optional |
| augur_tree | **substitution_model** | String | The substitution model to use; only available for iqtree. Specify "auto" to run ModelTest; model options can be found [here](http://www.iqtree.org/doc/Substitution-Models) | GTR | Optional |
| augur_tree | **tree_builder_args** | String | Additional tree builder arguments either augmenting or overriding the default arguments. FastTree defaults: "-nt -nosupport". RAxML defaults: "-f d -m GTRCAT -c 25 -p 235813". IQ-TREE defaults: "-ninit 2 -n 2 -me 0.05 -nt AUTO -redo" | | Optional |
| sc2_defaults | **nextstrain_ncov_repo_commit** | String | The version of the <https://github.com/nextstrain/ncov/> from which to draw default values for SARS-CoV-2. | `23d1243127e8838a61b7e5c1a72bc419bf8c5a0d` | Optional |
| organism_parameters | **gene_locations_bed_file** | File | Use to provide locations of interest where average coverage will be calculated | Defaults are organism-specific. Please find default values for some organisms here: <https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_organism_parameters.wdl>. For an organism without set defaults, an empty file is provided, "gs://theiagen-public-files/terra/theiacov-files/empty.bed", but will not be as useful as an organism specific gene locations bed file. | Optional |
@@ -230,6 +241,8 @@ This workflow runs on the set level. Please note that for every task, runtime pa
| mutation_context | **docker** | String | Docker image used for the mutation_context task that is specific to Mpox. Do not modify. | us-docker.pkg.dev/general-theiagen/theiagen/nextstrain-mpox-mutation-context:2024-06-27 | Do Not Modify, Optional |
| mutation_context | **memory** | Int | Memory size in GB requested for the mutation_context task that is specific to Mpox. | 4 | Optional |

</div>

??? task "Workflow Tasks"
##### Augur Workflow Tasks {#augur-tasks}

@@ -271,8 +284,13 @@ The Nextstrain team hosts documentation surrounding the Augur workflow → Auspi
| **Variable** | **Type** | **Description** |
| --- | --- | --- |
| aligned_fastas | File | A FASTA file of the aligned genomes |
| augur_fasttree_version | String | The fasttree version used, blank if other tree method used |
| augur_iqtree_model_used | String | The iqtree model used during augur tree, blank if iqtree not used |
| augur_iqtree_version | String | The iqtree version used during augur tree (default), blank if other tree method used |
| augur_mafft_version | String | The mafft version used in augur align |
| augur_phb_analysis_date | String | The date the analysis was run |
| augur_phb_version | String | The version of the Public Health Bioinformatics (PHB) repository used |
| augur_raxml_version | String | The version of raxml used during augur tree, blank if other tree method used |
| augur_version | String | Version of Augur used |
| auspice_input_json | File | JSON file used as input to Auspice |
| combined_assemblies | File | Concatenated FASTA file containing all samples |
8 changes: 8 additions & 0 deletions docs/workflows/phylogenetic_construction/core_gene_snp.md
@@ -22,6 +22,8 @@ For further detail regarding Pirate options, please see [PIRATE's documentation]

This workflow runs on the set level.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| core_gene_snp_workflow | **cluster_name** | String | Name of sample set | | Required |
@@ -84,6 +86,8 @@ This workflow runs on the set level.
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Workflow Tasks

By default, the Core_Gene_SNP workflow will begin by analyzing the input sample set using [PIRATE](https://github.com/SionBayliss/PIRATE). Pirate takes in GFF3 files and classifies the genes into gene families by sequence identity, outputting a pangenome summary file. The workflow will instruct Pirate to create core gene and pangenome alignments using this gene family data. Setting the "align" input variable to false will turn off this behavior, and the workflow will output only the pangenome summary. The workflow will then use the core gene alignment from `Pirate` to infer a phylogenetic tree using `IQ-TREE`. It will also produce an SNP distance matrix from this alignment using [snp-dists](https://github.com/tseemann/snp-dists). This behavior can be turned off by setting the `core_tree` input variable to false. The workflow will not create a pangenome tree or SNP-matrix by default, but this behavior can be turned on by setting the `pan_tree` input variable to true.
@@ -98,6 +102,8 @@ By default, this task appends a Phandango coloring tag to color all items from t

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| core_gene_snp_wf_analysis_date | String | Date of analysis using Core_Gene_SNP workflow |
@@ -118,6 +124,8 @@ By default, this task appends a Phandango coloring tag to color all items from t
| pirate_snp_dists_version | String | Version of snp-dists used |
| pirate_summarized_data | File | The presence/absence matrix generated by the summarize_data task from the list of columns provided |

</div>

## References

>Sion C Bayliss, Harry A Thorpe, Nicola M Coyle, Samuel K Sheppard, Edward J Feil, PIRATE: A fast and scalable pangenomics toolbox for clustering diverged orthologues in bacteria, *GigaScience*, Volume 8, Issue 10, October 2019, giz119, <https://doi.org/10.1093/gigascience/giz119>
6 changes: 5 additions & 1 deletion docs/workflows/phylogenetic_construction/czgenepi_prep.md
@@ -18,13 +18,15 @@ Variables with both the "Optional" and "Required" tag require the column (regard

This workflow runs on the set level.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| czgenepi_prep | **sample_names** | Array[String] | The array of sample ids you want to prepare for CZ GEN EPI | | Required |
| czgenepi_prep | **terra_table_name** | String | The name of the Terra table where the data is hosted | | Required |
| czgenepi_prep | **terra_project_name** | String | The name of the Terra project where the data is hosted | | Required |
| czgenepi_prep | **terra_workspace_name** | String | The name of the Terra workspace where the data is hosted | | Required |
| download_terra_table | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 10 | Optional |
| download_terra_table | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| download_terra_table | **docker** | String | The Docker container to use for the task | quay.io/theiagen/terra-tools:2023-06-21 | Optional |
| download_terra_table | **disk_size** | String | The size of the disk used when running this task | 1 | Optional |
| download_terra_table | **cpu** | Int | Number of CPUs to allocate to the task | 1 | Optional |
@@ -46,6 +48,8 @@ This workflow runs on the set level.
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Outputs

The concatenated_czgenepi_fasta and concatenated_czgenepi_metadata files can be uploaded directly to CZ GEN EPI without any adjustments.
@@ -20,6 +20,8 @@ The primary intended input of the workflow is the `snippy_variants_results` outp

All variant data included in the sample set should be generated from aligning sequencing reads to the **same reference genome**. If variant data was generated using different reference genomes, shared variants cannot be identified and results will be less useful.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
| --- | --- | --- | --- | --- | --- |
| shared_variants_wf | **concatenated_file_name** | String | String of your choice to prefix output files | | Required |
@@ -33,6 +35,8 @@ All variant data included in the sample set should be generated from aligning se
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Tasks

??? task "Concatenate Variants"
8 changes: 8 additions & 0 deletions docs/workflows/phylogenetic_construction/ksnp3.md
@@ -19,6 +19,8 @@ You can learn more about the kSNP3 workflow, including how to visualize the outp

### Inputs

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| ksnp3_workflow | **assembly_fasta** | Array[File] | The assembly files to be analyzed | | Required |
@@ -62,6 +64,8 @@ You can learn more about the kSNP3 workflow, including how to visualize the outp
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Workflow Actions

The `ksnp3` workflow is run on the set of assembly files to produce both pan-genome and core-genome phylogenies. This also results in alignment files, which are used by [`snp-dists`](https://github.com/tseemann/snp-dists) to produce a pairwise SNP distance matrix for both the pan-genome and core genomes.
@@ -86,6 +90,8 @@ If you fill out the `data_summary_*` and `sample_names` optional variables, you

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| ksnp3_core_snp_matrix | File | The SNP matrix made with the core genome; formatted for Phandango if `phandango_coloring` input is `true` |
@@ -109,6 +115,8 @@ If you fill out the `data_summary_*` and `sample_names` optional variables, you
| ksnp3_wf_analysis_date | String | The date the workflow was run |
| ksnp3_wf_version | String | The version of the repository the workflow is hosted in |

</div>

## References

>Shea N Gardner, Tom Slezak, Barry G. Hall, kSNP3.0: SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome, *Bioinformatics*, Volume 31, Issue 17, 1 September 2015, Pages 2877–2878, <https://doi.org/10.1093/bioinformatics/btv271>
4 changes: 4 additions & 0 deletions docs/workflows/phylogenetic_construction/lyve_set.md
@@ -17,6 +17,8 @@ The Lyve_SET WDL workflow runs the [Lyve-SET](https://github.com/lskatz/lyve-SET

### Inputs

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| lyveset_workflow | **dataset_name** | String | Free text string used to label output files | | Required |
@@ -45,6 +47,8 @@ The Lyve_SET WDL workflow runs the [Lyve-SET](https://github.com/lskatz/lyve-SET
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Workflow Actions

The Lyve_SET WDL workflow is run using read data from a set of samples. The workflow will produce a pairwise SNP matrix for the sample set and a maximum likelihood phylogenetic tree. Details regarding the default implementation of Lyve_SET and optional modifications are listed below.
8 changes: 8 additions & 0 deletions docs/workflows/phylogenetic_construction/mashtree_fasta.md
@@ -16,6 +16,8 @@ This workflow also features an optional module, `summarize_data`, that creates a

### Inputs

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| mashtree_fasta | **assembly_fasta** | Array[File] | The set of assembly fastas | | Required |
@@ -49,6 +51,8 @@ This workflow also features an optional module, `summarize_data`, that creates a
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Workflow Actions

`MashTree_Fasta` is run on a set of assembly fastas and creates a phylogenetic tree and matrix. These outputs are passed to a task that will rearrange the matrix to match the order of the terminal ends in the phylogenetic tree.
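
That reordering step is straightforward to reproduce outside of Terra. Below is a minimal sketch, assuming a Newick tree and a tab-delimited matrix whose first column holds the sample names; the file names are hypothetical:

```python
# Reorder a distance matrix so rows/columns follow the tree's tip order.
# Requires biopython and pandas; file names here are hypothetical.
import pandas as pd
from Bio import Phylo

tree = Phylo.read("mashtree.nwk", "newick")
tip_order = [tip.name for tip in tree.get_terminals()]

matrix = pd.read_csv("mashtree_matrix.tsv", sep="\t", index_col=0)
reordered = matrix.loc[tip_order, tip_order]  # same sample order as the tree
reordered.to_csv("mashtree_matrix_reordered.tsv", sep="\t")
```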
@@ -63,6 +67,8 @@ By default, this task appends a Phandango coloring tag to color all items from t

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| mashtree_docker | String | The Docker image used to run the mashtree task |
@@ -74,6 +80,8 @@ By default, this task appends a Phandango coloring tag to color all items from t
| mashtree_wf_analysis_date | String | The date the workflow was run |
| mashtree_wf_version | String | The version of PHB the workflow is hosted in |

</div>

## References

> Katz, L. S., Griswold, T., Morrison, S., Caravas, J., Zhang, S., den Bakker, H.C., Deng, X., and Carleton, H. A., (2019). Mashtree: a rapid comparison of whole genome sequence files. Journal of Open Source Software, 4(44), 1762, <https://doi.org/10.21105/joss.01762>
41 changes: 40 additions & 1 deletion docs/workflows/phylogenetic_construction/snippy_streamline.md
@@ -4,7 +4,7 @@

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Phylogenetic Construction](../../workflows_overview/workflows_type.md/#phylogenetic-construction) | [Bacteria](../../workflows_overview/workflows_kingdom.md/#bacteria) | PHB v2.2.0 | Yes; some optional features incompatible | Set-level |
| [Phylogenetic Construction](../../workflows_overview/workflows_type.md/#phylogenetic-construction) | [Bacteria](../../workflows_overview/workflows_kingdom.md/#bacteria) | PHB v2.3.0 | Yes; some optional features incompatible | Set-level |

## Snippy_Streamline_PHB

@@ -65,6 +65,8 @@ To run Snippy_Streamline, either a reference genome must be provided (`reference
- Using the core genome
- `core_genome` = true (as default)

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| snippy_streamline | **read1** | Array[File] | The forward read files | | Required |
@@ -133,6 +135,8 @@ To run Snippy_Streamline, either a reference genome must be provided (`reference
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Workflow Tasks

For automatic reference selection by the workflow (optional):
@@ -169,6 +173,36 @@ For all cases:

`Snippy_Variants` aligns reads for each sample against the reference genome. As part of `Snippy_Streamline`, the only output from this workflow is the `snippy_variants_outdir_tarball` which is provided in the set-level data table. Please see the full documentation for [Snippy_Variants](./snippy_variants.md) for more information.

This task also extracts QC metrics from the Snippy output for each sample and saves them in per-sample TSV files (`snippy_variants_qc_metrics`). These per-sample QC metrics include the following columns:

- **samplename**: The name of the sample.
- **reads_aligned_to_reference**: The number of reads that aligned to the reference genome.
- **total_reads**: The total number of reads in the sample.
- **percent_reads_aligned**: The percentage of reads that aligned to the reference genome.
- **variants_total**: The total number of variants detected between the sample and the reference genome.
- **percent_ref_coverage**: The percentage of the reference genome covered by reads with a depth greater than or equal to the `min_coverage` threshold (default is 10).
- **#rname**: Reference sequence name (e.g., chromosome or contig name).
- **startpos**: Starting position of the reference sequence.
- **endpos**: Ending position of the reference sequence.
- **numreads**: Number of reads covering the reference sequence.
- **covbases**: Number of bases with coverage.
- **coverage**: Percentage of the reference sequence covered (depth ≥ 1).
- **meandepth**: Mean depth of coverage over the reference sequence.
- **meanbaseq**: Mean base quality over the reference sequence.
- **meanmapq**: Mean mapping quality over the reference sequence.

These per-sample QC metrics are then combined into a single file (`snippy_combined_qc_metrics`). The combined QC metrics file includes the same columns as above for all samples. Note that the last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome.

!!! tip "QC Metrics for Phylogenetic Analysis"
These QC metrics provide valuable insights into the quality and coverage of your sequencing data relative to the reference genome. Monitoring these metrics can help identify samples with low coverage, poor alignment, or potential issues that may affect downstream analyses.

!!! techdetails "Snippy Variants Technical Details"
| | Links |
| --- | --- |
| Task | [task_snippy_variants.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/gene_typing/variant_detection/task_snippy_variants.wdl) |
| Software Source Code | [Snippy on GitHub](https://github.com/tseemann/snippy) |
| Software Documentation | [Snippy on GitHub](https://github.com/tseemann/snippy) |

??? task "Snippy_Tree workflow"

##### Snippy_Tree {#snippy_tree}
@@ -179,6 +213,8 @@ For all cases:

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| snippy_centroid_docker | String | Docker file used for Centroid |
@@ -188,6 +224,7 @@ For all cases:
| snippy_centroid_version | String | Centroid version used |
| snippy_cg_snp_matrix | File | CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment |
| snippy_concatenated_variants | File | The concatenated variants file |
| snippy_combined_qc_metrics | File | Combined QC metrics file containing concatenated QC metrics from all samples. |
| snippy_filtered_metadata | File | TSV recording the columns of the Terra data table that were used in the summarize_data task |
| snippy_final_alignment | File | Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites) |
| snippy_final_tree | File | Final phylogenetic tree produced by Snippy_Streamline |
@@ -223,3 +260,5 @@ For all cases:
| snippy_variants_snippy_docker | Array[String] | Docker file used for Snippy in the Snippy_Variants subworkflow |
| snippy_variants_snippy_version | Array[String] | Version of Snippy used in the Snippy_Variants subworkflow |
| snippy_wg_snp_matrix | File | CSV file of whole genome pairwise SNP distances between samples, calculated from the final alignment |

</div>
@@ -4,7 +4,7 @@

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Phylogenetic Construction](../../workflows_overview/workflows_type.md/#phylogenetic-construction) | [Bacteria](../../workflows_overview/workflows_kingdom.md/#bacteria) | PHB v2.2.0 | Yes; some optional features incompatible | Set-level |
| [Phylogenetic Construction](../../workflows_overview/workflows_type.md/#phylogenetic-construction) | [Bacteria](../../workflows_overview/workflows_kingdom.md/#bacteria) | PHB v2.3.0 | Yes; some optional features incompatible | Set-level |

## Snippy_Streamline_FASTA_PHB

@@ -37,8 +37,46 @@ The `Snippy_Streamline_FASTA` workflow is an all-in-one approach to generating a

**If reference genomes have multiple contigs, they will not be compatible with using Gubbins** to mask recombination in the phylogenetic tree. The automatic selection of a reference genome by the workflow may result in a reference with multiple contigs. In this case, an alternative reference genome should be sought.

### Workflow Tasks

??? task "Snippy_Variants QC Metrics Concatenation (optional)"

##### Snippy_Variants QC Metrics Concatenation (optional) {#snippy_variants}

Optionally, the user can provide the `snippy_variants_qc_metrics` file produced by the Snippy_Variants workflow as input to the workflow to concatenate the reports for each sample in the tree. These per-sample QC metrics include the following columns:

- **samplename**: The name of the sample.
- **reads_aligned_to_reference**: The number of reads that aligned to the reference genome.
- **total_reads**: The total number of reads in the sample.
- **percent_reads_aligned**: The percentage of reads that aligned to the reference genome.
- **variants_total**: The total number of variants detected between the sample and the reference genome.
- **percent_ref_coverage**: The percentage of the reference genome covered by reads with a depth greater than or equal to the `min_coverage` threshold (default is 10).
- **#rname**: Reference sequence name (e.g., chromosome or contig name).
- **startpos**: Starting position of the reference sequence.
- **endpos**: Ending position of the reference sequence.
- **numreads**: Number of reads covering the reference sequence.
- **covbases**: Number of bases with coverage.
- **coverage**: Percentage of the reference sequence covered (depth ≥ 1).
- **meandepth**: Mean depth of coverage over the reference sequence.
- **meanbaseq**: Mean base quality over the reference sequence.
- **meanmapq**: Mean mapping quality over the reference sequence.

The combined QC metrics file includes the same columns as above for all samples. Note that the last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome.

!!! tip "QC Metrics for Phylogenetic Analysis"
These QC metrics provide valuable insights into the quality and coverage of your sequencing data relative to the reference genome. Monitoring these metrics can help identify samples with low coverage, poor alignment, or potential issues that may affect downstream analyses, and we recommend examining them before proceeding with phylogenetic analysis if performing Snippy_Variants and Snippy_Tree separately.

!!! techdetails "Snippy Variants Technical Details"
| | Links |
| --- | --- |
| Task | [task_snippy_variants.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/gene_typing/variant_detection/task_snippy_variants.wdl) |
| Software Source Code | [Snippy on GitHub](https://github.com/tseemann/snippy) |
| Software Documentation | [Snippy on GitHub](https://github.com/tseemann/snippy) |

### Inputs

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| snippy_streamline_fasta | **assembly_fasta** | Array[File] | The assembly files for your samples | | Required |
@@ -107,8 +145,12 @@ The `Snippy_Streamline_FASTA` workflow is an all-in-one approach to generating a
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| snippy_centroid_docker | String | Docker file used for Centroid |
@@ -117,6 +159,7 @@ The `Snippy_Streamline_FASTA` workflow is an all-in-one approach to generating a
| snippy_centroid_samplename | String | Name of the centroid sample |
| snippy_centroid_version | String | Centroid version used |
| snippy_cg_snp_matrix | File | CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment |
| snippy_combined_qc_metrics | File | Combined QC metrics file containing concatenated QC metrics from all samples. |
| snippy_concatenated_variants | File | The concatenated variants file |
| snippy_filtered_metadata | File | TSV recording the columns of the Terra data table that were used in the summarize_data task |
| snippy_final_alignment | File | Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites) |
@@ -151,3 +194,5 @@ The `Snippy_Streamline_FASTA` workflow is an all-in-one approach to generating a
| snippy_variants_snippy_docker | Array[String] | Docker file used for Snippy in the Snippy_Variants subworkflow |
| snippy_variants_snippy_version | Array[String] | Version of Snippy used in the Snippy_Variants subworkflow |
| snippy_wg_snp_matrix | File | CSV file of whole genome pairwise SNP distances between samples, calculated from the final alignment |

</div>
47 changes: 45 additions & 2 deletions docs/workflows/phylogenetic_construction/snippy_tree.md
@@ -4,7 +4,7 @@

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Phylogenetic Construction](../../workflows_overview/workflows_type.md/#phylogenetic-construction) | [Bacteria](../../workflows_overview/workflows_kingdom.md/#bacteria) | PHB v2.1.0 | Yes; some optional features incompatible | Set-level |
| [Phylogenetic Construction](../../workflows_overview/workflows_type.md/#phylogenetic-construction) | [Bacteria](../../workflows_overview/workflows_kingdom.md/#bacteria) | PHB v2.3.0 | Yes; some optional features incompatible | Set-level |

## Snippy_Tree_PHB

@@ -53,6 +53,8 @@ Sequencing data used in the Snippy_Tree workflow must:
- Using the core genome
- `core_genome` = true (as default)

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| snippy_tree_wf | **tree_name_updated** | String | Internal component, do not modify. Used for replacing spaces with underscores | | Do not modify |
@@ -123,6 +125,8 @@ Sequencing data used in the Snippy_Tree workflow must:
| wg_snp_dists | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| wg_snp_dists | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |

</div>

### Workflow Tasks

??? task "Snippy"
@@ -262,7 +266,7 @@ Sequencing data used in the Snippy_Tree workflow must:

| | Links |
| --- | --- |
| Task | [task_summarize_data.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/utilities/task_summarize_data.wdl) |
| Task | [task_summarize_data.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/utilities/data_handling/task_summarize_data.wdl) |

??? task "Concatenate Variants (optional)"

@@ -306,12 +310,49 @@ Sequencing data used in the Snippy_Tree workflow must:
| Task | task_shared_variants.wdl |
| Software Source Code | [task_shared_variants.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/phylogenetic_inference/utilities/task_shared_variants.wdl) |

??? task "Snippy_Variants QC Metrics Concatenation (optional)"

##### Snippy_Variants QC Metrics Concatenation (optional) {#snippy_variants}

Optionally, the user can provide the `snippy_variants_qc_metrics` file produced by the Snippy_Variants workflow as input to the workflow to concatenate the reports for each sample in the tree. These per-sample QC metrics include the following columns:

- **samplename**: The name of the sample.
- **reads_aligned_to_reference**: The number of reads that aligned to the reference genome.
- **total_reads**: The total number of reads in the sample.
- **percent_reads_aligned**: The percentage of reads that aligned to the reference genome.
- **variants_total**: The total number of variants detected between the sample and the reference genome.
- **percent_ref_coverage**: The percentage of the reference genome covered by reads with a depth greater than or equal to the `min_coverage` threshold (default is 10).
- **#rname**: Reference sequence name (e.g., chromosome or contig name).
- **startpos**: Starting position of the reference sequence.
- **endpos**: Ending position of the reference sequence.
- **numreads**: Number of reads covering the reference sequence.
- **covbases**: Number of bases with coverage.
- **coverage**: Percentage of the reference sequence covered (depth ≥ 1).
- **meandepth**: Mean depth of coverage over the reference sequence.
- **meanbaseq**: Mean base quality over the reference sequence.
- **meanmapq**: Mean mapping quality over the reference sequence.

The combined QC metrics file includes the same columns as above for all samples. Note that the last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome.

!!! tip "QC Metrics for Phylogenetic Analysis"
These QC metrics provide valuable insights into the quality and coverage of your sequencing data relative to the reference genome. Monitoring these metrics can help identify samples with low coverage, poor alignment, or potential issues that may affect downstream analyses, and we recommend examining them before proceeding with phylogenetic analysis if performing Snippy_Variants and Snippy_Tree separately.

!!! techdetails "Snippy Variants Technical Details"
| | Links |
| --- | --- |
| Task | [task_snippy_variants.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/gene_typing/variant_detection/task_snippy_variants.wdl) |
| Software Source Code | [Snippy on GitHub](https://github.com/tseemann/snippy) |
| Software Documentation | [Snippy on GitHub](https://github.com/tseemann/snippy) |

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| snippy_cg_snp_matrix | File | CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment |
| snippy_concatenated_variants | File | Concatenated snippy_results file across all samples in the set |
| snippy_combined_qc_metrics | File | Combined QC metrics file containing concatenated QC metrics from all samples. |
| snippy_filtered_metadata | File | TSV recording the columns of the Terra data table that were used in the summarize_data task |
| snippy_final_alignment | File | Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites) |
| snippy_final_tree | File | Newick tree produced from the final alignment. Depending on user input for core_genome, the tree could be a core genome tree (default when core_genome is true) or whole genome tree (if core_genome is false) |
@@ -336,6 +377,8 @@ Sequencing data used in the Snippy_Tree workflow must:
| snippy_tree_version | String | Version of Snippy_Tree workflow |
| snippy_wg_snp_matrix | File | CSV file of whole genome pairwise SNP distances between samples, calculated from the final alignment |

</div>

## References

> **Gubbins:** Croucher, Nicholas J., Andrew J. Page, Thomas R. Connor, Aidan J. Delaney, Jacqueline A. Keane, Stephen D. Bentley, Julian Parkhill, and Simon R. Harris. 2015. "Rapid Phylogenetic Analysis of Large Samples of Recombinant Bacterial Whole Genome Sequences Using Gubbins." Nucleic Acids Research 43 (3): e15.
46 changes: 44 additions & 2 deletions docs/workflows/phylogenetic_construction/snippy_variants.md
@@ -4,7 +4,7 @@

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Phylogenetic Construction](../../workflows_overview/workflows_type.md/#phylogenetic-construction) | [Bacteria](../../workflows_overview/workflows_kingdom.md/#bacteria), [Mycotics](../../workflows_overview/workflows_kingdom.md#mycotics), [Viral](../../workflows_overview/workflows_kingdom.md/#viral) | PHB v2.2.0 | Yes | Sample-level |
| [Phylogenetic Construction](../../workflows_overview/workflows_type.md/#phylogenetic-construction) | [Bacteria](../../workflows_overview/workflows_kingdom.md/#bacteria), [Mycotics](../../workflows_overview/workflows_kingdom.md#mycotics), [Viral](../../workflows_overview/workflows_kingdom.md/#viral) | PHB v2.3.0 | Yes | Sample-level |

## Snippy_Variants_PHB

@@ -29,6 +29,8 @@ The `Snippy_Variants` workflow aligns single-end or paired-end reads (in FASTQ f
!!! info "Query String"
The query string can be a gene or any other annotation that matches the GenBank file/output VCF **EXACTLY**

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| snippy_variants_wf | **reference_genome_file** | File | Reference genome (GenBank file or fasta) | | Required |
@@ -54,9 +56,44 @@ The `Snippy_Variants` workflow aligns single-end or paired-end reads (in FASTQ f
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Workflow Tasks

`Snippy_Variants` uses the snippy tool to align reads to the reference and call SNPs, MNPs and INDELs according to optional input parameters. The output includes a file of variants that is then queried using the `grep` bash command to identify any mutations in specified genes or annotations of interest. The query string MUST match the gene name or annotation as specified in the GenBank file and provided in the output variant file in the `snippy_results` column.
`Snippy_Variants` uses Snippy to align reads to the reference and call SNPs, MNPs and INDELs according to optional input parameters. The output includes a file of variants that is then queried using the `grep` bash command to identify any mutations in specified genes or annotations of interest. The query string MUST match the gene name or annotation as specified in the GenBank file and provided in the output variant file in the `snippy_results` column.
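
As an illustration of the query step, here is a minimal sketch that applies the same exact-match filtering in Python instead of `grep`; the file name and query strings are hypothetical:

```python
# Keep rows of the Snippy results CSV that exactly contain a query string,
# mirroring the workflow's grep-based gene query.
import csv

queries = ["gyrA", "parC"]  # hypothetical genes of interest
with open("snippy_results.csv") as handle:  # hypothetical file name
    reader = csv.reader(handle)
    header = next(reader)
    hits = [row for row in reader
            if any(query in field for field in row for query in queries)]

for hit in hits:
    print(dict(zip(header, hit)))
```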

!!! info "Quality Control Metrics"
Additionally, `Snippy_Variants` extracts quality control (QC) metrics from the Snippy output for each sample. These per-sample QC metrics are saved in TSV files (`snippy_variants_qc_metrics`). The QC metrics include:

- **samplename**: The name of the sample.
- **reads_aligned_to_reference**: The number of reads that aligned to the reference genome.
- **total_reads**: The total number of reads in the sample.
- **percent_reads_aligned**: The percentage of reads that aligned to the reference genome; also available in the `snippy_variants_percent_reads_aligned` output column.
- **variants_total**: The total number of variants detected between the sample and the reference genome.
- **percent_ref_coverage**: The percentage of the reference genome covered by reads with a depth greater than or equal to the `min_coverage` threshold (default is 10); also available in the `snippy_variants_percent_ref_coverage` output column.
- **#rname**: Reference sequence name (e.g., chromosome or contig name).
- **startpos**: Starting position of the reference sequence.
- **endpos**: Ending position of the reference sequence.
- **numreads**: Number of reads covering the reference sequence.
- **covbases**: Number of bases with coverage.
- **coverage**: Percentage of the reference sequence covered (depth ≥ 1).
- **meandepth**: Mean depth of coverage over the reference sequence.
- **meanbaseq**: Mean base quality over the reference sequence.
- **meanmapq**: Mean mapping quality over the reference sequence.

Note that the last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome.

!!! tip "QC Metrics for Phylogenetic Analysis"
These QC metrics provide valuable insights into the quality and coverage of your sequencing data relative to the reference genome. Monitoring these metrics can help identify samples with low coverage, poor alignment, or potential issues that may affect downstream analyses, and we recommend examining them before proceeding with phylogenetic analysis if performing Snippy_Variants and Snippy_Tree separately.

These per-sample QC metrics can also be combined into a single file (`snippy_combined_qc_metrics`) in downstream workflows, such as `snippy_tree`, providing an overview of QC metrics across all samples.
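
A minimal sketch of that combination step, assuming the per-sample TSVs share a single header row and sit in the working directory (the file-naming pattern is hypothetical):

```python
# Concatenate per-sample QC metric TSVs, keeping the header only once.
import glob

tsv_paths = sorted(glob.glob("*_qc_metrics.tsv"))  # hypothetical pattern
with open("combined_qc_metrics.tsv", "w") as out:
    for index, path in enumerate(tsv_paths):
        with open(path) as handle:
            header = handle.readline()
            if index == 0:
                out.write(header)
            out.writelines(handle)
```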

!!! techdetails "Snippy Variants Technical Details"
| | Links |
| --- | --- |
| Task | [task_snippy_variants.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/gene_typing/variant_detection/task_snippy_variants.wdl)<br>[task_snippy_gene_query.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/gene_typing/variant_detection/task_snippy_gene_query.wdl) |
| Software Source Code | [Snippy on GitHub](https://github.com/tseemann/snippy) |
| Software Documentation | [Snippy on GitHub](https://github.com/tseemann/snippy) |

### Outputs

@@ -66,6 +103,8 @@ The `Snippy_Variants` workflow aligns single-end or paired-end reads (in FASTQ f
!!! warning "Note on coverage calculations"
The outputs from `samtools coverage` (found in the `snippy_variants_coverage_tsv` file) may differ from the `snippy_variants_percent_ref_coverage` due to different calculation methods. `samtools coverage` computes genome-wide coverage metrics (e.g., the proportion of bases covered at depth ≥ 1), while `snippy_variants_percent_ref_coverage` uses a user-defined minimum coverage threshold (default is 10), calculating the proportion of the reference genome with a depth greater than or equal to this threshold.
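
The difference is easiest to see on a toy depth profile; the sketch below compares the two calculations and is illustrative only, not the workflow's actual code:

```python
# Contrast samtools-style coverage (depth >= 1) with percent_ref_coverage
# (depth >= min_coverage, default 10) on a toy per-base depth profile.
depths = [0, 3, 12, 15, 9, 30, 0, 11]
min_coverage = 10

coverage_any = sum(depth >= 1 for depth in depths) / len(depths)
coverage_min = sum(depth >= min_coverage for depth in depths) / len(depths)

print(f"coverage (depth >= 1): {coverage_any:.1%}")                          # 75.0%
print(f"percent_ref_coverage (depth >= {min_coverage}): {coverage_min:.1%}")  # 50.0%
```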

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| snippy_variants_bai | File | Indexed bam file of the reads aligned to the reference |
@@ -79,9 +118,12 @@ The `Snippy_Variants` workflow aligns single-end or paired-end reads (in FASTQ f
| snippy_variants_outdir_tarball | File | A compressed file containing the whole directory of snippy output files. This is used when running Snippy_Tree |
| snippy_variants_percent_reads_aligned | Float | Percentage of reads aligned to the reference genome |
| snippy_variants_percent_ref_coverage | Float | Proportion of the reference genome covered by reads with a depth greater than or equal to the `min_coverage` threshold (default is 10). |
| snippy_variants_qc_metrics | File | TSV file containing quality control metrics for the sample |
| snippy_variants_query | String | Query strings specified by the user when running the workflow |
| snippy_variants_query_check | String | Verification that query strings are found in the reference genome |
| snippy_variants_results | File | CSV file detailing results for all mutations identified in the query sequence relative to the reference |
| snippy_variants_summary | File | A summary TXT file showing the number of mutations identified for each mutation type |
| snippy_variants_version | String | Version of Snippy used |
| snippy_variants_wf_version | String | Version of Snippy_Variants used |

</div>
8 changes: 8 additions & 0 deletions docs/workflows/phylogenetic_placement/samples_to_ref_tree.md
@@ -17,6 +17,8 @@ However, nextclade can be used on any organism as long as an existing, high-q

### Inputs

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| nextclade_addToRefTree | **assembly_fasta** | File | A fasta file with query sequence(s) to be placed onto the global tree | | Required |
@@ -34,8 +36,12 @@ However, nextclade can be used on any organism as long as an an existing, high-q
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| treeUpdate_auspice_json | File | Phylogenetic tree with user placed samples |
@@ -45,3 +51,5 @@ However, nextclade can be used on any organism as long as an an existing, high-q
| treeUpdate_nextclade_version | String | Nextclade version used |
| samples_to_ref_tree_analysis_date | String | Date of analysis |
| samples_to_ref_tree_version | String | Version of the Public Health Bioinformatics (PHB) repository used |

</div>
8 changes: 8 additions & 0 deletions docs/workflows/phylogenetic_placement/usher.md
@@ -14,6 +14,8 @@

While this workflow is technically a set-level workflow, it works on the sample-level too. When run on the set-level, the samples are placed with respect to each other.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| usher_workflow | **assembly_fasta** | Array[File] | The assembly files for the samples you want to place on the pre-existing tree; this can either be a set of samples, an individual sample, or multiple individual samples | | Required |
@@ -29,8 +31,12 @@ While this workflow is technically a set-level workflow, it works on the sample-
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| usher_clades | File | The clades predicted for the samples |
@@ -41,3 +47,5 @@ While this workflow is technically a set-level workflow, it works on the sample-
| usher_subtrees | Array[File] | An array of subtrees where your samples have been placed |
| usher_uncondensed_tree | File | The entire global tree with your samples included (warning: may be a very large file if the organism is "sars-cov-2") |
| usher_version | String | The version of UShER used |

</div>
52 changes: 52 additions & 0 deletions docs/workflows/public_data_sharing/fetch_srr_accession.md
@@ -0,0 +1,52 @@
# Fetch SRR Accession Workflow

## Quick Facts

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Data Import](../../workflows_overview/workflows_type.md/#data-import) | [Any Taxa](../../workflows_overview/workflows_kingdom.md/#any-taxa) | PHB v2.3.0 | Yes | Sample-level |

## Fetch_SRR_Accession_PHB

This workflow retrieves the Sequence Read Archive (SRA) accession (SRR) associated with a given sample accession. The primary inputs are BioSample IDs (e.g., SAMN00000000) or SRA Experiment IDs (e.g., SRX000000), which link to sequencing data in the SRA repository.

The workflow uses the `fastq-dl` tool to fetch metadata from SRA and parses this metadata to extract and output the associated SRR accession.

### Inputs

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
| --- | --- | --- | --- | --- | --- |
| fetch_srr_metadata | **sample_accession** | String | SRA-compatible accession, such as a **BioSample ID** (e.g., "SAMN00000000") or **SRA Experiment ID** (e.g., "SRX000000"), used to retrieve SRR metadata. | | Required |
| fetch_srr_metadata | **cpu** | Int | Number of CPUs allocated for the task. | 2 | Optional |
| fetch_srr_metadata | **disk_size** | Int | Disk space in GB allocated for the task. | 10 | Optional |
| fetch_srr_metadata | **docker** | String | Docker image for metadata retrieval. | `us-docker.pkg.dev/general-theiagen/biocontainers/fastq-dl:2.0.4--pyhdfd78af_0` | Optional |
| fetch_srr_metadata | **memory** | Int | Memory in GB allocated for the task. | 8 | Optional |
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

### Workflow Tasks

This workflow has a single task that performs metadata retrieval for the specified sample accession.

??? task "`fastq-dl`: Fetches SRR metadata for sample accession"
When provided with a BioSample accession or SRA Experiment ID, `fastq-dl` collects metadata and returns the appropriate SRR accession.
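
A minimal sketch of the parsing step, assuming `fastq-dl` has already written a run-info TSV; the file name and column name below are assumptions for illustration, not guaranteed by the tool:

```python
# Pull SRR run accessions out of a fastq-dl metadata TSV.
# "fastq-run-info.tsv" and "run_accession" are assumed names.
import csv

with open("fastq-run-info.tsv") as handle:
    reader = csv.DictReader(handle, delimiter="\t")
    srr_accessions = [row["run_accession"] for row in reader]

print(",".join(srr_accessions))  # e.g. "SRR0000001,SRR0000002"
```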

!!! techdetails "fastq-dl Technical Details"
| | Links |
| --- | --- |
| Task | [task_fetch_srr_metadata.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/utilities/data_handling/task_fetch_srr_metadata.wdl) |
| Software Source Code | [fastq-dl on GitHub](https://github.com/rpetit3/fastq-dl) |
| Software Documentation | [fastq-dl on GitHub](https://github.com/rpetit3/fastq-dl) |

### Outputs

| **Variable** | **Type** | **Description** |
|---|---|---|
| srr_accession | String | The SRR accession(s) associated with the input sample accession. |
| fetch_srr_accession_version | String | The version of the fetch_srr_accession workflow. |
| fetch_srr_accession_analysis_date | String | The date the fetch_srr_accession analysis was run. |

## References

> Petit, R. A., III. fastq-dl: download FASTQ files from SRA or ENA repositories. <https://github.com/rpetit3/fastq-dl>
31 changes: 21 additions & 10 deletions docs/workflows/public_data_sharing/mercury_prep_n_batch.md
@@ -4,7 +4,7 @@

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Public Data Sharing](../../workflows_overview/workflows_type.md/#public-data-sharing) | [Viral](../../workflows_overview/workflows_kingdom.md/#viral) | PHB v2.2.0 | Yes | Set-level |
| [Public Data Sharing](../../workflows_overview/workflows_type.md/#public-data-sharing) | [Viral](../../workflows_overview/workflows_kingdom.md/#viral) | PHB v2.3.0 | Yes | Set-level |

## Mercury_Prep_N_Batch_PHB

@@ -52,36 +52,41 @@ To help users collect all required metadata, we have created the following Excel

The `using_clearlabs_data` and `using_reads_dehosted` arguments change the default values for the `read1_column_name`, `assembly_fasta_column_name`, and `assembly_mean_coverage_column_name` metadata columns. The default values are shown in the table below in addition to what they are changed to depending on what arguments are used.

| Variable | Default Value | with `using_clearlabs_data` | with `using_reads_dehosted` | with both  `using_clearlabs_data` ***and*** `using_reads_dehosted` |
| Variable | Default Value | with `using_clearlabs_data` | with `using_reads_dehosted` | with both  `using_clearlabs_data` **_and_** `using_reads_dehosted` |
| --- | --- | --- | --- | --- |
| `read1_column_name` | `"read1_dehosted"` | `"clearlabs_fastq_gz"` | `"reads_dehosted"` | `"reads_dehosted"` |
| `assembly_fasta_column_name` | `"assembly_fasta"` | `"clearlabs_fasta"` | `"assembly_fasta"` | `"clearlabs_fasta"` |
| `assembly_mean_coverage_column_name` | `"assembly_mean_coverage"` | `"clearlabs_assembly_coverage"` | `"assembly_mean_coverage"` | `"clearlabs_assembly_coverage"` |
| `assembly_mean_coverage_column_name` | `"assembly_mean_coverage"` | `"clearlabs_sequencing_depth"` | `"assembly_mean_coverage"` | `"clearlabs_sequencing_depth"` |
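
A minimal sketch of that selection logic in Python (the helper function is hypothetical; the workflow implements the equivalent inside the `mercury` task):

```python
# Resolve the three metadata column names from the two Boolean inputs,
# following the table above. Hypothetical helper for illustration only.
def resolve_columns(using_clearlabs_data=False, using_reads_dehosted=False):
    read1, fasta, coverage = "read1_dehosted", "assembly_fasta", "assembly_mean_coverage"
    if using_clearlabs_data:
        read1, fasta, coverage = "clearlabs_fastq_gz", "clearlabs_fasta", "clearlabs_sequencing_depth"
    if using_reads_dehosted:
        read1 = "reads_dehosted"  # takes priority over the Clear Labs read column
    return read1, fasta, coverage

print(resolve_columns(using_clearlabs_data=True, using_reads_dehosted=True))
# ('reads_dehosted', 'clearlabs_fasta', 'clearlabs_sequencing_depth')
```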

### Inputs

!!! tip "Use the sample table for the `terra_table_name` input"
Make sure your entry for `terra_table_name` is for the _sample_ table! While the root entity needs to be the set table, the input value for `terra_table_name` should be the sample table.

This workflow runs on the set-level.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| mercury_prep_n_batch | **gcp_bucket_uri** | String | Google bucket where your SRA reads will be temporarily stored before transferring to SRA. Example: "gs://theiagen_sra_transfer" | | Required |
| mercury_prep_n_batch | **sample_names** | Array[String] | The samples you want to submit | | Required |
| mercury_prep_n_batch | **terra_project_name** | String | The name of your Terra project. You can find this information in the URL of the webpage of your Terra dashboard. For example, if your URL contains #workspaces/example/my_workspace/ then your project name is example | | Required |
| mercury_prep_n_batch | **terra_table_name** | String | The name of the Terra table where your samples can be found. Do not include the entity: prefix or the _id suffix, just the name of the table as listed in the sidebar on lefthand side of the Terra Data tab. | | Required |
| mercury_prep_n_batch | **terra_project_name** | String | The name of your Terra project. You can find this information in the URL of the webpage of your Terra dashboard. For example, if your URL contains `#workspaces/example/my_workspace/` then your project name is `example` | | Required |
| mercury_prep_n_batch | **terra_table_name** | String | The name of the Terra table where your **samples** can be found. Do not include the `entity:` prefix, the `_id` suffix, or the `_set_id` suffix, just the name of the sample-level data table as listed in the sidebar on lefthand side of the Terra Data tab. | | Required |
| mercury_prep_n_batch | **terra_workspace_name** | String | The name of your Terra workspace where your samples can be found. For example, if your URL contains #workspaces/example/my_workspace/ then your project name is my_workspace | | Required |
| download_terra_table | **cpu** | Int | Number of CPUs to allocate to the task | 1 | Optional |
| download_terra_table | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 10 | Optional |
| download_terra_table | **docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-06-21 | Optional |
| download_terra_table | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 1 | Optional |
| download_terra_table | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| mercury | **cpu** | Int | Number of CPUs to allocate to the task | 2 | Optional |
| mercury | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| mercury | **docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/mercury:1.0.7 | Optional |
| mercury | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
| mercury | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| mercury | **docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/mercury:1.0.9 | Optional |
| mercury | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional |
| mercury | **number_N_threshold** | Int | Only for "sars-cov-2" submissions; used to filter out any samples that contain more than the indicated number of Ns in the assembly file | 5000 | Optional |
| mercury | **single_end** | Boolean | Set to true if your data is single-end; this ensures that a read2 column is not included in the metadata | FALSE | Optional |
| mercury | **skip_county** | Boolean | Use if your Terra table contains a county column that you do not want to include in your submission. | FALSE | Optional |
| mercury | **usa_territory** | Boolean | If true, the "state" column will be used in place of the "country" column. For example, if "state" is Puerto Rico, then the GISAID virus name will be `hCoV-19/Puerto Rico/<name>/<year>`. The NCBI `geo_loc_name` will be "USA: Puerto Rico". This optional Boolean variable should only be used with clear understanding of what it does. | FALSE | Optional |
| mercury | **using_clearlabs_data** | Boolean | When set to true will change read1_dehosted → clearlabs_fastq_gz; assembly_fasta → clearlabs_fasta; assembly_mean_coverage → clearlabs_assembly_coverage | FALSE | Optional |
| mercury | **using_clearlabs_data** | Boolean | When set to `true` will change `read1_dehosted``clearlabs_fastq_gz`; `assembly_fasta``clearlabs_fasta`; `assembly_mean_coverage``clearlabs_sequencing_depth` | FALSE | Optional |
| mercury | **using_reads_dehosted** | Boolean | When set to true will only change read1_dehosted → reads_dehosted. Takes priority over the replacement for read1_dehosted made with the using_clearlabs_data Boolean input | FALSE | Optional |
| mercury | **vadr_alert_limit** | Int | Only for "sars-cov-2" submissions; used to filter out any samples that contain more than the indicated number of vadr alerts | 0 | Optional |
| mercury_prep_n_batch | **authors_sbt** | File | Only for "mpox" submissions; a file that contains author information. This file can be created here: <https://submit.ncbi.nlm.nih.gov/genbank/template/submission/> | | Optional |
@@ -101,8 +106,12 @@ This workflow runs on the set-level.
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| bankit_sqn_to_email | File | **Only for mpox submission**: the sqn file that you will use to submit mpox assembly files to NCBI via email |
@@ -117,6 +126,8 @@ This workflow runs on the set-level.
| mercury_script_version | String | Version of the Mercury tool that was used in this workflow |
| sra_metadata | File | SRA metadata TSV file for upload |

</div>

???+ toggle "An example excluded_samples.tsv file"

##### An example excluded_samples.tsv file {#example-excluded-samples}
8 changes: 8 additions & 0 deletions docs/workflows/public_data_sharing/terra_2_gisaid.md
@@ -28,6 +28,8 @@ The optional variable `frameshift_notification` has three options that correspon

This workflow runs on the sample level.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| Terra_2_GISAID | **client_id** | String | This value should be filled with the client-ID provided by GISAID | | Required |
@@ -43,12 +45,18 @@ This workflow runs on the sample level.
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| failed_uploads | Boolean | The metadata for any failed uploads |
| gisaid_cli_version | String | The version of the GISAID CLI tool |
| gisaid_logs | File | The log files regarding the submission |
| terra_2_gisaid_analysis_date | String | The date of the analysis |
| terra_2_gisaid_version | String | The version of the PHB repository that this workflow is hosted in |

</div>
10 changes: 9 additions & 1 deletion docs/workflows/public_data_sharing/terra_2_ncbi.md
@@ -4,7 +4,7 @@

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Public Data Sharing](../../workflows_overview/workflows_type.md/#public-data-sharing) | [Bacteria](../../workflows_overview/workflows_kingdom.md#bacteria), [Mycotics](../../workflows_overview/workflows_kingdom.md#mycotics) [Viral](../../workflows_overview/workflows_kingdom.md/#viral) | PHB v2.1.0 | No | Set-level |
| [Public Data Sharing](../../workflows_overview/workflows_type.md/#public-data-sharing) | [Bacteria](../../workflows_overview/workflows_kingdom.md#bacteria), [Mycotics](../../workflows_overview/workflows_kingdom.md#mycotics), [Viral](../../workflows_overview/workflows_kingdom.md/#viral) | PHB v2.3.0 | No | Set-level |

## Terra_2_NCBI_PHB

@@ -103,6 +103,8 @@ This workflow runs on set-level data tables.
!!! info "Production Submissions"
Please note that an optional Boolean variable, `submit_to_production`, is **required** for a production submission.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
| --- | --- | --- | --- | --- | --- |
| Terra_2_NCBI | **bioproject** | String | BioProject accession that the samples will be submitted to | | Required |
@@ -143,6 +145,8 @@ This workflow runs on set-level data tables.
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

??? task "Workflow Tasks"

##### Workflow Tasks {#workflow-tasks}
@@ -178,6 +182,8 @@ If the workflow ends unsuccessfully, no outputs will be shown on Terra and the `

The output files contain information mostly for debugging purposes. Additionally, if your submission is successful, the point of contact for the submission should also receive an email from NCBI notifying them of their submission success.

<div class="searchable-table" markdown="1">

| Variable | Description | Type |
| --- | --- | --- |
| biosample_failures | Text file listing samples that failed BioSample submission | File |
@@ -193,6 +199,8 @@ The output files contain information mostly for debugging purposes. Additionally
| terra_2_ncbi_analysis_date | Date that the workflow was run | String |
| terra_2_ncbi_version | Version of the PHB repository where the workflow is hosted | String |

</div>

???+ toggle "An example excluded_samples.tsv file"

##### An example excluded_samples.tsv file {#example-excluded-samples}
47 changes: 47 additions & 0 deletions docs/workflows/standalone/concatenate_illumina_lanes.md
@@ -0,0 +1,47 @@
# Concatenate Illumina Lanes

## Quick Facts

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Standalone](../../workflows_overview/workflows_type.md/#standalone) | [Any Taxa](../../workflows_overview/workflows_kingdom.md/#any-taxa) | PHB v2.3.0 | Yes | Sample-level |

## Concatenate_Illumina_Lanes_PHB

Some Illumina machines produce multi-lane FASTQ files for a single sample. This workflow concatenates the multiple lanes into a single FASTQ file per read type (forward or reverse).

### Inputs

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| concatenate_illumina_lanes | **read1_lane1** | File | The first lane for the forward reads | | Required |
| concatenate_illumina_lanes | **read1_lane2** | File | The second lane for the forward reads | | Required |
| concatenate_illumina_lanes | **samplename** | String | The name of the sample, used to name the output files | | Required |
| cat_lanes | **cpu** | Int | Number of CPUs to allocate to the task | 2 | Optional |
| cat_lanes | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
| cat_lanes | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/utility:1.2" | Optional |
| cat_lanes | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
| concatenate_illumina_lanes | **read1_lane3** | File | The third lane for the forward reads | | Optional |
| concatenate_illumina_lanes | **read1_lane4** | File | The fourth lane for the forward reads | | Optional |
| concatenate_illumina_lanes | **read2_lane1** | File | The first lane for the reverse reads | | Optional |
| concatenate_illumina_lanes | **read2_lane2** | File | The second lane for the reverse reads | | Optional |
| concatenate_illumina_lanes | **read2_lane3** | File | The third lane for the reverse reads | | Optional |
| concatenate_illumina_lanes | **read2_lane4** | File | The fourth lane for the reverse reads | | Optional |
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

### Workflow Tasks

This workflow concatenates the Illumina lanes for forward and (if provided) reverse reads. The output files are named as follows:

- Forward reads: `<samplename>_merged_R1.fastq.gz`
- Reverse reads: `<samplename>_merged_R2.fastq.gz`
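
Conceptually, merging lanes is just a byte-level concatenation: concatenated gzip members form a valid gzip stream, so the lane files can be joined without decompressing. A minimal command-line sketch of the same operation (filenames are illustrative):

```bash
# join lanes in order; the result is a valid multi-member gzip stream
cat sample_L001_R1.fastq.gz sample_L002_R1.fastq.gz > sample_merged_R1.fastq.gz
cat sample_L001_R2.fastq.gz sample_L002_R2.fastq.gz > sample_merged_R2.fastq.gz
```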

### Outputs

| **Variable** | **Type** | **Description** |
|---|---|---|
| concatenate_illumina_lanes_analysis_date | String | Date of analysis |
| concatenate_illumina_lanes_version | String | Version of PHB used for the analysis |
| read1_concatenated | File | Concatenated forward reads |
| read2_concatenated | File | Concatenated reverse reads |
10 changes: 9 additions & 1 deletion docs/workflows/standalone/gambit_query.md
@@ -12,6 +12,8 @@ The GAMBIT_Query_PHB workflow performs taxon assignment of a genome assembly usi

### Inputs

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| gambit_query | **assembly_fasta** | File | Assembly file in FASTA format | | Required |
@@ -23,6 +25,8 @@ The GAMBIT_Query_PHB workflow performs taxon assignment of a genome assembly usi
| gambit | **gambit_db_genomes** | File | Database of metadata for assembled query genomes; requires complementary signatures file. If not provided, uses default database "/gambit-db" | "gs://gambit-databases-rp/2.0.0/gambit-metadata-2.0.0-20240628.gdb" | Optional |
| gambit | **gambit_db_signatures** | File | Signatures file; requires complementary genomes file. If not specified, the file from the docker container will be used. | "gs://gambit-databases-rp/2.0.0/gambit-signatures-2.0.0-20240628.gs" | Optional |

</div>

### Workflow Tasks

[`GAMBIT`](https://github.com/jlumpe/gambit) determines the taxon of the genome assembly using a k-mer based approach to match the assembly sequence to the closest complete genome in a database, thereby predicting its identity. Sometimes, GAMBIT can confidently designate the organism to the species level. Other times, it is more conservative and assigns it to a higher taxonomic rank.
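
For orientation, a local GAMBIT query of a single assembly might look like the sketch below. The database paths mirror the defaults listed above; the `-d`/`query`/`-o` usage follows GAMBIT's upstream documentation and should be confirmed with `gambit --help` for the pinned version:

```bash
# point GAMBIT at a directory containing the .gdb metadata and .gs signatures
# files, then classify one assembly and write a CSV report
gambit -d /path/to/gambit-db query -o sample_gambit.csv assembly.fasta
```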
@@ -40,6 +44,8 @@ For additional details regarding the GAMBIT tool and a list of available GAMBIT

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| gambit_closest_genomes | File | CSV file listing genomes in the GAMBIT database that are most similar to the query assembly |
@@ -50,6 +56,8 @@ For additional details regarding the GAMBIT tool and a list of available GAMBIT
| gambit_query_wf_analysis_date | String | Date of analysis |
| gambit_query_wf_version | String | PHB repository version |
| gambit_report | File | GAMBIT report in a machine-readable format |
| gambit_version | String | Version of gambit software used
| gambit_version | String | Version of gambit software used |

</div>

> GAMBIT (Genomic Approximation Method for Bacterial Identification and Tracking): A methodology to rapidly leverage whole genome sequencing of bacterial isolates for clinical identification. Lumpe et al. PLOS ONE, 2022. DOI: [10.1371/journal.pone.0277575](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0277575)
14 changes: 13 additions & 1 deletion docs/workflows/standalone/kraken2.md
Original file line number Diff line number Diff line change
@@ -4,7 +4,7 @@

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Standalone](../../workflows_overview/workflows_type.md/#standalone) | [Any Taxa](../../workflows_overview/workflows_kingdom.md/#any-taxa) | PHB v2.0.0 | Yes | Sample-level |
| [Standalone](../../workflows_overview/workflows_type.md/#standalone) | [Any Taxa](../../workflows_overview/workflows_kingdom.md/#any-taxa) | PHB v2.3.0 | Yes | Sample-level |

## Kraken2 Workflows

@@ -30,6 +30,8 @@ Besides the data input types, there are minimal differences between these two wo

#### Suggested databases

<div class="searchable-table" markdown="1">

| Database name | Database Description | Suggested Applications | GCP URI (for usage in Terra) | Source | Database Size (GB) | Date of Last Update |
| --- | --- | --- | --- | --- | --- | --- |
| **Kalamari v5.1** | Kalamari is a database of complete public assemblies that has been fine-tuned for enteric pathogens and is backed by trusted institutions. [Full list available here (in chromosomes.tsv and plasmids.tsv)](https://github.com/lskatz/Kalamari/tree/master/src) | Single-isolate enteric bacterial pathogen analysis (Salmonella, Escherichia, Shigella, Listeria, Campylobacter, Vibrio, Yersinia) | **`gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2.kalamari_5.1.tar.gz`** || 1.5 | 18/5/2022 |
@@ -40,8 +42,12 @@ Besides the data input types, there are minimal differences between these two wo
| **EuPathDB48** | Eukaryotic pathogen genomes with contaminants removed. [Full list available here](https://genome-idx.s3.amazonaws.com/kraken/k2_eupathdb48_20201113/EuPathDB48_Contents.txt) | Eukaryotic organisms (Candida spp., Aspergillus spp., etc) | **`gs://theiagen-public-files-rp/terra/theiaprok-files/k2_eupathdb48_20201113.tar.gz`** | https://benlangmead.github.io/aws-indexes/k2 | 30.3 | 13/11/2020 |
| **EuPathDB48** | Eukaryotic pathogen genomes with contaminants removed. [Full list available here](https://genome-idx.s3.amazonaws.com/kraken/k2_eupathdb48_20201113/EuPathDB48_Contents.txt) | Eukaryotic organisms (Candida spp., Aspergillus spp., etc) | **`gs://theiagen-large-public-files-rp/terra/databases/kraken/k2_eupathdb48_20230407.tar.gz`** | https://benlangmead.github.io/aws-indexes/k2 | 11 | 7/4/2023 |

</div>

### Inputs

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** | **Workflow** |
|---|---|---|---|---|---|---|
| *workflow_name | **kraken2_db** | File | A Kraken2 database in .tar.gz format | | Required | ONT, PE, SE |
@@ -67,8 +73,12 @@ Besides the data input types, there are minimal differences between these two wo
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional | ONT, PE, SE |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional | ONT, PE, SE |

</div>

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| kraken2_classified_read1 | File | FASTQ file of classified forward/R1 reads |
@@ -85,6 +95,8 @@ Besides the data input types, there are minimal differences between these two wo
| krona_html | File | HTML report of krona with visualisation of taxonomic classification of reads (if PE or SE) |
| krona_version | String | krona version (if PE or SE) |

</div>

#### Interpretation of results

The most important outputs of the Kraken2 workflows are the `kraken2_report` files. These include a breakdown of the number of sequences assigned to each taxon and the percentage of reads assigned. [A complete description of the report format can be found here](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#standard-kraken-output-format).
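
As a quick triage step, the species-level lines of a `kraken2_report` can be pulled out with standard shell tools. A sketch, assuming the standard six-column report layout described in the Kraken2 manual:

```bash
# report columns: 1 = % of reads in clade, 2 = clade read count,
# 3 = reads assigned directly, 4 = rank code, 5 = NCBI taxid, 6 = name
awk -F'\t' '$4 == "S" { gsub(/^ +/, "", $6); print $1 "%\t" $6 }' sample.kraken2_report.txt \
  | sort -t$'\t' -k1,1nr | head -n 10
```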
8 changes: 8 additions & 0 deletions docs/workflows/standalone/ncbi_amrfinderplus.md
Original file line number Diff line number Diff line change
@@ -19,6 +19,8 @@ You can check if a gene or point mutation is in the AMRFinderPlus database [here

### Inputs

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| amrfinderplus_wf | **assembly** | File | Genome assembly file in FASTA format. Can be generated by TheiaProk workflow or other bioinformatics workflows. | | Required |
@@ -35,8 +37,12 @@ You can check if a gene or point mutation is in the AMRFinderPlus database [here
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| amrfinderplus_all_report | File | Output TSV file from AMRFinderPlus (described [here](https://github.com/ncbi/amr/wiki/Running-AMRFinderPlus#fields)) |
@@ -54,6 +60,8 @@ You can check if a gene or point mutation is in the AMRFinderPlus database [here
| amrfinderplus_wf_analysis_date | String | Date of analysis |
| amrfinderplus_wf_version | String | Version of PHB used for the analysis |

</div>

## References

>Feldgarden M, Brover V, Gonzalez-Escalona N, Frye JG, Haendiges J, Haft DH, Hoffmann M, Pettengill JB, Prasad AB, Tillman GE, Tyson GH, Klimke W. AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Sci Rep. 2021 Jun 16;11(1):12728. doi: 10.1038/s41598-021-91456-0. PMID: 34135355; PMCID: PMC8208984. <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8208984/>
10 changes: 9 additions & 1 deletion docs/workflows/standalone/ncbi_scrub.md
@@ -16,11 +16,14 @@ There are three Kraken2 workflows:

### Inputs

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** | **Workflow** |
|---|---|---|---|---|---|---|
| dehost_pe or dehost_se | **read1** | File | Forward reads in FASTQ format | | Required | PE, SE |
| dehost_pe or dehost_se | **read2** | File | Reverse reads in FASTQ format | | Required | PE |
| dehost_pe or dehost_se | **samplename** | String | Name of the sample; used to name output files | | Required | PE, SE |
| dehost_pe or dehost_se | **target_organism** | String | Target organism for Kraken2 reporting | "Severe acute respiratory syndrome coronavirus 2" | Optional | PE, SE |
| kraken2 | **cpu** | Int | Number of CPUs to allocate to the task | 4 | Optional | PE, SE |
| kraken2 | **disk_size** | Int | Amount of storage (in GB) to allocate to the task. Increase this when using large (>30GB kraken2 databases such as the "k2_standard" database) | 100 | Optional | PE, SE |
| kraken2 | **docker_image** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv | Optional | PE, SE |
@@ -35,6 +38,8 @@ There are three Kraken2 workflows:
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional | PE, SE |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional | PE, SE |

</div>

### Workflow Tasks

This workflow is composed of two tasks: one to dehost the input reads, and another to screen the cleaned reads with Kraken2 against the viral+human database.
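
For reference, the screening step amounts to a standard paired-end Kraken2 call; a sketch, assuming a local copy of the viral+human database and illustrative file names:

```bash
# classify dehosted paired-end reads and write the standard Kraken2 report
kraken2 \
  --db /path/to/kraken2_viral_human_db \
  --paired --gzip-compressed \
  --report sample.kraken2_report.txt \
  --output sample.kraken2_hits.txt \
  sample_R1_dehosted.fastq.gz sample_R2_dehosted.fastq.gz
```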
@@ -62,13 +67,15 @@ This workflow is composed of two tasks, one to dehost the input reads and anothe
| | Links |
| --- | --- |
| Task | [task_kraken2.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/taxon_id/task_kraken2.wdl) |
| Task | [task_kraken2.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/taxon_id/contamination/task_kraken2.wdl) |
| Software Source Code | [Kraken2 on GitHub](https://github.com/DerrickWood/kraken2/) |
| Software Documentation | <https://github.com/DerrickWood/kraken2/wiki> |
| Original Publication(s) | [Improved metagenomic analysis with Kraken 2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0) |

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** | **Workflow** |
|---|---|---|---|
| kraken_human_dehosted | Float | Percent of human read data detected using the Kraken2 software after host removal | PE, SE |
@@ -82,3 +89,4 @@ This workflow is composed of two tasks, one to dehost the input reads and anothe
| read1_dehosted | File | Dehosted forward reads | PE, SE |
| read2_dehosted | File | Dehosted reverse reads | PE |

</div>
8 changes: 8 additions & 0 deletions docs/workflows/standalone/rasusa.md
Original file line number Diff line number Diff line change
@@ -27,6 +27,8 @@ RASUSA functions to randomly downsample the number of raw reads to a user-define

### Inputs

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Attribute** | **Terra Status** |
|---|---|---|---|---|---|
| rasusa_workflow | **coverage** | Float | The desired coverage of reads after downsampling. The actual coverage of the subsampled reads will not be exact and may be slightly higher; when necessary, verify it by checking the estimated clean coverage reported by downstream workflows | | Required |
@@ -45,8 +47,12 @@ RASUSA functions to randomly downsample the number of raw reads to a user-define
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| rasusa_version | String | Version of RASUSA used for the analysis |
@@ -55,6 +61,8 @@ RASUSA functions to randomly downsample the number of raw reads to a user-define
| read1_subsampled | File | New read1 FASTQ files downsampled to desired coverage |
| read2_subsampled | File | New read2 FASTQ files downsampled to desired coverage |

</div>

!!! tip "Don't Forget!"
Remember to use the subsampled reads in downstream analyses with `this.read1_subsampled` and `this.read2_subsampled` inputs.
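
Outside Terra, an equivalent downsampling call might look like the sketch below; `-i`/`-o` are repeated once per read file, and flag spellings vary between RASUSA releases, so confirm against `rasusa --help` for the version pinned in the task:

```bash
# randomly subsample a read pair to roughly 100x over a 5 Mb genome
rasusa \
  -i sample_R1.fastq.gz -i sample_R2.fastq.gz \
  --coverage 100 --genome-size 5mb \
  -o sample_R1_subsampled.fastq.gz -o sample_R2_subsampled.fastq.gz
```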

4 changes: 4 additions & 0 deletions docs/workflows/standalone/rename_fastq.md
@@ -12,6 +12,8 @@ This sample-level workflow receives a read file or a pair of read files (FASTQ),

### Inputs

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| rename_fastq_files | **new_filename** | String | New name for the FASTQ file(s) | | Required |
@@ -24,6 +26,8 @@ This sample-level workflow receives a read file or a pair of read files (FASTQ),
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Outputs

If a reverse read (`read2`) is provided, the files are renamed to the provided `new_filename` input with the notation `<new_filename>_R1.fastq.gz` and `<new_filename>_R2.fastq.gz`. If only `read1` is provided, the file is renamed to `<new_filename>.fastq.gz`.
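
The renaming itself is a straightforward copy under the new name; a minimal local sketch of the same convention (variable values are illustrative):

```bash
read1="old_name_R1.fastq.gz"
read2="old_name_R2.fastq.gz"   # leave unset for single-end data
new_filename="my_isolate"

if [ -n "${read2:-}" ]; then
  # paired-end: add _R1/_R2 suffixes
  cp "$read1" "${new_filename}_R1.fastq.gz"
  cp "$read2" "${new_filename}_R2.fastq.gz"
else
  # single-end: plain rename
  cp "$read1" "${new_filename}.fastq.gz"
fi
```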
12 changes: 10 additions & 2 deletions docs/workflows/standalone/tbprofiler_tngs.md
@@ -4,14 +4,16 @@

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Standalone](../../workflows_overview/workflows_type.md/#standalone) | [Bacteria](../../workflows_overview/workflows_kingdom.md/#bacteria) | PHB v2.0.0 | Yes | Sample-level |
| [Standalone](../../workflows_overview/workflows_type.md/#standalone) | [Bacteria](../../workflows_overview/workflows_kingdom.md/#bacteria) | PHB v2.3.0 | Yes | Sample-level |

## TBProfiler_tNGS_PHB

This workflow is still in the experimental research stage. Documentation is minimal while the code remains subject to change; it will be fleshed out once a stable state has been achieved.

### Inputs

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| tbprofiler_tngs | **read1** | File | Illumina forward read file in FASTQ file format (compression optional) | | Required |
@@ -21,7 +23,7 @@ This workflow is still in experimental research stages. Documentation is minimal
| tbp_parser | **coverage_threshold** | Int | The minimum percentage of a region to exceed the minimum depth for a region to pass QC in tbp_parser | 100 | Optional |
| tbp_parser | **cpu** | Int | Number of CPUs to allocate to the task | 1 | Optional |
| tbp_parser | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| tbp_parser | **docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/tbp-parser:1.6.0 | Optional |
| tbp_parser | **docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/theiagen/tbp-parser:2.2.2 | Optional |
| tbp_parser | **etha237_frequency** | Float | Minimum frequency for a mutation in ethA at protein position 237 to pass QC in tbp-parser | 0.1 | Optional |
| tbp_parser | **expert_rule_regions_bed** | File | A file that contains the regions where R mutations and expert rules are applied | | Optional |
| tbp_parser | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
@@ -62,8 +64,12 @@ This workflow is still in experimental research stages. Documentation is minimal
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>

### Terra Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| tbp_parser_average_genome_depth | Float | The mean depth of coverage across all target regions included in the analysis |
@@ -95,3 +101,5 @@ This workflow is still in experimental research stages. Documentation is minimal
| trimmomatic_read2_trimmed | File | The read2 file post trimming |
| trimmomatic_stats | File | The read trimming statistics |
| trimmomatic_version | String | The version of trimmomatic used in this analysis |

</div>
8 changes: 8 additions & 0 deletions docs/workflows/standalone/theiavalidate.md
@@ -39,6 +39,8 @@ If a column consists of only GCP URIs (Google Cloud file paths), the files will

### Inputs

<div class="searchable-table" markdown="1">

Please note that all string inputs **must** be enclosed in quotation marks; for example, "column1,column2" or "workspace1".

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
@@ -62,6 +64,8 @@ Please note that all string inputs **must** be enclosed in quotation marks; for
| export_two_tsvs | **cpu** | Int | Number of CPUs to allocate to the task | 1 | Optional |
| export_two_tsvs | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 10 | Optional |

</div>

The optional `validation_criteria_tsv` file takes the following format (tab-delimited; _a header line is required_):

```text linenums="1"
@@ -95,6 +99,8 @@ Please note that the name in the **second column** will be displayed and used in

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| theiavalidate_criteria_differences | File | A TSV file that lists only the differences that fail to meet the validation criteria |
@@ -108,6 +114,8 @@ Please note that the name in the **second column** will be displayed and used in
| theiavalidate_version | String | The version of the TheiaValidate Python Docker |
| theiavalidate_wf_version | String | The version of the PHB repository |

</div>

### Example Data and Outputs

To help demonstrate how TheiaValidate works, please observe the following example and outputs:
35 changes: 21 additions & 14 deletions docs/workflows_overview/workflows_alphabetically.md

Large diffs are not rendered by default.

52 changes: 34 additions & 18 deletions docs/workflows_overview/workflows_kingdom.md

Large diffs are not rendered by default.

58 changes: 44 additions & 14 deletions docs/workflows_overview/workflows_type.md

Large diffs are not rendered by default.

21 changes: 13 additions & 8 deletions mkdocs.yml
@@ -23,7 +23,7 @@ nav:
- Freyja Workflow Series: workflows/genomic_characterization/freyja.md
- Pangolin_Update: workflows/genomic_characterization/pangolin_update.md
- TheiaCoV Workflow Series: workflows/genomic_characterization/theiacov.md
- TheiaEuk: workflows/genomic_characterization/theiaeuk.md
- TheiaEuk Workflow Series: workflows/genomic_characterization/theiaeuk.md
- TheiaMeta: workflows/genomic_characterization/theiameta.md
- TheiaProk Workflow Series: workflows/genomic_characterization/theiaprok.md
- VADR_Update: workflows/genomic_characterization/vadr_update.md
@@ -43,6 +43,7 @@ nav:
- Samples_to_Ref_Tree: workflows/phylogenetic_placement/samples_to_ref_tree.md
- Usher_PHB: workflows/phylogenetic_placement/usher.md
- Public Data Sharing:
- Fetch_SRR_Accession: workflows/public_data_sharing/fetch_srr_accession.md
- Mercury_Prep_N_Batch: workflows/public_data_sharing/mercury_prep_n_batch.md
- Terra_2_GISAID: workflows/public_data_sharing/terra_2_gisaid.md
- Terra_2_NCBI: workflows/public_data_sharing/terra_2_ncbi.md
@@ -52,6 +53,7 @@ nav:
- Zip_Column_Content: workflows/data_export/zip_column_content.md
- Standalone:
- Cauris_CladeTyper: workflows/standalone/cauris_cladetyper.md
- Concatenate_Illumina_Lanes: workflows/standalone/concatenate_illumina_lanes.md
- GAMBIT_Query: workflows/standalone/gambit_query.md
- Kraken2: workflows/standalone/kraken2.md
- NCBI-AMRFinderPlus: workflows/standalone/ncbi_amrfinderplus.md
@@ -65,7 +67,8 @@ nav:
- Any Taxa:
- Assembly_Fetch: workflows/data_import/assembly_fetch.md
- BaseSpace_Fetch: workflows/data_import/basespace_fetch.md
- Concatenate_Column_Content: workflows/data_export/concatenate_column_content.md
- Concatenate_Column_Content: workflows/data_export/concatenate_column_content.md
- Concatenate_Illumina_Lanes: workflows/standalone/concatenate_illumina_lanes.md
- Create_Terra_Table: workflows/data_import/create_terra_table.md
- Kraken2: workflows/standalone/kraken2.md
- NCBI-Scrub: workflows/standalone/ncbi_scrub.md
@@ -100,7 +103,7 @@ nav:
- NCBI-AMRFinderPlus: workflows/standalone/ncbi_amrfinderplus.md
- Snippy_Variants: workflows/phylogenetic_construction/snippy_variants.md
- Terra_2_NCBI: workflows/public_data_sharing/terra_2_ncbi.md
- TheiaEuk: workflows/genomic_characterization/theiaeuk.md
- TheiaEuk Workflow Series: workflows/genomic_characterization/theiaeuk.md
- Viral:
- Augur: workflows/phylogenetic_construction/augur.md
- CZGenEpi_Prep: workflows/phylogenetic_construction/czgenepi_prep.md
@@ -123,6 +126,7 @@ nav:
- BaseSpace_Fetch: workflows/data_import/basespace_fetch.md
- Cauris_CladeTyper: workflows/standalone/cauris_cladetyper.md
- Concatenate_Column_Content: workflows/data_export/concatenate_column_content.md
- Concatenate_Illumina_Lanes: workflows/standalone/concatenate_illumina_lanes.md
- Core_Gene_SNP: workflows/phylogenetic_construction/core_gene_snp.md
- Create_Terra_Table: workflows/data_import/create_terra_table.md
- CZGenEpi_Prep: workflows/phylogenetic_construction/czgenepi_prep.md
@@ -149,7 +153,7 @@ nav:
- Terra_2_GISAID: workflows/public_data_sharing/terra_2_gisaid.md
- Terra_2_NCBI: workflows/public_data_sharing/terra_2_ncbi.md
- TheiaCoV Workflow Series: workflows/genomic_characterization/theiacov.md
- TheiaEuk: workflows/genomic_characterization/theiaeuk.md
- TheiaEuk Workflow Series: workflows/genomic_characterization/theiaeuk.md
- TheiaMeta: workflows/genomic_characterization/theiameta.md
- TheiaProk Workflow Series: workflows/genomic_characterization/theiaprok.md
- TheiaValidate: workflows/standalone/theiavalidate.md
@@ -230,11 +234,12 @@ plugins:
# - section-index

extra_javascript:
- https://unpkg.com/tablesort@5.3.0/dist/tablesort.min.js
- javascripts/tablesort.js
- https://unpkg.com/tablesort@5.3.0/dist/tablesort.min.js
- javascripts/tablesort.js
- javascripts/table-search.js

extra_css:
- stylesheets/extra.css
- stylesheets/extra.css

extra:
social:
@@ -251,4 +256,4 @@ extra:
homepage: https://www.theiagen.com

copyright: |
&copy; 2022-2024 <a href="https://www.theiagen.com" target="_blank" rel="noopener">Theiagen Genomics</a>
&copy; 2022-2024 <a href="https://www.theiagen.com" target="_blank" rel="noopener">Theiagen Genomics</a>
10 changes: 8 additions & 2 deletions tasks/assembly/task_artic_consensus.wdl
@@ -12,7 +12,7 @@ task consensus {
Int memory = 16
Int disk_size = 100
String medaka_model = "r941_min_high_g360"
String docker = "us-docker.pkg.dev/general-theiagen/staphb/artic-ncov2019-epi2me"
String docker = "us-docker.pkg.dev/general-theiagen/staphb/artic:1.2.4-1.12.0"
}
String primer_name = basename(primer_bed)
command <<<
@@ -61,7 +61,13 @@ task consensus {
# version control
echo "Medaka via $(artic -v)" | tee VERSION
echo "~{primer_name}" | tee PRIMER_NAME
artic minion --medaka --medaka-model ~{medaka_model} --normalise ~{normalise} --threads ~{cpu} --scheme-directory ./primer-schemes --read-file ~{read1} ${scheme_name} ~{samplename}
artic minion \
--medaka \
--medaka-model ~{medaka_model} \
--normalise ~{normalise} \
--threads ~{cpu} \
--scheme-directory ./primer-schemes \
--read-file ~{read1} ${scheme_name} ~{samplename}
gunzip -f ~{samplename}.pass.vcf.gz

# clean up fasta header
23 changes: 20 additions & 3 deletions tasks/assembly/task_irma.wdl
@@ -87,9 +87,26 @@ task irma {
echo "Type_"$(basename "$(echo "$(find ~{samplename}/*.fasta | head -n1)")" | cut -d_ -f1) > IRMA_TYPE
# set irma_type bash variable which is used later
irma_type=$(cat IRMA_TYPE)
# concatenate consensus assemblies into single file with all genome segments
echo "DEBUG: creating IRMA FASTA file containing all segments...."
cat ~{samplename}/*.fasta > ~{samplename}.irma.consensus.fasta

# flu segments from largest to smallest
segments=("PB2" "PB1" "PA" "HA" "NP" "NA" "MP" "NS")

echo "DEBUG: creating IRMA FASTA file containing all segments in order (largest to smallest)...."

# initialize an empty file
touch ~{samplename}.irma.consensus.fasta

# concatenate files in the order of the segments array
for segment in "${segments[@]}"; do
segment_file=$(find "~{samplename}" -name "*${segment}*.fasta")
if [ -n "$segment_file" ]; then
echo "DEBUG: Adding $segment_file to consensus FASTA"
cat "$segment_file" >> ~{samplename}.irma.consensus.fasta
else
echo "WARNING: No file containing ${segment} found for ~{samplename}"
fi
done

echo "DEBUG: editing IRMA FASTA file to include sample name in FASTA headers...."
sed -i "s/>/>~{samplename}_/g" ~{samplename}.irma.consensus.fasta

40 changes: 37 additions & 3 deletions tasks/gene_typing/variant_detection/task_snippy_variants.wdl
@@ -89,19 +89,52 @@ task snippy_variants {
if [ "$reference_length" -eq 0 ]; then
echo "Could not compute percent reference coverage: reference length is 0" > PERCENT_REF_COVERAGE
else
# compute percent reference coverage
echo $reference_length_passed_depth $reference_length | awk '{ print ($1/$2)*100 }' > PERCENT_REF_COVERAGE
echo $reference_length_passed_depth $reference_length | awk '{ printf("%.2f", ($1/$2)*100) }' > PERCENT_REF_COVERAGE
fi

# Compute percentage of reads aligned
reads_aligned=$(cat READS_ALIGNED_TO_REFERENCE)
total_reads=$(samtools view -c "~{samplename}/~{samplename}.bam")
echo $total_reads > TOTAL_READS
if [ "$total_reads" -eq 0 ]; then
echo "Could not compute percent reads aligned: total reads is 0" > PERCENT_READS_ALIGNED
else
echo $reads_aligned $total_reads | awk '{ print ($1/$2)*100 }' > PERCENT_READS_ALIGNED
echo $reads_aligned $total_reads | awk '{ printf("%.2f", ($1/$2)*100) }' > PERCENT_READS_ALIGNED
fi

# Create QC metrics file
line_count=$(wc -l < "~{samplename}/~{samplename}_coverage.tsv")
# Check the number of lines in the coverage file to handle organisms with multiple chromosomes (e.g. V. cholerae), which produce per-chromosome coverage metrics
if [ "$line_count" -eq 2 ]; then
head -n 1 "~{samplename}/~{samplename}_coverage.tsv" | tr ' ' '\t' > COVERAGE_HEADER
sed -n '2p' "~{samplename}/~{samplename}_coverage.tsv" | tr ' ' '\t' > COVERAGE_VALUES
elif [ "$line_count" -gt 2 ]; then
# Multiple chromosomes (header + multiple data lines)
header=$(head -n 1 "~{samplename}/~{samplename}_coverage.tsv")
output_header=""
output_values=""
# while loop to iterate over each line in the coverage file
while read -r line; do
if [ -z "$output_header" ]; then
output_header="$header"
output_values="$line"
else
output_header="$output_header\t$header"
output_values="$output_values\t$line"
fi
done < <(tail -n +2 "~{samplename}/~{samplename}_coverage.tsv")
echo "$output_header" | tr ' ' '\t' > COVERAGE_HEADER
echo "$output_values" | tr ' ' '\t' > COVERAGE_VALUES
else
# Coverage file has insufficient data
echo "Coverage file has insufficient data." > COVERAGE_HEADER
echo "" > COVERAGE_VALUES
fi

# Build the QC metrics file
echo -e "samplename\treads_aligned_to_reference\ttotal_reads\tpercent_reads_aligned\tvariants_total\tpercent_ref_coverage\t$(cat COVERAGE_HEADER)" > "~{samplename}/~{samplename}_qc_metrics.tsv"
echo -e "~{samplename}\t$reads_aligned\t$total_reads\t$(cat PERCENT_READS_ALIGNED)\t$(cat VARIANTS_TOTAL)\t$(cat PERCENT_REF_COVERAGE)\t$(cat COVERAGE_VALUES)" >> "~{samplename}/~{samplename}_qc_metrics.tsv"

>>>
output {
String snippy_variants_version = read_string("VERSION")
@@ -120,6 +153,7 @@ task snippy_variants {
String snippy_variants_ref_length = read_string("REFERENCE_LENGTH")
String snippy_variants_ref_length_passed_depth = read_string("REFERENCE_LENGTH_PASSED_DEPTH")
String snippy_variants_percent_ref_coverage = read_string("PERCENT_REF_COVERAGE")
File snippy_variants_qc_metrics = "~{samplename}/~{samplename}_qc_metrics.tsv"
String snippy_variants_percent_reads_aligned = read_string("PERCENT_READS_ALIGNED")
}
runtime {
6 changes: 6 additions & 0 deletions tasks/phylogenetic_inference/augur/task_augur_align.wdl
@@ -12,8 +12,13 @@ task augur_align {
String docker = "us-docker.pkg.dev/general-theiagen/biocontainers/augur:22.0.2--pyhdfd78af_0"
}
command <<<
set -euo pipefail

# capture version information
augur version > VERSION
echo
echo "mafft version:"
mafft --version 2>&1 | tee MAFFT_VERSION

# run augur align
augur align \
@@ -26,6 +31,7 @@ task augur_align {
output {
File aligned_fasta = "alignment.fasta"
String augur_version = read_string("VERSION")
String mafft_version = read_string("MAFFT_VERSION")
}
runtime {
docker: docker
44 changes: 44 additions & 0 deletions tasks/phylogenetic_inference/augur/task_augur_tree.wdl
@@ -16,8 +16,30 @@ task augur_tree {
String docker = "us-docker.pkg.dev/general-theiagen/biocontainers/augur:22.0.2--pyhdfd78af_0"
}
command <<<
set -euo pipefail

# capture version information
augur version > VERSION
echo

# touch the version files to ensure they exist (so that read_string output function doesn't fail)
touch IQTREE_VERSION FASTTREE_VERSION RAXML_VERSION

# capture version information only for the method selected by user OR default of iqtree
if [ "~{method}" == "iqtree" ]; then
echo "iqtree version:"
iqtree --version | grep version | sed 's/.*version/version/;s/ for Linux.*//' | tee IQTREE_VERSION
elif [ "~{method}" == "fasttree" ]; then
echo "fasttree version:"
# fasttree prints to STDERR, so we need to redirect it to STDOUT, then grep for line with version info, then cut to extract version number (and nothing else)
fasttree -help 2>&1 | grep -m 1 "FastTree" | cut -d ' ' -f 2 | tee FASTTREE_VERSION
elif [ "~{method}" == "raxml" ]; then
echo "raxml version:"
raxmlHPC -v | grep RAxML | sed -e 's/.*RAxML version //' -e 's/released.*//' | tee RAXML_VERSION
fi

echo
echo "Running augur tree now..."

AUGUR_RECURSION_LIMIT=10000 augur tree \
--alignment "~{aligned_fasta}" \
@@ -28,10 +50,32 @@ task augur_tree {
~{"--tree-builder-args " + tree_builder_args} \
~{true="--override-default-args" false="" override_default_args} \
--nthreads auto

# If iqtree, get the model used
if [ "~{method}" == "iqtree" ]; then
if [ "~{substitution_model}" == "auto" ]; then
FASTA_BASENAME=$(basename ~{aligned_fasta} .fasta)
FASTA_DIR=$(dirname ~{aligned_fasta})
MODEL=$(grep "Best-fit model:" ${FASTA_DIR}/*${FASTA_BASENAME}-delim.iqtree.log | sed 's|Best-fit model: ||g;s|chosen.*||' | tr -d '\n\r')
else
MODEL="~{substitution_model}"
fi
echo "$MODEL" > FINAL_MODEL.txt
else
echo "" > FINAL_MODEL.txt
fi

echo
echo "DEBUG: FINAL_MODEL.txt is: $(cat FINAL_MODEL.txt)"
>>>

output {
File aligned_tree = "~{build_name}_~{method}.nwk"
String augur_version = read_string("VERSION")
String iqtree_version = read_string("IQTREE_VERSION")
String fasttree_version = read_string("FASTTREE_VERSION")
String raxml_version = read_string("RAXML_VERSION")
String iqtree_model_used = read_string("FINAL_MODEL.txt")
}
runtime {
docker: docker
35 changes: 33 additions & 2 deletions tasks/quality_control/basic_statistics/task_assembly_metrics.wdl
@@ -14,11 +14,11 @@ task stats_n_coverage {
samtools --version | head -n1 | tee VERSION

samtools stats ~{bamfile} > ~{samplename}.stats.txt

samtools coverage ~{bamfile} -m -o ~{samplename}.cov.hist
samtools coverage ~{bamfile} -o ~{samplename}.cov.txt
samtools flagstat ~{bamfile} > ~{samplename}.flagstat.txt

# Extracting coverage, depth, meanbaseq, and meanmapq
coverage=$(cut -f 6 ~{samplename}.cov.txt | tail -n 1)
depth=$(cut -f 7 ~{samplename}.cov.txt | tail -n 1)
meanbaseq=$(cut -f 8 ~{samplename}.cov.txt | tail -n 1)
@@ -33,6 +33,34 @@ task stats_n_coverage {
echo $depth | tee DEPTH
echo $meanbaseq | tee MEANBASEQ
echo $meanmapq | tee MEANMAPQ

# Parsing stats.txt for total and mapped reads
total_reads=$(grep "^SN" ~{samplename}.stats.txt | grep "raw total sequences:" | cut -f 3)
mapped_reads=$(grep "^SN" ~{samplename}.stats.txt | grep "reads mapped:" | cut -f 3)

# Check for empty values and set defaults to avoid errors
if [ -z "$total_reads" ]; then total_reads="1"; fi # Avoid division by zero
if [ -z "$mapped_reads" ]; then mapped_reads="0"; fi

# Calculate the percentage of mapped reads
percentage_mapped_reads=$(awk "BEGIN {printf \"%.2f\", ($mapped_reads / $total_reads) * 100}")

# If the percentage calculation fails, default to 0.0
if [ -z "$percentage_mapped_reads" ]; then percentage_mapped_reads="0.0"; fi

# Output the result
echo $percentage_mapped_reads | tee PERCENTAGE_MAPPED_READS

# output all metrics in one txt file
# output header row (tab-separated)
echo -e "Statistic\tValue" > ~{samplename}_metrics.txt

# Output each statistic as a row
echo -e "Coverage\t$coverage" >> ~{samplename}_metrics.txt
echo -e "Depth\t$depth" >> ~{samplename}_metrics.txt
echo -e "Mean Base Quality\t$meanbaseq" >> ~{samplename}_metrics.txt
echo -e "Mean Mapping Quality\t$meanmapq" >> ~{samplename}_metrics.txt
echo -e "Percentage Mapped Reads\t$percentage_mapped_reads" >> ~{samplename}_metrics.txt
>>>
output {
String date = read_string("DATE")
@@ -45,6 +73,9 @@ task stats_n_coverage {
Float depth = read_string("DEPTH")
Float meanbaseq = read_string("MEANBASEQ")
Float meanmapq = read_string("MEANMAPQ")
Float percentage_mapped_reads = read_string("PERCENTAGE_MAPPED_READS")
File metrics_txt = "~{samplename}_metrics.txt"

}
runtime {
docker: docker
@@ -55,4 +86,4 @@ task stats_n_coverage {
preemptible: 0
maxRetries: 3
}
}
}
65 changes: 41 additions & 24 deletions tasks/quality_control/basic_statistics/task_fastq_scan.wdl
@@ -6,14 +6,16 @@ task fastq_scan_pe {
File read2
String read1_name = basename(basename(basename(read1, ".gz"), ".fastq"), ".fq")
String read2_name = basename(basename(basename(read2, ".gz"), ".fastq"), ".fq")
Int disk_size = 100
String docker = "quay.io/biocontainers/fastq-scan:0.4.4--h7d875b9_1"
Int disk_size = 50
String docker = "us-docker.pkg.dev/general-theiagen/biocontainers/fastq-scan:1.0.1--h4ac6f70_3"
Int memory = 2
Int cpu = 2
Int cpu = 1
}
command <<<
# capture date and version
date | tee DATE
# exit task in case anything fails in one-liners or variables are unset
set -euo pipefail

# capture version
fastq-scan -v | tee VERSION

# set cat command based on compression
@@ -24,11 +26,21 @@ task fastq_scan_pe {
fi

# capture forward read stats
echo "DEBUG: running fastq-scan on $(basename ~{read1})"
eval "${cat_reads} ~{read1}" | fastq-scan | tee ~{read1_name}_fastq-scan.json
cat ~{read1_name}_fastq-scan.json | jq .qc_stats.read_total | tee READ1_SEQS
# using simple redirect so STDOUT is not confusing
jq .qc_stats.read_total ~{read1_name}_fastq-scan.json > READ1_SEQS
echo "DEBUG: number of reads in $(basename ~{read1}): $(cat READ1_SEQS)"
read1_seqs=$(cat READ1_SEQS)
echo

# capture reverse read stats
echo "DEBUG: running fastq-scan on $(basename ~{read2})"
eval "${cat_reads} ~{read2}" | fastq-scan | tee ~{read2_name}_fastq-scan.json
cat ~{read2_name}_fastq-scan.json | jq .qc_stats.read_total | tee READ2_SEQS

# using simple redirect so STDOUT is not confusing
jq .qc_stats.read_total ~{read2_name}_fastq-scan.json > READ2_SEQS
echo "DEBUG: number of reads in $(basename ~{read2}): $(cat READ2_SEQS)"
read2_seqs=$(cat READ2_SEQS)

# capture number of read pairs
@@ -37,26 +49,27 @@ task fastq_scan_pe {
else
read_pairs="Uneven pairs: R1=${read1_seqs}, R2=${read2_seqs}"
fi

echo $read_pairs | tee READ_PAIRS

# use simple redirect so STDOUT is not confusing
echo "$read_pairs" > READ_PAIRS
echo "DEBUG: number of read pairs: $(cat READ_PAIRS)"
>>>
output {
File read1_fastq_scan_report = "~{read1_name}_fastq-scan.json"
File read2_fastq_scan_report = "~{read2_name}_fastq-scan.json"
File read1_fastq_scan_json = "~{read1_name}_fastq-scan.json"
File read2_fastq_scan_json = "~{read2_name}_fastq-scan.json"
Int read1_seq = read_int("READ1_SEQS")
Int read2_seq = read_int("READ2_SEQS")
String read_pairs = read_string("READ_PAIRS")
String version = read_string("VERSION")
String pipeline_date = read_string("DATE")
String fastq_scan_docker = docker
}
runtime {
docker: docker
memory: memory + " GB"
cpu: cpu
disks: "local-disk " + disk_size + " SSD"
disk: disk_size + " GB" # TES
preemptible: 0
disk: disk_size + " GB"
preemptible: 1
maxRetries: 3
}
}
@@ -65,14 +78,16 @@ task fastq_scan_se {
input {
File read1
String read1_name = basename(basename(basename(read1, ".gz"), ".fastq"), ".fq")
Int disk_size = 100
Int disk_size = 50
Int memory = 2
Int cpu = 2
String docker = "quay.io/biocontainers/fastq-scan:0.4.4--h7d875b9_1"
Int cpu = 1
String docker = "us-docker.pkg.dev/general-theiagen/biocontainers/fastq-scan:1.0.1--h4ac6f70_3"
}
command <<<
# capture date and version
date | tee DATE
# exit task in case anything fails in one-liners or variables are unset
set -euo pipefail

# capture version
fastq-scan -v | tee VERSION

# set cat command based on compression
@@ -83,23 +98,25 @@ task fastq_scan_se {
fi

# capture forward read stats
echo "DEBUG: running fastq-scan on $(basename ~{read1})"
eval "${cat_reads} ~{read1}" | fastq-scan | tee ~{read1_name}_fastq-scan.json
cat ~{read1_name}_fastq-scan.json | jq .qc_stats.read_total | tee READ1_SEQS
# using simple redirect so STDOUT is not confusing
jq .qc_stats.read_total ~{read1_name}_fastq-scan.json > READ1_SEQS
echo "DEBUG: number of reads in $(basename ~{read1}): $(cat READ1_SEQS)"
>>>
output {
File fastq_scan_report = "~{read1_name}_fastq-scan.json"
File fastq_scan_json = "~{read1_name}_fastq-scan.json"
Int read1_seq = read_int("READ1_SEQS")
String version = read_string("VERSION")
String pipeline_date = read_string("DATE")
String fastq_scan_docker = docker
}
runtime {
docker: docker
memory: memory + " GB"
cpu: cpu
disks: "local-disk " + disk_size + " SSD"
disk: disk_size + " GB" # TES
preemptible: 0
disk: disk_size + " GB"
preemptible: 1
maxRetries: 3
}
}
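
The per-file read count that this task parses can be reproduced outside WDL with the same pipeline the command block uses; a sketch, assuming gzipped input and `jq` on the PATH:

```bash
# count reads in a gzipped FASTQ exactly as the task does
zcat sample_R1.fastq.gz | fastq-scan | jq .qc_stats.read_total
```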
51 changes: 26 additions & 25 deletions tasks/quality_control/comparisons/task_screen.wdl
@@ -20,6 +20,9 @@ task check_reads {
Int cpu = 1
}
command <<<
# just in case anything fails, throw an error
set -euo pipefail

flag="PASS"

# initialize estimated genome length
@@ -34,13 +37,13 @@ task check_reads {
fi

# check one: number of reads
read1_num=`eval "$cat_reads ~{read1}" | awk '{s++}END{print s/4}'`
read2_num=`eval "$cat_reads ~{read2}" | awk '{s++}END{print s/4}'`
# awk '{s++}END{print s/4' counts the number of lines and divides them by 4
# key assumption: in fastq there will be four lines per read
# sometimes fastqs do not have 4 lines per read, so this might fail one day
read1_num=$($cat_reads ~{read1} | fastq-scan | grep 'read_total' | sed 's/[^0-9]*\([0-9]\+\).*/\1/')
read2_num=$($cat_reads ~{read2} | fastq-scan | grep 'read_total' | sed 's/[^0-9]*\([0-9]\+\).*/\1/')
echo "DEBUG: Number of reads in R1: ${read1_num}"
echo "DEBUG: Number of reads in R2: ${read2_num}"

reads_total=$(expr $read1_num + $read2_num)
echo "DEBUG: Number of reads total in R1 and R2: ${reads_total}"

if [ "${reads_total}" -le "~{min_reads}" ]; then
flag="FAIL; the total number of reads is below the minimum of ~{min_reads}"
@@ -51,13 +54,11 @@ task check_reads {
# checks two and three: number of basepairs and proportion of sequence
if [ "${flag}" == "PASS" ]; then
# count number of basepairs
# this only works if the fastq has 4 lines per read, so this might fail one day
read1_bp=`eval "${cat_reads} ~{read1}" | paste - - - - | cut -f2 | tr -d '\n' | wc -c`
read2_bp=`eval "${cat_reads} ~{read2}" | paste - - - - | cut -f2 | tr -d '\n' | wc -c`
# paste - - - - (print 4 consecutive lines in one row, tab delimited)
# cut -f2 print only the second column (the second line of the fastq 4-line)
# tr -d '\n' removes line endings
# wc -c counts characters
# using fastq-scan to count the number of basepairs in each fastq
read1_bp=$(eval "${cat_reads} ~{read1}" | fastq-scan | grep 'total_bp' | sed 's/[^0-9]*\([0-9]\+\).*/\1/')
read2_bp=$(eval "${cat_reads} ~{read2}" | fastq-scan | grep 'total_bp' | sed 's/[^0-9]*\([0-9]\+\).*/\1/')
echo "DEBUG: Number of basepairs in R1: $read1_bp"
echo "DEBUG: Number of basepairs in R2: $read2_bp"

# set proportion variables for easy comparison
# removing the , 2) to make these integers instead of floats
@@ -147,7 +148,8 @@ task check_reads {
flag="FAIL; the estimated coverage (${estimated_coverage}) is less than the minimum of ~{min_coverage}x"
else
flag="PASS"
echo $estimated_genome_length | tee EST_GENOME_LENGTH
echo ${estimated_genome_length} | tee EST_GENOME_LENGTH
echo "DEBUG: estimated_genome_length: ${estimated_genome_length}"
fi
fi
fi
@@ -190,6 +192,9 @@ task check_reads_se {
Int cpu = 1
}
command <<<
# just in case anything fails, throw an error
set -euo pipefail

flag="PASS"

# initialize estimated genome length
@@ -203,11 +208,9 @@ task check_reads_se {
cat_reads="cat"
fi

# check one: number of reads
read1_num=`eval "$cat_reads ~{read1}" | awk '{s++}END{print s/4}'`
# awk '{s++}END{print s/4' counts the number of lines and divides them by 4
# key assumption: in fastq there will be four lines per read
# sometimes fastqs do not have 4 lines per read, so this might fail one day
# check one: number of reads via fastq-scan
read1_num=$($cat_reads ~{read1} | fastq-scan | grep 'read_total' | sed 's/[^0-9]*\([0-9]\+\).*/\1/')
echo "DEBUG: Number of reads in R1: ${read1_num}"

if [ "${read1_num}" -le "~{min_reads}" ] ; then
flag="FAIL; the number of reads (${read1_num}) is below the minimum of ~{min_reads}"
@@ -218,12 +221,9 @@ task check_reads_se {
# checks two and three: number of basepairs and proportion of sequence
if [ "${flag}" == "PASS" ]; then
# count number of basepairs
# this only works if the fastq has 4 lines per read, so this might fail one day
read1_bp=`eval "${cat_reads} ~{read1}" | paste - - - - | cut -f2 | tr -d '\n' | wc -c`
# paste - - - - (print 4 consecutive lines in one row, tab delimited)
# cut -f2 print only the second column (the second line of the fastq 4-line)
# tr -d '\n' removes line endings
# wc -c counts characters
# using fastq-scan to count the number of basepairs in each fastq
read1_bp=$(eval "${cat_reads} ~{read1}" | fastq-scan | grep 'total_bp' | sed 's/[^0-9]*\([0-9]\+\).*/\1/')
echo "DEBUG: Number of basepairs in R1: $read1_bp"

if [ "$flag" == "PASS" ] ; then
if [ "${read1_bp}" -le "~{min_basepairs}" ] ; then
@@ -309,7 +309,8 @@ task check_reads_se {
fi

echo $flag | tee FLAG
echo $estimated_genome_length | tee EST_GENOME_LENGTH
echo ${estimated_genome_length} | tee EST_GENOME_LENGTH
echo "DEBUG: estimated_genome_length: ${estimated_genome_length}"
>>>
output {
String read_screen = read_string("FLAG")
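
The screen's coverage gate reduces to integer arithmetic on fastq-scan's `total_bp` field; a condensed sketch of that logic, with `estimated_genome_length` and `min_coverage` standing in for the task's values:

```bash
estimated_genome_length=5000000
min_coverage=10

# total basepairs per read file, extracted the same way the task does it
read1_bp=$(zcat sample_R1.fastq.gz | fastq-scan | grep 'total_bp' | sed 's/[^0-9]*\([0-9]\+\).*/\1/')
read2_bp=$(zcat sample_R2.fastq.gz | fastq-scan | grep 'total_bp' | sed 's/[^0-9]*\([0-9]\+\).*/\1/')

# estimated coverage = total bp / estimated genome length (integer division)
estimated_coverage=$(( (read1_bp + read2_bp) / estimated_genome_length ))
if [ "${estimated_coverage}" -lt "${min_coverage}" ]; then
  echo "FAIL; the estimated coverage (${estimated_coverage}x) is less than the minimum of ${min_coverage}x"
fi
```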
4 changes: 2 additions & 2 deletions tasks/quality_control/read_filtering/task_trimmomatic.wdl
@@ -40,9 +40,9 @@ task trimmomatic_pe {
-threads ~{cpu} \
~{read1} ~{read2} \
-baseout ~{samplename}.fastq.gz \
"${CROPPING_VAR}" \
SLIDINGWINDOW:~{trimmomatic_window_size}:~{trimmomatic_quality_trim_score} \
MINLEN:~{trimmomatic_min_length} &> ~{samplename}.trim.stats.txt \
"${CROPPING_VAR}"
MINLEN:~{trimmomatic_min_length} &> ~{samplename}.trim.stats.txt

>>>
output {
122 changes: 122 additions & 0 deletions tasks/species_typing/escherichia_shigella/task_stxtyper.wdl
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
version 1.0

task stxtyper {
input {
File assembly
String samplename
Boolean enable_debugging = false # Additional messages are printed and files in $TMPDIR are not removed after running
String docker = "us-docker.pkg.dev/general-theiagen/staphb/stxtyper:1.0.24"
Int disk_size = 50
Int cpu = 1
Int memory = 4
}
command <<<
# fail task if any commands below fail, since there are lots of bash conditionals below (AGH!)
set -eo pipefail

# capture version info
stxtyper --version | tee VERSION.txt

# NOTE: by default stxtyper uses $TMPDIR or /tmp, so if we run into issues we may need to adjust in the future. Could potentially use PWD as the TMPDIR.
echo "DEBUG: TMPDIR is set to: $TMPDIR"

echo "DEBUG: running StxTyper now..."
# run StxTyper on assembly; may need to add/remove options in the future if they change
# NOTE: stxtyper can accept gzipped assemblies, so no need to unzip
stxtyper \
--nucleotide ~{assembly} \
--name ~{samplename} \
--output ~{samplename}_stxtyper.tsv \
~{true='--debug' false='' enable_debugging} \
--log ~{samplename}_stxtyper.log

# parse output TSV
echo "DEBUG: Parsing StxTyper output TSV..."

# check for output file with only 1 line (meaning no hits found); exit cleanly if so
if [ "$(wc -l < ~{samplename}_stxtyper.tsv)" -eq 1 ]; then
echo "No hits found by StxTyper" > stxtyper_hits.txt
echo "0" > stxtyper_num_hits.txt
echo "DEBUG: No hits found in StxTyper output TSV. Exiting task with exit code 0 now."

# create empty output files
touch stxtyper_all_hits.txt stxtyper_complete_operons.txt stxtyper_partial_hits.txt stxtyper_stx_frameshifts_or_internal_stop_hits.txt stx_novel_hits.txt
# put "none" into all of them so task does not fail
echo "None" | tee stxtyper_all_hits.txt stxtyper_complete_operons.txt stxtyper_partial_hits.txt stxtyper_stx_frameshifts_or_internal_stop_hits.txt stx_novel_hits.txt
exit 0
fi

# check for output file with more than 1 line (meaning hits found); count lines & parse output TSV if so
if [ "$(wc -l < ~{samplename}_stxtyper.tsv)" -gt 1 ]; then
echo "Hits found by StxTyper. Counting lines & parsing output TSV now..."
# count number of lines in output TSV (excluding header)
wc -l < ~{samplename}_stxtyper.tsv | awk '{print $1-1}' > stxtyper_num_hits.txt
# remove header line
sed '1d' ~{samplename}_stxtyper.tsv > ~{samplename}_stxtyper_noheader.tsv

##### parse output TSV #####
### complete operons
echo "DEBUG: Parsing complete operons..."
awk -F'\t' -v OFS=, '$4 == "COMPLETE" {print $3}' ~{samplename}_stxtyper.tsv | paste -sd, - | tee stxtyper_complete_operons.txt
# if grep for COMPLETE (matched as a whole word so COMPLETE_NOVEL does not count) fails, write "None" to file for output string
if [[ "$(grep --silent -w 'COMPLETE' ~{samplename}_stxtyper.tsv; echo $?)" -gt 0 ]]; then
echo "None" > stxtyper_complete_operons.txt
fi

### complete_novel operons
echo "DEBUG: Parsing complete novel hits..."
awk -F'\t' -v OFS=, '$4 == "COMPLETE_NOVEL" {print $3}' ~{samplename}_stxtyper.tsv | paste -sd, - | tee stx_novel_hits.txt
# if grep for COMPLETE_NOVEL fails, write "None" to file for output string
if [ "$(grep --silent 'COMPLETE_NOVEL' ~{samplename}_stxtyper.tsv; echo $?)" -gt 0 ]; then
echo "None" > stx_novel_hits.txt
fi

### partial hits (to any gene in stx operon)
echo "DEBUG: Parsing stxtyper partial hits..."
# explanation: if "operon" column contains "PARTIAL" (either PARTIAL or PARTIAL_CONTIG_END possible); print either "stx1" or "stx2" or "stx1,stx2"
awk -F'\t' -v OFS=, '$4 ~ "PARTIAL.*" {print $3}' ~{samplename}_stxtyper.tsv | sort | uniq | paste -sd, - | tee stxtyper_partial_hits.txt
# if no stx partial hits found, write "None" to file for output string
if [ "$(grep --silent 'stx' stxtyper_partial_hits.txt; echo $?)" -gt 0 ]; then
echo "None" > stxtyper_partial_hits.txt
fi

### frameshifts or internal stop codons in stx genes
echo "DEBUG: Parsing stx frameshifts or internal stop codons..."
# explanation: if operon column contains "FRAME_SHIFT" or "INTERNAL_STOP", print the "operon" in a sorted/unique list
awk -F'\t' -v OFS=, '$4 == "FRAMESHIFT" || $4 == "INTERNAL_STOP" {print $3}' ~{samplename}_stxtyper.tsv | sort | uniq | paste -sd, - | tee stxtyper_stx_frameshifts_or_internal_stop_hits.txt
# if no frameshifts or internal stop codons found, write "None" to file for output string
if [ "$(grep --silent -E 'FRAMESHIFT|INTERNAL_STOP' ~{samplename}_stxtyper.tsv; echo $?)" -gt 0 ]; then
echo "None" > stxtyper_stx_frameshifts_or_internal_stop_hits.txt
fi

echo "DEBUG: generating stx_type_all string output now..."
# sort and uniq so there are no duplicates; then paste into a single comma-separated line with commas
# sed is to remove any instances of "None" from the output
cat stxtyper_complete_operons.txt stxtyper_partial_hits.txt stxtyper_stx_frameshifts_or_internal_stop_hits.txt stx_novel_hits.txt | sed '/None/d' | sort | uniq | paste -sd, - > stxtyper_all_hits.txt

fi
echo "DEBUG: Finished parsing StxTyper output TSV."
>>>
output {
File stxtyper_report = "~{samplename}_stxtyper.tsv"
File stxtyper_log = "~{samplename}_stxtyper.log"
String stxtyper_docker = docker
String stxtyper_version = read_string("VERSION.txt")
# outputs parsed from stxtyper output TSV
Int stxtyper_num_hits = read_int("stxtyper_num_hits.txt")
String stxtyper_all_hits = read_string("stxtyper_all_hits.txt")
String stxtyper_complete_operon_hits = read_string("stxtyper_complete_operons.txt")
String stxtyper_partial_hits = read_string("stxtyper_partial_hits.txt")
String stxtyper_frameshifts_or_internal_stop_hits = read_string("stxtyper_stx_frameshifts_or_internal_stop_hits.txt")
String stxtyper_novel_hits = read_string("stx_novel_hits.txt")
}
runtime {
docker: "~{docker}"
memory: "~{memory} GB"
cpu: cpu
disks: "local-disk " + disk_size + " SSD"
disk: disk_size + " GB"
preemptible: 1 # does not take long (usually <3 min) to run stxtyper on 1 genome, preemptible is fine
maxRetries: 3
}
}
25 changes: 13 additions & 12 deletions tasks/species_typing/mycobacterium/task_tbp_parser.wdl
@@ -9,29 +9,30 @@ task tbp_parser {

String? sequencing_method
String? operator

Int? min_depth # default 10
Int? coverage_threshold # default 100 (--min_percent_coverage)
File? coverage_regions_bed
Float? min_frequency # default 0.1
Int? min_read_support # default 10
Int? coverage_threshold # default 100 (--min_percent_coverage)
File? coverage_regions_bed

Boolean tbp_parser_debug = false

Boolean add_cycloserine_lims = false

Boolean tbp_parser_debug = true
Boolean tngs_data = false

Float? rrs_frequency # default 0.1
Int? rrs_read_support # default 10
Float? rrl_frequency # default 0.1
Int? rrl_read_support # default 10
Float? rpob449_frequency # default 0.1
Float? etha237_frequency # default 0.1
File? expert_rule_regions_bed

String docker = "us-docker.pkg.dev/general-theiagen/theiagen/tbp-parser:1.6.0"
Int disk_size = 100
Int memory = 4

Int cpu = 1
Int disk_size = 100
String docker = "us-docker.pkg.dev/general-theiagen/theiagen/tbp-parser:2.2.2"
Int memory = 4
}
command <<<
# get version
@@ -42,10 +43,10 @@ task tbp_parser {
~{"--sequencing_method " + sequencing_method} \
~{"--operator " + operator} \
~{"--min_depth " + min_depth} \
~{"--min_percent_coverage " + coverage_threshold} \
~{"--coverage_regions " + coverage_regions_bed} \
~{"--min_frequency " + min_frequency} \
~{"--min_read_support " + min_read_support} \
~{"--min_percent_coverage " + coverage_threshold} \
~{"--coverage_regions " + coverage_regions_bed} \
~{"--tngs_expert_regions " + expert_rule_regions_bed} \
~{"--rrs_frequency " + rrs_frequency} \
~{"--rrs_read_support " + rrs_read_support} \
@@ -63,7 +64,7 @@ task tbp_parser {
echo 0.0 > AVG_DEPTH

# get genome percent coverage for the entire reference genome length over min_depth
genome=$(samtools depth -J ~{tbprofiler_bam} | awk -F "\t" '{if ($3 >= ~{min_depth}) print;}' | wc -l )
genome=$(samtools depth -J ~{tbprofiler_bam} | awk -F "\t" -v min_depth=~{min_depth} '{if ($3 >= min_depth) print;}' | wc -l )
python3 -c "print ( ($genome / 4411532 ) * 100 )" | tee GENOME_PC

# get genome average depth
127 changes: 57 additions & 70 deletions tasks/species_typing/mycobacterium/task_tbprofiler.wdl
@@ -5,84 +5,74 @@ task tbprofiler {
File read1
File? read2
String samplename

# logic
Boolean ont_data = false
Boolean tbprofiler_run_custom_db = false
File? tbprofiler_custom_db
# minimum thresholds
Int cov_frac_threshold = 1
Float min_af = 0.1
Float min_af_pred = 0.1
Int min_depth = 10
# tool options within tbprofiler

String mapper = "bwa"
String variant_caller = "freebayes"
String variant_caller = "gatk"
String? variant_calling_params
# runtime

String? additional_parameters # for tbprofiler
Int min_depth = 10
Float min_af = 0.1

File? tbprofiler_custom_db
Boolean tbprofiler_run_cdph_db = false
Boolean tbprofiler_run_custom_db = false

Int cpu = 8
Int disk_size = 100
String docker = "us-docker.pkg.dev/general-theiagen/staphb/tbprofiler:4.4.2"
String docker = "us-docker.pkg.dev/general-theiagen/staphb/tbprofiler:6.4.1"
Int memory = 16
}
command <<<
# Print and save date
date | tee DATE

# Print and save version
tb-profiler version > VERSION && sed -i -e 's/TBProfiler version //' VERSION && sed -n -i '$p' VERSION

# check if read2 is missing or empty
if [ -z "~{read2}" ] || [ ! -s "~{read2}" ] ; then
if [ -z "~{read2}" ] || [ ! -s "~{read2}" ]; then
INPUT_READS="-1 ~{read1}"
else
INPUT_READS="-1 ~{read1} -2 ~{read2}"
fi

if [ "~{ont_data}" = true ]; then
mode="--platform nanopore"
export ont_data="true"
else
export ont_data="false"
fi

# check if new database file is provided and not empty
if [ "~{tbprofiler_run_custom_db}" = true ] ; then
echo "Found new database file ~{tbprofiler_custom_db}"
prefix=$(basename "~{tbprofiler_custom_db}" | sed 's/\.tar\.gz$//')
echo "New database will be created with prefix $prefix"

echo "Inflating the new database..."
tar xfv ~{tbprofiler_custom_db}
if ~{tbprofiler_run_custom_db}; then
if [ ! -s ~{tbprofiler_custom_db} ]; then
echo "Custom database file is empty"
TBDB=""
else
echo "Found new database file ~{tbprofiler_custom_db}"
prefix=$(basename "~{tbprofiler_custom_db}" | sed 's/\.tar\.gz$//')
tar xfv ~{tbprofiler_custom_db}

tb-profiler load_library ./"$prefix"/"$prefix"

tb-profiler load_library ./"$prefix"/"$prefix"

TBDB="--db $prefix"
else
TBDB=""
TBDB="--db $prefix"
fi
elif ~{tbprofiler_run_cdph_db}; then
tb-profiler update_tbdb --branch CaliforniaDPH
TBDB="--db CaliforniaDPH"
fi

# Run tb-profiler on the input reads with samplename prefix
tb-profiler profile \
${mode} \
${INPUT_READS} \
--prefix ~{samplename} \
--mapper ~{mapper} \
--caller ~{variant_caller} \
--calling_params "~{variant_calling_params}" \
--min_depth ~{min_depth} \
--depth ~{min_depth} \
--af ~{min_af} \
--reporting_af ~{min_af_pred} \
--coverage_fraction_threshold ~{cov_frac_threshold} \
--threads ~{cpu} \
--csv --txt \
$TBDB
~{true="--platform nanopore" false="" ont_data} \
~{additional_parameters} \
${TBDB}

# Collate results
tb-profiler collate --prefix ~{samplename}

# touch optional output files because wdl
touch GENE_NAME LOCUS_TAG VARIANT_SUBSTITUTIONS OUTPUT_SEQ_METHOD_TYPE

# merge all vcf files if multiple are present
bcftools index ./vcf/*bcf
bcftools index ./vcf/*gz
@@ -97,35 +87,32 @@ task tbprofiler {
tsv_reader=csv.reader(tsv_file, delimiter="\t")
tsv_data=list(tsv_reader)
tsv_dict=dict(zip(tsv_data[0], tsv_data[1]))
with open ("MAIN_LINEAGE", 'wt') as Main_Lineage:
main_lin=tsv_dict['main_lineage']
Main_Lineage.write(main_lin)
with open ("SUB_LINEAGE", 'wt') as Sub_Lineage:
sub_lin=tsv_dict['sub_lineage']
Sub_Lineage.write(sub_lin)
with open ("DR_TYPE", 'wt') as DR_Type:
dr_type=tsv_dict['DR_type']
DR_Type.write(dr_type)
with open ("NUM_DR_VARIANTS", 'wt') as Num_DR_Variants:
num_dr_vars=tsv_dict['num_dr_variants']
Num_DR_Variants.write(num_dr_vars)
with open ("NUM_OTHER_VARIANTS", 'wt') as Num_Other_Variants:
num_other_vars=tsv_dict['num_other_variants']
Num_Other_Variants.write(num_other_vars)
with open ("RESISTANCE_GENES", 'wt') as Resistance_Genes:
res_genes_list=['rifampicin', 'isoniazid', 'pyrazinamide', 'ethambutol', 'streptomycin', 'fluoroquinolones', 'moxifloxacin', 'ofloxacin', 'levofloxacin', 'ciprofloxacin', 'aminoglycosides', 'amikacin', 'kanamycin', 'capreomycin', 'ethionamide', 'para-aminosalicylic_acid', 'cycloserine', 'linezolid', 'bedaquiline', 'clofazimine', 'delamanid']
with open ("MAIN_LINEAGE", 'wt') as main_lineage:
main_lineage.write(tsv_dict['main_lineage'])
with open ("SUB_LINEAGE", 'wt') as sublineage:
sublineage.write(tsv_dict['sub_lineage'])
with open ("DR_TYPE", 'wt') as dr_type:
dr_type.write(tsv_dict['drtype'])
with open ("NUM_DR_VARIANTS", 'wt') as num_dr_variants:
num_dr_variants.write(tsv_dict['num_dr_variants'])
with open ("NUM_OTHER_VARIANTS", 'wt') as num_other_variants:
num_other_variants.write(tsv_dict['num_other_variants'])
with open ("RESISTANCE_GENES", 'wt') as resistance_genes:
res_genes_list=['rifampicin', 'isoniazid', 'ethambutol', 'pyrazinamide', 'moxifloxacin', 'levofloxacin', 'bedaquiline', 'delamanid', 'pretomanid', 'linezolid', 'streptomycin', 'amikacin', 'kanamycin', 'capreomycin', 'clofazimine', 'ethionamide', 'para-aminosalicylic_acid', 'cycloserine']
res_genes=[]
for i in res_genes_list:
if tsv_dict[i] != '-':
res_genes.append(tsv_dict[i])
res_genes_string=';'.join(res_genes)
Resistance_Genes.write(res_genes_string)
with open ("MEDIAN_COVERAGE", 'wt') as Median_Coverage:
median_coverage=tsv_dict['median_coverage']
Median_Coverage.write(median_coverage)
with open ("PCT_READS_MAPPED", 'wt') as Pct_Reads_Mapped:
pct_reads_mapped=tsv_dict['pct_reads_mapped']
Pct_Reads_Mapped.write(pct_reads_mapped)
resistance_genes.write(res_genes_string)
with open ("MEDIAN_DEPTH", 'wt') as median_depth:
median_depth.write(tsv_dict['target_median_depth'])
with open ("PCT_READS_MAPPED", 'wt') as pct_reads_mapped:
pct_reads_mapped.write(tsv_dict['pct_reads_mapped'])
CODE
>>>
output {
@@ -134,15 +121,15 @@ task tbprofiler {
File tbprofiler_output_json = "./results/~{samplename}.results.json"
File tbprofiler_output_bam = "./bam/~{samplename}.bam"
File tbprofiler_output_bai = "./bam/~{samplename}.bam.bai"
File tbprofiler_output_vcf = "./vcf/~{samplename}.targets.csq.merged.vcf"
File? tbprofiler_output_vcf = "./vcf/~{samplename}.targets.csq.merged.vcf"
String version = read_string("VERSION")
String tbprofiler_main_lineage = read_string("MAIN_LINEAGE")
String tbprofiler_sub_lineage = read_string("SUB_LINEAGE")
String tbprofiler_dr_type = read_string("DR_TYPE")
String tbprofiler_num_dr_variants = read_string("NUM_DR_VARIANTS")
String tbprofiler_num_other_variants = read_string("NUM_OTHER_VARIANTS")
String tbprofiler_resistance_genes = read_string("RESISTANCE_GENES")
Int tbprofiler_median_coverage = read_int("MEDIAN_COVERAGE")
Float tbprofiler_median_depth = read_float("MEDIAN_DEPTH")
Float tbprofiler_pct_reads_mapped = read_float("PCT_READS_MAPPED")
}
runtime {
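Editor's note on the rewritten CODE block above: it parses the tb-profiler collate TSV by zipping the header row with the single data row into a dict, then writes one output file per field. For reference, a rough bash equivalent of that lookup; get_col and sample_collate.txt are illustrative names, not part of the task:

# print the value under a named header from a two-row TSV (header + one data row)
get_col() {
  awk -F '\t' -v col="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i }
    NR == 2 { if (idx) print $idx }
  ' "$2"
}
get_col main_lineage sample_collate.txt > MAIN_LINEAGE
get_col target_median_depth sample_collate.txt > MEDIAN_DEPTH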
2 changes: 1 addition & 1 deletion tasks/task_versioning.wdl
Original file line number Diff line number Diff line change
@@ -9,7 +9,7 @@ task version_capture {
volatile: true
}
command {
PHB_Version="PHB v2.2.1"
PHB_Version="PHB v2.3.0"
~{default='' 'export TZ=' + timezone}
date +"%Y-%m-%d" > TODAY
echo "$PHB_Version" > PHB_VERSION
80 changes: 54 additions & 26 deletions tasks/taxon_id/contamination/task_kraken2.wdl
Original file line number Diff line number Diff line change
@@ -5,48 +5,69 @@ task kraken2_theiacov {
File read1
File? read2
String samplename
String kraken2_db = "/kraken2-db"
File kraken2_db = "gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz"
Int cpu = 4
Int memory = 8
String? target_organism
Int disk_size = 100
String docker_image = "us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv"
String docker_image = "us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.1.2-no-db"
}
command <<<
# date and version control
date | tee DATE
kraken2 --version | head -n1 | tee VERSION
num_reads=$(ls *fastq.gz 2> /dev/null | wc -l)

# Decompress the Kraken2 database
mkdir db
tar -C ./db/ -xzvf ~{kraken2_db}

if ! [ -z ~{read2} ]; then
mode="--paired"
fi
echo $mode
kraken2 $mode \

# determine if reads are compressed
if [[ ~{read1} == *.gz ]]; then
echo "Reads are compressed..."
compressed="--gzip-compressed"
fi
echo $compressed

# Run Kraken2
kraken2 $mode $compressed \
--threads ~{cpu} \
--db ~{kraken2_db} \
--db ./db/ \
~{read1} ~{read2} \
--report ~{samplename}_kraken2_report.txt \
--output ~{samplename}.classifiedreads.txt

# Compress and cleanup
gzip ~{samplename}.classifiedreads.txt

# capture human percentage
percentage_human=$(grep "Homo sapiens" ~{samplename}_kraken2_report.txt | cut -f 1)
# | tee PERCENT_HUMAN
percentage_sc2=$(grep "Severe acute respiratory syndrome coronavirus 2" ~{samplename}_kraken2_report.txt | cut -f1 )
# | tee PERCENT_COV
if [ -z "$percentage_human" ] ; then percentage_human="0" ; fi
if [ -z "$percentage_sc2" ] ; then percentage_sc2="0" ; fi
echo $percentage_human | tee PERCENT_HUMAN
echo $percentage_sc2 | tee PERCENT_SC2
# capture target org percentage

# capture target org percentage
if [ ! -z "~{target_organism}" ]; then
echo "Target org designated: ~{target_organism}"
percent_target_organism=$(grep "~{target_organism}" ~{samplename}_kraken2_report.txt | cut -f1 | head -n1 )
if [ -z "$percent_target_organism" ] ; then percent_target_organism="0" ; fi
else
# if the target organism is sc2, report it in a special legacy column called PERCENT_SC2
if [[ "~{target_organism}" == "Severe acute respiratory syndrome coronavirus 2" ]]; then
percentage_sc2=$(grep "Severe acute respiratory syndrome coronavirus 2" ~{samplename}_kraken2_report.txt | cut -f1 )
percent_target_organism=""
if [ -z "$percentage_sc2" ] ; then percentage_sc2="0" ; fi
else
percentage_sc2=""
percent_target_organism=$(grep "~{target_organism}" ~{samplename}_kraken2_report.txt | cut -f1 | head -n1 )
if [ -z "$percent_target_organism" ] ; then percent_target_organism="0" ; fi
fi
else
percent_target_organism=""
percentage_sc2=""
fi
echo $percentage_sc2 | tee PERCENT_SC2
echo $percent_target_organism | tee PERCENT_TARGET_ORGANISM

>>>
@@ -55,7 +76,7 @@ task kraken2_theiacov {
String version = read_string("VERSION")
File kraken_report = "~{samplename}_kraken2_report.txt"
Float percent_human = read_float("PERCENT_HUMAN")
Float percent_sc2 = read_float("PERCENT_SC2")
String percent_sc2 = read_string("PERCENT_SC2")
String percent_target_organism = read_string("PERCENT_TARGET_ORGANISM")
String? kraken_target_organism = target_organism
File kraken2_classified_report = "~{samplename}.classifiedreads.txt.gz"
@@ -205,30 +226,37 @@ task kraken2_parse_classified {
CODE
# theiacov parsing blocks - percent human, sc2 and target organism
# capture human percentage
percentage_human=$(grep "Homo sapiens" ~{samplename}.report_parsed.txt | cut -f 1)
percentage_sc2=$(grep "Severe acute respiratory syndrome coronavirus 2" ~{samplename}.report_parsed.txt | cut -f1 )
if [ -z "$percentage_human" ] ; then percentage_human="0" ; fi
if [ -z "$percentage_sc2" ] ; then percentage_sc2="0" ; fi
echo $percentage_human | tee PERCENT_HUMAN
echo $percentage_sc2 | tee PERCENT_SC2
# capture target org percentage
if [ ! -z "~{target_organism}" ]; then
# capture target org percentage
if [ ! -z "~{target_organism}" ]; then
echo "Target org designated: ~{target_organism}"
percent_target_organism=$(grep "~{target_organism}" ~{samplename}.report_parsed.txt | cut -f1 | head -n1 )
if [ -z "$percent_target_organism" ] ; then percent_target_organism="0" ; fi
else
# if the target organism is sc2, report it in a special legacy column called PERCENT_SC2
if [[ "~{target_organism}" == "Severe acute respiratory syndrome coronavirus 2" ]]; then
percentage_sc2=$(grep "Severe acute respiratory syndrome coronavirus 2" ~{samplename}.report_parsed.txt | cut -f1 )
percent_target_organism=""
if [ -z "$percentage_sc2" ] ; then percentage_sc2="0" ; fi
else
percentage_sc2=""
percent_target_organism=$(grep "~{target_organism}" ~{samplename}.report_parsed.txt | cut -f1 | head -n1 )
if [ -z "$percent_target_organism" ] ; then percent_target_organism="0" ; fi
fi
else
percent_target_organism=""
percentage_sc2=""
fi
echo $percent_target_organism | tee PERCENT_TARGET_ORG
echo $percentage_sc2 | tee PERCENT_SC2
echo $percent_target_organism | tee PERCENT_TARGET_ORGANISM
>>>
output {
File kraken_report = "~{samplename}.report_parsed.txt"
Float percent_human = read_float("PERCENT_HUMAN")
Float percent_sc2 = read_float("PERCENT_SC2")
String percent_target_organism = read_string("PERCENT_TARGET_ORG")
String percent_sc2 = read_string("PERCENT_SC2")
String percent_target_organism = read_string("PERCENT_TARGET_ORGANISM")
String? kraken_target_organism = target_organism
}
runtime {
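Editor's note on the two kraken2 tasks above: both now share the same extraction pattern — grep the report for a taxon, take the percent column (field 1), and fall back to 0 when the taxon is absent — with PERCENT_SC2 left empty unless SARS-CoV-2 is the designated target. The empty-to-zero fallback can be written compactly with parameter expansion; report.txt and the taxon here are placeholders:

# percent abundance is field 1 of a kraken2 report; default to 0 when the taxon is missing
pct=$(grep "Homo sapiens" report.txt | head -n1 | cut -f1)
echo "${pct:-0}" | tee PERCENT_HUMAN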
18 changes: 10 additions & 8 deletions tasks/taxon_id/freyja/task_freyja.wdl
Original file line number Diff line number Diff line change
@@ -5,7 +5,8 @@ task freyja_one_sample {
File primer_trimmed_bam
String samplename
File reference_genome
File? freyja_usher_barcodes
String? freyja_pathogen
File? freyja_barcodes
File? freyja_lineage_metadata
Float? eps
Float? adapt
@@ -16,7 +17,7 @@ task freyja_one_sample {
Int? depth_cutoff
Int memory = 8
Int cpu = 2
String docker = "us-docker.pkg.dev/general-theiagen/staphb/freyja:1.5.1-07_02_2024-01-27-2024-07-22"
String docker = "us-docker.pkg.dev/general-theiagen/staphb/freyja:1.5.2-11_30_2024-02-00-2024-12-02"
Int disk_size = 100
}
command <<<
@@ -44,9 +45,9 @@ task freyja_one_sample {
freyja_metadata_version="freyja update: $(date +"%Y-%m-%d")"
else
# configure barcode
if [[ ! -z "~{freyja_usher_barcodes}" ]]; then
echo "User freyja usher barcodes identified; ~{freyja_usher_barcodes} will be utilized for freyja demixing"
freyja_usher_barcode_version=$(basename -- "~{freyja_usher_barcodes}")
if [[ ! -z "~{freyja_barcodes}" ]]; then
echo "User freyja usher barcodes identified; ~{freyja_barcodes} will be utilized for freyja demixing"
freyja_usher_barcode_version=$(basename -- "~{freyja_barcodes}")
else
freyja_usher_barcode_version="unmodified from freyja container: ~{docker}"
fi
@@ -74,9 +75,10 @@ task freyja_one_sample {
# Calculate Boostraps, if specified
if ~{bootstrap}; then
freyja boot \
~{"--pathogen" + freyja_pathogen} \
~{"--eps " + eps} \
~{"--meta " + freyja_lineage_metadata} \
~{"--barcodes " + freyja_usher_barcodes} \
~{"--barcodes " + freyja_barcodes} \
~{"--depthcutoff " + depth_cutoff} \
~{"--nb " + number_bootstraps } \
~{true='--confirmedonly' false='' confirmed_only} \
@@ -91,7 +93,7 @@ task freyja_one_sample {
freyja demix \
~{'--eps ' + eps} \
~{'--meta ' + freyja_lineage_metadata} \
~{'--barcodes ' + freyja_usher_barcodes} \
~{'--barcodes ' + freyja_barcodes} \
~{'--depthcutoff ' + depth_cutoff} \
~{true='--confirmedonly' false='' confirmed_only} \
~{'--adapt ' + adapt} \
@@ -144,7 +146,7 @@ task freyja_one_sample {
File? freyja_bootstrap_summary = "~{samplename}_summarized.csv"
File? freyja_bootstrap_summary_pdf = "~{samplename}_summarized.pdf"
# capture barcode file - first is user supplied, second appears if the user did not supply a barcode file
File freyja_usher_barcode_file = select_first([freyja_usher_barcodes, "usher_barcodes.feather"])
File freyja_barcode_file = select_first([freyja_barcodes, "usher_barcodes.feather"])
File freyja_lineage_metadata_file = select_first([freyja_lineage_metadata, "curated_lineages.json"])
String freyja_barcode_version = read_string("FREYJA_BARCODES")
String freyja_metadata_version = read_string("FREYJA_METADATA")
2 changes: 1 addition & 1 deletion tasks/taxon_id/freyja/task_freyja_dashboard.wdl
Original file line number Diff line number Diff line change
@@ -13,7 +13,7 @@ task freyja_dashboard_task {
Boolean scale_by_viral_load = false
String freyja_dashboard_title
File? dashboard_intro_text
String docker = "us-docker.pkg.dev/general-theiagen/staphb/freyja:1.5.1-07_02_2024-01-27-2024-07-22"
String docker = "us-docker.pkg.dev/general-theiagen/staphb/freyja:1.5.2-11_30_2024-02-00-2024-12-02"
Int disk_size = 100
Int memory = 4
Int cpu = 2
2 changes: 1 addition & 1 deletion tasks/taxon_id/freyja/task_freyja_plot.wdl
Original file line number Diff line number Diff line change
@@ -10,7 +10,7 @@ task freyja_plot_task {
String plot_time_interval="MS"
Int plot_day_window=14
String freyja_plot_name
String docker = "us-docker.pkg.dev/general-theiagen/staphb/freyja:1.5.1-07_02_2024-01-27-2024-07-22"
String docker = "us-docker.pkg.dev/general-theiagen/staphb/freyja:1.5.2-11_30_2024-02-00-2024-12-02"
Int disk_size = 100
Int mincov = 60
Int memory = 4
2 changes: 1 addition & 1 deletion tasks/taxon_id/freyja/task_freyja_update.wdl
Original file line number Diff line number Diff line change
@@ -2,7 +2,7 @@ version 1.0

task freyja_update_refs {
input {
String docker = "us-docker.pkg.dev/general-theiagen/staphb/freyja:1.5.1-07_02_2024-01-27-2024-07-22"
String docker = "us-docker.pkg.dev/general-theiagen/staphb/freyja:1.5.2-11_30_2024-02-00-2024-12-02"
Int disk_size = 100
Int memory = 16
Int cpu = 4
13 changes: 11 additions & 2 deletions tasks/utilities/data_export/task_broad_terra_tools.wdl
Original file line number Diff line number Diff line change
@@ -35,6 +35,10 @@ task export_taxon_tables {
Int? num_reads_raw2
String? num_reads_raw_pairs
String? fastq_scan_version
File? fastq_scan_raw1_json
File? fastq_scan_raw2_json
File? fastq_scan_clean1_json
File? fastq_scan_clean2_json
Int? num_reads_clean1
Int? num_reads_clean2
String? num_reads_clean_pairs
@@ -390,7 +394,8 @@ task export_taxon_tables {
volatile: true
}
command <<<

set -euo pipefail

# capture taxon and corresponding table names from input taxon_tables
taxon_array=($(cut -f1 ~{taxon_tables} | tail -n +2))
echo "Taxon array: ${taxon_array[*]}"
@@ -446,6 +451,10 @@ task export_taxon_tables {
"num_reads_raw2": "~{num_reads_raw2}",
"num_reads_raw_pairs": "~{num_reads_raw_pairs}",
"fastq_scan_version": "~{fastq_scan_version}",
"fastq_scan_raw1_json": "~{fastq_scan_raw1_json}",
"fastq_scan_raw2_json": "~{fastq_scan_raw2_json}",
"fastq_scan_clean1_json": "~{fastq_scan_clean1_json}",
"fastq_scan_clean2_json": "~{fastq_scan_clean2_json}",
"num_reads_clean1": "~{num_reads_clean1}",
"num_reads_clean2": "~{num_reads_clean2}",
"num_reads_clean_pairs": "~{num_reads_clean_pairs}",
@@ -778,7 +787,7 @@ task export_taxon_tables {
"agrvate_version": "~{agrvate_version}",
"agrvate_docker": "~{agrvate_docker}",
"srst2_vibrio_detailed_tsv": "~{srst2_vibrio_detailed_tsv}",
"srst2_vibrio_version": "~{srst2_vibrio_version}",~
"srst2_vibrio_version": "~{srst2_vibrio_version}",
"srst2_vibrio_docker": "~{srst2_vibrio_docker}",
"srst2_vibrio_database": "~{srst2_vibrio_database}",
"srst2_vibrio_ctxA": "~{srst2_vibrio_ctxA}",
7 changes: 6 additions & 1 deletion tasks/utilities/data_export/task_download_terra_table.wdl
Original file line number Diff line number Diff line change
@@ -12,11 +12,14 @@ task download_terra_table {
String terra_workspace_name
String terra_project_name
Int disk_size = 10
Int memory = 1
Int memory = 2
Int cpu = 1
String docker = "us-docker.pkg.dev/general-theiagen/theiagen/terra-tools:2023-06-21"
}
command <<<
# set -euo pipefail to avoid silent failure
set -euo pipefail

python3 /scripts/export_large_tsv/export_large_tsv.py --project ~{terra_project_name} --workspace ~{terra_workspace_name} --entity_type ~{terra_table_name} --tsv_filename "~{terra_table_name}.tsv"
>>>
output {
@@ -29,5 +32,7 @@ task download_terra_table {
disks: "local-disk " + disk_size + " HDD"
disk: disk_size + " GB"
dx_instance_type: "mem1_ssd1_v2_x2"
preemptible: 0 # this task may take a long time and shouldn't be preempted
maxRetries: 3
}
}
1 change: 1 addition & 0 deletions tasks/utilities/data_export/task_export_two_tsvs.wdl
Original file line number Diff line number Diff line change
@@ -18,6 +18,7 @@ task export_two_tsvs {
volatile: true
}
command <<<
set -euo pipefail
python3 /scripts/export_large_tsv/export_large_tsv.py --project ~{terra_project1} --workspace ~{terra_workspace1} --entity_type ~{datatable1} --tsv_filename "~{datatable1}_table1.tsv"

# check if second project is provided; if not, use first
62 changes: 62 additions & 0 deletions tasks/utilities/data_handling/task_fetch_srr_accession.wdl
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
version 1.0

task fetch_srr_accession {
input {
String sample_accession
String docker = "us-docker.pkg.dev/general-theiagen/biocontainers/fastq-dl:2.0.4--pyhdfd78af_0"
Int disk_size = 10
Int cpu = 2
Int memory = 8
}
meta {
volatile: true
}
command <<<
set -euo pipefail

# Output the current date and fastq-dl version for debugging
date -u | tee DATE
fastq-dl --version | tee VERSION

echo "Fetching metadata for accession: ~{sample_accession}"

# Run fastq-dl and capture stderr
fastq-dl --accession ~{sample_accession} --only-download-metadata -m 2 --verbose 2> stderr.log || true

# Determine from stderr whether the accession is valid and has SRR metadata
if grep -q "No results found for" stderr.log; then
echo "No SRR accession found" > srr_accession.txt
echo "No SRR accession found for accession: ~{sample_accession}"
elif grep -q "received an empty response" stderr.log; then
echo "No SRR accession found" > srr_accession.txt
echo "No SRR accession found for accession: ~{sample_accession}"
elif grep -q "is not a Study, Sample, Experiment, or Run accession" stderr.log; then
echo "Invalid accession: ~{sample_accession}" >&2
exit 1
elif [[ ! -f fastq-run-info.tsv ]]; then
echo "No metadata file found for accession: ~{sample_accession}" >&2
exit 1
else
# Extract SRR accessions from the TSV file if it exists
SRR_accessions=$(awk -F'\t' 'NR>1 {print $1}' fastq-run-info.tsv | paste -sd ',' -)
if [[ -z "${SRR_accessions}" ]]; then
echo "No SRR accession found" > srr_accession.txt
else
echo "Extracted SRR accessions: ${SRR_accessions}"
echo "${SRR_accessions}" > srr_accession.txt
fi
fi
>>>
output {
String srr_accession = read_string("srr_accession.txt")
String fastq_dl_version = read_string("VERSION")
}
runtime {
docker: docker
memory: "~{memory} GB"
cpu: cpu
disks: "local-disk " + disk_size + " SSD"
disk: disk_size + " GB"
preemptible: 1
}
}
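Editor's note on the extraction step in the new task above: it flattens the run column of fastq-dl's fastq-run-info.tsv into one comma-separated string. The awk/paste combination is compact enough to be easy to misread, so here is the same pipeline on a toy two-run TSV (the SRR values are made up):

# build a stand-in metadata TSV: header row, then one accession per line
printf 'run_accession\tstudy\nSRR0000001\tPRJX\nSRR0000002\tPRJX\n' > fastq-run-info.tsv

# skip the header (NR>1), keep column 1, and join the lines with commas
awk -F'\t' 'NR>1 {print $1}' fastq-run-info.tsv | paste -sd ',' -
# prints: SRR0000001,SRR0000002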
2 changes: 2 additions & 0 deletions tasks/utilities/data_handling/task_summarize_data.wdl
Original file line number Diff line number Diff line change
@@ -23,6 +23,8 @@ task summarize_data {
volatile: true
}
command <<<
set -euo pipefail

# when running on terra, comment out all input_table mentions
python3 /scripts/export_large_tsv/export_large_tsv.py --project "~{terra_project}" --workspace "~{terra_workspace}" --entity_type ~{terra_table} --tsv_filename ~{terra_table}-data.tsv

2 changes: 2 additions & 0 deletions tasks/utilities/data_handling/task_theiacov_fasta_batch.wdl
Original file line number Diff line number Diff line change
@@ -28,6 +28,8 @@ task sm_theiacov_fasta_wrangling { # the sm stands for supermassive
Int memory = 4
}
command <<<
set -euo pipefail

# check if nextclade json file exists
if [ -f ~{nextclade_json} ]; then
# this line splits into individual json files
4 changes: 4 additions & 0 deletions tasks/utilities/data_import/task_create_terra_table.wdl
Original file line number Diff line number Diff line change
@@ -146,6 +146,10 @@ task create_terra_table {
done <filelist-fullpath.txt

echo "DEBUG: terra table created, now beginning upload"

# set error handling to exit if the subsequent import_large_tsv.py task fails
set -euo pipefail

python3 /scripts/import_large_tsv/import_large_tsv.py --project "~{terra_project}" --workspace "~{terra_workspace}" --tsv terra_table_to_upload.tsv
>>>
output {
54 changes: 54 additions & 0 deletions tasks/utilities/file_handling/task_cat_lanes.wdl
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
version 1.0

task cat_lanes {
input {
String samplename

File read1_lane1
File read1_lane2
File? read1_lane3
File? read1_lane4

File? read2_lane1
File? read2_lane2
File? read2_lane3
File? read2_lane4

Int cpu = 2
Int disk_size = 50
String docker = "us-docker.pkg.dev/general-theiagen/theiagen/utility:1.2"
Int memory = 4
}
meta {
volatile: true
}
command <<<
# exit task if anything throws an error (important for proper gzip format)
set -euo pipefail

exists() { [[ -f $1 ]]; }

cat ~{read1_lane1} ~{read1_lane2} ~{read1_lane3} ~{read1_lane4} > "~{samplename}_merged_R1.fastq.gz"

if exists ~{read2_lane1} ; then
cat ~{read2_lane1} ~{read2_lane2} ~{read2_lane3} ~{read2_lane4} > "~{samplename}_merged_R2.fastq.gz"
fi

# ensure newly merged FASTQs are valid gzipped format
gzip -t *merged*.gz
>>>
output {
File read1_concatenated = "~{samplename}_merged_R1.fastq.gz"
File? read2_concatenated = "~{samplename}_merged_R2.fastq.gz"
}
runtime {
docker: "~{docker}"
memory: memory + " GB"
cpu: cpu
disks: "local-disk " + disk_size + " SSD"
disk: disk_size + " GB"
preemptible: 1
}
}
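Editor's note on the new task above: concatenating gzip files with plain cat works because the format allows multiple members per file — the concatenation of valid gzip streams is itself a valid gzip stream — which is why the task can merge lanes with cat and then verify the result with gzip -t. A quick demonstration with throwaway files:

# two small gzip members
printf 'lane1-read\n' | gzip > l1.fastq.gz
printf 'lane2-read\n' | gzip > l2.fastq.gz

# their concatenation is a valid multi-member gzip file
cat l1.fastq.gz l2.fastq.gz > merged.fastq.gz
gzip -t merged.fastq.gz           # exits 0: the merged file is intact
gzip -dc merged.fastq.gz          # prints both lines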
2 changes: 2 additions & 0 deletions tasks/utilities/file_handling/task_transfer_files.wdl
Original file line number Diff line number Diff line change
@@ -14,6 +14,8 @@ task transfer_files {
volatile: true
}
command <<<
set -euo pipefail

file_path_array="~{sep=' ' files_to_transfer}"

gsutil -m cp -n ${file_path_array[@]} ~{target_bucket}
5 changes: 4 additions & 1 deletion tasks/utilities/submission/task_mercury.wdl
Original file line number Diff line number Diff line change
@@ -23,12 +23,15 @@ task mercury {
Int cpu = 2
Int disk_size = 100
Int memory = 8
String docker = "us-docker.pkg.dev/general-theiagen/theiagen/mercury:1.0.8"
String docker = "us-docker.pkg.dev/general-theiagen/theiagen/mercury:1.0.9"
}
meta {
volatile: true
}
command <<<
# set -euo pipefail to avoid silent failure
set -euo pipefail

python3 /mercury/mercury/mercury.py -v | tee VERSION

python3 /mercury/mercury/mercury.py \
4 changes: 3 additions & 1 deletion tasks/utilities/submission/task_submission.wdl
Original file line number Diff line number Diff line change
@@ -23,6 +23,8 @@ task prune_table {
volatile: true
}
command <<<
set -euo pipefail

# when running on terra, comment out all input_table mentions
python3 /scripts/export_large_tsv/export_large_tsv.py --project "~{project_name}" --workspace "~{workspace_name}" --entity_type ~{table_name} --tsv_filename ~{table_name}-data.tsv

@@ -54,7 +56,7 @@ task prune_table {
# read export table into pandas
tablename = "~{table_name}-data.tsv"
table = pd.read_csv(tablename, delimiter='\t', header=0, dtype={"~{table_name}_id": 'str'}) # ensure sample_id is always a string)
table = pd.read_csv(tablename, delimiter='\t', header=0, dtype={"~{table_name}_id": 'str', "collection_date": 'str'}) # ensure sample_id is always a string)
# extract the samples for upload from the entire table
table = table[table["~{table_name}_id"].isin("~{sep='*' sample_names}".split("*"))]
1 change: 0 additions & 1 deletion tests/config/environment.yml
Original file line number Diff line number Diff line change
@@ -2,7 +2,6 @@ name: pytest-env-CI
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- python >=3.7
- cromwell=86
Binary file not shown.
4 changes: 3 additions & 1 deletion tests/inputs/theiacov/wf_theiacov_clearlabs.json
Original file line number Diff line number Diff line change
@@ -3,5 +3,7 @@
"theiacov_clearlabs.read1": "tests/data/theiacov/fastqs/clearlabs/clearlabs.fastq.gz",
"theiacov_clearlabs.primer_bed": "tests/data/theiacov/primers/artic-v3.primers.bed",
"theiacov_clearlabs.reference_genome": "tests/data/theiacov/reference/MN908947.fasta",
"theiacov_clearlabs.organism_parameters.gene_locations_bed_file": "tests/inputs/sc2_gene_locations.bed"
"theiacov_clearlabs.organism_parameters.gene_locations_bed_file": "tests/inputs/sc2_gene_locations.bed",
"theiacov_clearlabs.kraken2_raw.kraken2_db": "tests/data/theiacov/databases/github_kraken2_test_db.tar.gz",
"theiacov_clearlabs.kraken2_dehosted.kraken2_db": "tests/data/theiacov/databases/github_kraken2_test_db.tar.gz"
}
3 changes: 2 additions & 1 deletion tests/inputs/theiacov/wf_theiacov_illumina_pe.json
Original file line number Diff line number Diff line change
@@ -5,5 +5,6 @@
"theiacov_illumina_pe.primer_bed": "tests/data/theiacov/primers/artic-v3.primers.bed",
"theiacov_illumina_pe.reference_genome": "tests/data/theiacov/reference/MN908947.fasta",
"theiacov_illumina_pe.reference_gff": "tests/inputs/completely-empty-for-test.txt",
"theiacov_illumina_pe.reference_gene_locations_bed": "tests/inputs/sc2_gene_locations.bed"
"theiacov_illumina_pe.reference_gene_locations_bed": "tests/inputs/sc2_gene_locations.bed",
"theiacov_illumina_pe.read_QC_trim.kraken_db": "tests/data/theiacov/databases/github_kraken2_test_db.tar.gz"
}
3 changes: 2 additions & 1 deletion tests/inputs/theiacov/wf_theiacov_illumina_se.json
Original file line number Diff line number Diff line change
@@ -4,5 +4,6 @@
"theiacov_illumina_se.primer_bed": "tests/data/theiacov/primers/artic-v3.primers.bed",
"theiacov_illumina_se.reference_genome": "tests/data/theiacov/reference/MN908947.fasta",
"theiacov_illumina_se.reference_gff": "tests/inputs/completely-empty-for-test.txt",
"theiacov_illumina_se.reference_gene_locations_bed": "tests/inputs/sc2_gene_locations.bed"
"theiacov_illumina_se.reference_gene_locations_bed": "tests/inputs/sc2_gene_locations.bed",
"theiacov_illumina_se.read_QC_trim.kraken_db": "tests/data/theiacov/databases/github_kraken2_test_db.tar.gz"
}
3 changes: 2 additions & 1 deletion tests/inputs/theiacov/wf_theiacov_ont.json
Original file line number Diff line number Diff line change
@@ -3,5 +3,6 @@
"theiacov_ont.read1": "tests/data/theiacov/fastqs/ont/ont.fastq.gz",
"theiacov_ont.primer_bed": "tests/data/theiacov/primers/artic-v3.primers.bed",
"theiacov_ont.reference_genome": "tests/data/theiacov/reference/MN908947.fasta",
"theiacov_ont.reference_gene_locations_bed": "tests/inputs/sc2_gene_locations.bed"
"theiacov_ont.reference_gene_locations_bed": "tests/inputs/sc2_gene_locations.bed",
"theiacov_ont.read_qc_trim.kraken_db": "tests/data/theiacov/databases/github_kraken2_test_db.tar.gz"
}
44 changes: 21 additions & 23 deletions tests/workflows/theiacov/test_wf_theiacov_clearlabs.yml
Original file line number Diff line number Diff line change
@@ -17,7 +17,7 @@
- wf_theiacov_clearlabs_miniwdl
files:
- path: miniwdl_run/call-consensus/command
md5sum: a8e200703dedf732b45dd92b0af15f1c
md5sum: b19d5ce485c612036064c07f0a1d6a18
- path: miniwdl_run/call-consensus/inputs.json
contains: ["read1", "samplename", "fastq"]
- path: miniwdl_run/call-consensus/outputs.json
@@ -115,17 +115,16 @@
- path: miniwdl_run/call-fastq_scan_clean_reads/inputs.json
contains: ["read1", "clearlabs"]
- path: miniwdl_run/call-fastq_scan_clean_reads/outputs.json
contains: ["fastq_scan_se", "pipeline_date", "read1_seq"]
contains: ["fastq_scan_se", "read1_seq"]
- path: miniwdl_run/call-fastq_scan_clean_reads/stderr.txt
- path: miniwdl_run/call-fastq_scan_clean_reads/stderr.txt.offset
- path: miniwdl_run/call-fastq_scan_clean_reads/stdout.txt
- path: miniwdl_run/call-fastq_scan_clean_reads/task.log
contains: ["wdl", "theiacov_clearlabs", "fastq_scan_clean_reads", "done"]
- path: miniwdl_run/call-fastq_scan_clean_reads/work/DATE
- path: miniwdl_run/call-fastq_scan_clean_reads/work/READ1_SEQS
md5sum: 097e79b36919c8377c56088363e3d8b7
- path: miniwdl_run/call-fastq_scan_clean_reads/work/VERSION
md5sum: 8e4e9cdfbacc9021a3175ccbbbde002b
md5sum: a59bb42644e35c09b8fa8087156fa4c2
- path: miniwdl_run/call-fastq_scan_clean_reads/work/_miniwdl_inputs/0/clearlabs_R1_dehosted.fastq.gz
- path: miniwdl_run/call-fastq_scan_clean_reads/work/clearlabs_R1_dehosted_fastq-scan.json
md5sum: 869dd2e934c600bba35f30f08e2da7c9
@@ -134,22 +133,21 @@
- path: miniwdl_run/call-fastq_scan_raw_reads/inputs.json
contains: ["read1", "clearlabs"]
- path: miniwdl_run/call-fastq_scan_raw_reads/outputs.json
contains: ["fastq_scan_se", "pipeline_date", "read1_seq"]
contains: ["fastq_scan_se", "read1_seq"]
- path: miniwdl_run/call-fastq_scan_raw_reads/stderr.txt
- path: miniwdl_run/call-fastq_scan_raw_reads/stderr.txt.offset
- path: miniwdl_run/call-fastq_scan_raw_reads/stdout.txt
- path: miniwdl_run/call-fastq_scan_raw_reads/task.log
contains: ["wdl", "theiacov_clearlabs", "fastq_scan_raw_reads", "done"]
- path: miniwdl_run/call-fastq_scan_raw_reads/work/DATE
- path: miniwdl_run/call-fastq_scan_raw_reads/work/READ1_SEQS
md5sum: 097e79b36919c8377c56088363e3d8b7
- path: miniwdl_run/call-fastq_scan_raw_reads/work/VERSION
md5sum: 8e4e9cdfbacc9021a3175ccbbbde002b
md5sum: a59bb42644e35c09b8fa8087156fa4c2
- path: miniwdl_run/call-fastq_scan_raw_reads/work/_miniwdl_inputs/0/clearlabs.fastq.gz
- path: miniwdl_run/call-fastq_scan_raw_reads/work/clearlabs_fastq-scan.json
md5sum: 869dd2e934c600bba35f30f08e2da7c9
- path: miniwdl_run/call-kraken2_dehosted/command
md5sum: 0f9db3341b5f58fb8d145d6d94222827
md5sum: 4306699c67306b103561adf31c3754e3
- path: miniwdl_run/call-kraken2_dehosted/inputs.json
contains: ["read1", "samplename"]
- path: miniwdl_run/call-kraken2_dehosted/outputs.json
@@ -161,18 +159,18 @@
contains: ["wdl", "theiacov_clearlabs", "kraken2_dehosted", "done"]
- path: miniwdl_run/call-kraken2_dehosted/work/DATE
- path: miniwdl_run/call-kraken2_dehosted/work/PERCENT_HUMAN
md5sum: 4fd4dcef994592f9865e9bc8807f32f4
md5sum: 897316929176464ebc9ad085f31e7284
- path: miniwdl_run/call-kraken2_dehosted/work/PERCENT_SC2
md5sum: 9fc4759d176a0e0d240c418dbaaafeb2
md5sum: 86b6b8aa9ad17f169f04c02b0e2bf1b1
- path: miniwdl_run/call-kraken2_dehosted/work/PERCENT_TARGET_ORGANISM
md5sum: 68b329da9893e34099c7d8ad5cb9c940
- path: miniwdl_run/call-kraken2_dehosted/work/VERSION
md5sum: 379b99c23325315c502e74614c035e7d
md5sum: 7ad46f90cd0ffa94f32a6e06299ed05c
- path: miniwdl_run/call-kraken2_dehosted/work/_miniwdl_inputs/0/clearlabs_R1_dehosted.fastq.gz
- path: miniwdl_run/call-kraken2_dehosted/work/clearlabs_kraken2_report.txt
md5sum: 35841fa2d77ec202c275b1de548b8d98
md5sum: b66dbcf8d229c1b6fcfff4dd786068bd
- path: miniwdl_run/call-kraken2_raw/command
md5sum: a9dabf08bff8e183fd792901ce24fc57
md5sum: d6e217901b67290466eec97f13564022
- path: miniwdl_run/call-kraken2_raw/inputs.json
contains: ["read1", "samplename"]
- path: miniwdl_run/call-kraken2_raw/outputs.json
@@ -184,16 +182,16 @@
contains: ["wdl", "theiacov_clearlabs", "kraken2_raw", "done"]
- path: miniwdl_run/call-kraken2_raw/work/DATE
- path: miniwdl_run/call-kraken2_raw/work/PERCENT_HUMAN
md5sum: 4fd4dcef994592f9865e9bc8807f32f4
md5sum: 897316929176464ebc9ad085f31e7284
- path: miniwdl_run/call-kraken2_raw/work/PERCENT_SC2
md5sum: 9fc4759d176a0e0d240c418dbaaafeb2
md5sum: 86b6b8aa9ad17f169f04c02b0e2bf1b1
- path: miniwdl_run/call-kraken2_raw/work/PERCENT_TARGET_ORGANISM
md5sum: 68b329da9893e34099c7d8ad5cb9c940
- path: miniwdl_run/call-kraken2_raw/work/VERSION
md5sum: 379b99c23325315c502e74614c035e7d
md5sum: 7ad46f90cd0ffa94f32a6e06299ed05c
- path: miniwdl_run/call-kraken2_raw/work/_miniwdl_inputs/0/clearlabs.fastq.gz
- path: miniwdl_run/call-kraken2_raw/work/clearlabs_kraken2_report.txt
md5sum: 35841fa2d77ec202c275b1de548b8d98
md5sum: b66dbcf8d229c1b6fcfff4dd786068bd
- path: miniwdl_run/call-ncbi_scrub_se/command
contains: ["read1", "scrubber", "gzip"]
- path: miniwdl_run/call-ncbi_scrub_se/inputs.json
@@ -236,7 +234,7 @@
- path: miniwdl_run/call-nextclade_v3/work/nextclade_dataset_dir/genome_annotation.gff3
md5sum: 4dff84d2d6ada820e0e3a8bc6798d402
- path: miniwdl_run/call-nextclade_v3/work/nextclade_dataset_dir/pathogen.json
md5sum: a51a91e0b5e16590c1afd0c7897ad071
md5sum: 32f20640f926d5b59fed6b954541792d
- path: miniwdl_run/call-nextclade_v3/work/nextclade_dataset_dir/reference.fasta
md5sum: c7ce05f28e4ec0322c96f24e064ef55c
- path: miniwdl_run/call-nextclade_v3/work/nextclade_dataset_dir/sequences.fasta
@@ -310,15 +308,15 @@
- path: miniwdl_run/call-pangolin4/work/PANGOLIN_NOTES
md5sum: 59478efddde2191ead1b46b1f121bbc9
- path: miniwdl_run/call-pangolin4/work/PANGO_ASSIGNMENT_VERSION
md5sum: 0803245359027bd3017d2bd9a9c9c8e3
md5sum: 36f64a1cd7c6844309e8ad2121358088
- path: miniwdl_run/call-pangolin4/work/VERSION_PANGOLIN_ALL
md5sum: b5dbf2ba7480effea8c656099df0e78e
md5sum: dfd90750c8776f46bad1de214c1d1a57
- path: miniwdl_run/call-pangolin4/work/_miniwdl_inputs/0/clearlabs.medaka.consensus.fasta
md5sum: d41d8cd98f00b204e9800998ecf8427e
- path: miniwdl_run/call-pangolin4/work/clearlabs.pangolin_report.csv
md5sum: 151390c419b00ca44eb83e2bbfb96996
md5sum: 0370f24c270c44f6023dd98af79501e7
- path: miniwdl_run/call-stats_n_coverage/command
md5sum: 51da320ddc7de2ffeb263f0ddd85ced6
md5sum: ac020678f99ac145b11d3dbc7b9fe9ba
- path: miniwdl_run/call-stats_n_coverage/inputs.json
contains: ["bamfile", "samplename"]
- path: miniwdl_run/call-stats_n_coverage/outputs.json
@@ -350,7 +348,7 @@
- path: miniwdl_run/call-stats_n_coverage/work/clearlabs.stats.txt
md5sum: bfed5344c91ce6f4db1f688cac0a3ab9
- path: miniwdl_run/call-stats_n_coverage_primtrim/command
md5sum: a84f90b8877babe54bf8c068d244fbe8
md5sum: 2974f886e1959cd5eaae5e495c91f7cc
- path: miniwdl_run/call-stats_n_coverage_primtrim/inputs.json
contains: ["bamfile", "samplename"]
- path: miniwdl_run/call-stats_n_coverage_primtrim/outputs.json