[Retrieve_SRR_Metadata] New wf to retrieve SRR after Terra2NCBI wf #668

Merged Nov 26, 2024 (28 commits)
71a3ed6
inital commit part 1 retrieve srr from Biosample
fraser-combe Oct 29, 2024
fe2914a
update task and wf names and meta
fraser-combe Nov 4, 2024
47890c8
dockstore add
fraser-combe Nov 4, 2024
3f2ac27
Documentation and update column name
fraser-combe Nov 4, 2024
2bb0df4
update dockstore name
fraser-combe Nov 4, 2024
de5b45d
fraser-combe Nov 4, 2024
4eeb546
Remove unnecessary blank lines in fetch_srr_metadata WDL task
fraser-combe Nov 7, 2024
a0b9fec
Update SRR metadata workflow and documentation for clarity and accuracy
fraser-combe Nov 8, 2024
71a17fd
Remove redundant docker input from wf_update_srr_metadata workflow
fraser-combe Nov 8, 2024
2564f59
update
fraser-combe Nov 8, 2024
e99ea72
update dockstore
fraser-combe Nov 14, 2024
3ebb9fe
initial updates
fraser-combe Nov 14, 2024
8c80fec
handle multiple SRR accessionss as string version outputs
fraser-combe Nov 15, 2024
690ab6a
update task path
fraser-combe Nov 15, 2024
2799eaf
forgot to import task versioning
fraser-combe Nov 15, 2024
705c766
update dockstore yml
fraser-combe Nov 15, 2024
3ec8105
comma sep output as string instead of array
fraser-combe Nov 18, 2024
cbf6bcf
update wf name
fraser-combe Nov 18, 2024
26f5c4f
test local worked
fraser-combe Nov 18, 2024
7bcc842
set euo pipefail
fraser-combe Nov 20, 2024
988fc17
more explicit fail invalid biosample
fraser-combe Nov 21, 2024
f00cdd0
update logic failure
fraser-combe Nov 22, 2024
f9de101
logic handling valid biosample or SRA
fraser-combe Nov 22, 2024
26d8c49
enhance error handling and logging for biosample ID or SRA fetching
fraser-combe Nov 22, 2024
e186b30
Update logic for no SRR accessions and invalid samples
fraser-combe Nov 22, 2024
4995aa2
update docs version in table
fraser-combe Nov 22, 2024
e4a5bec
add sample level to docs
fraser-combe Nov 25, 2024
6cba0fc
update input and ouptut tables
fraser-combe Nov 26, 2024
5 changes: 5 additions & 0 deletions .dockstore.yml
@@ -195,6 +195,11 @@ workflows:
    primaryDescriptorPath: /workflows/utilities/data_import/wf_terra_2_bq.wdl
    testParameterFiles:
      - /tests/inputs/empty.json
  - name: Fetch_SRR_Accession_PHB
    subclass: WDL
    primaryDescriptorPath: /workflows/utilities/data_import/wf_fetch_srr_accession.wdl
    testParameterFiles:
      - /tests/inputs/empty.json
  - name: Concatenate_Column_Content_PHB
    subclass: WDL
    primaryDescriptorPath: /workflows/utilities/file_handling/wf_concatenate_column.wdl
52 changes: 52 additions & 0 deletions docs/workflows/public_data_sharing/fetch_srr_accession.md
@@ -0,0 +1,52 @@
# Fetch SRR Accession Workflow

## Quick Facts

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Public Data Sharing](../../workflows_overview/workflows_type.md/#public-data-sharing) | [Any Taxa](../../workflows_overview/workflows_kingdom.md/#any-taxa) | PHB v2.3.0 | Yes | Sample-level |

## Fetch SRR Accession

This workflow retrieves the Sequence Read Archive (SRA) accession (SRR) associated with a given sample accession. The primary inputs are BioSample IDs (e.g., SAMN00000000) or SRA Experiment IDs (e.g., SRX000000), which link to sequencing data in the SRA repository.

The workflow uses the fastq-dl tool to fetch metadata from SRA, parses that metadata to extract the associated SRR accession, and outputs it.
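The parsing step can be sketched in isolation. The snippet below is a minimal illustration, not the task itself: it builds a small mock of `fastq-run-info.tsv` (the two column names and the accessions are placeholders; the real file is written by fastq-dl and has more columns) and applies the same `awk`/`paste` pipeline that the task in this PR uses to join run accessions with commas.

```shell
# Mock metadata file: header row plus two runs (placeholder values).
workdir=$(mktemp -d)
printf 'run_accession\tsample_accession\n' > "${workdir}/fastq-run-info.tsv"
printf 'SRR0000001\tSAMN00000000\n' >> "${workdir}/fastq-run-info.tsv"
printf 'SRR0000002\tSAMN00000000\n' >> "${workdir}/fastq-run-info.tsv"

# Skip the header, take the first column, and join the rows with commas.
srr_accessions=$(awk -F'\t' 'NR>1 {print $1}' "${workdir}/fastq-run-info.tsv" | paste -sd ',' -)
echo "${srr_accessions}"
```

With the mock file above, the pipeline yields a single comma-separated string containing both run accessions.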

### Inputs

| **Terra Task Name** | **Variable** | **Type** | **Description**| **Default Value** | **Terra Status** |
| --- | --- | --- | --- | --- | --- |
| fetch_srr_metadata | **sample_accession** | String | SRA-compatible accession, such as a **BioSample ID** (e.g., "SAMN00000000") or **SRA Experiment ID** (e.g., "SRX000000"), used to retrieve SRR metadata. | | Required |
| fetch_srr_metadata | **cpu** | Int | Number of CPUs allocated for the task. | 2 | Optional |
| fetch_srr_metadata | **disk_size** | Int | Disk space in GB allocated for the task. | 10 | Optional |
| fetch_srr_metadata | **docker**| String | Docker image for metadata retrieval. | `us-docker.pkg.dev/general-theiagen/biocontainers/fastq-dl:2.0.4--pyhdfd78af_0` | Optional |
| fetch_srr_metadata | **memory** | Int | Memory in GB allocated for the task. | 8 | Optional |
| version_capture | **docker** | String | The Docker container to use for the task. | `us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0` | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
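Because `sample_accession` is a free-form string, it can be useful to sanity-check values before launching the workflow. The helper below is a hypothetical sketch and not part of this PR. The regular expressions are assumptions about common BioSample (`SAMN`, `SAMEA`, `SAMD`) and SRA Experiment (`SRX`, `ERX`, `DRX`) prefixes; fastq-dl itself remains the authority on what it accepts.

```shell
# Hypothetical pre-check: succeeds (exit status 0) when the string looks like
# a BioSample or SRA Experiment accession. The patterns are assumptions.
looks_like_sample_accession() {
  local accession="$1"
  [[ "${accession}" =~ ^SAM[NED][A-Z]?[0-9]+$ ]] || [[ "${accession}" =~ ^[SED]RX[0-9]+$ ]]
}

for accession in SAMN00000000 SRX000000 not_an_accession; do
  if looks_like_sample_accession "${accession}"; then
    echo "${accession}: ok"
  else
    echo "${accession}: unexpected format"
  fi
done
```

A check like this only catches obvious typos; an accession that is well formed but absent from SRA is still handled by the task itself.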

### Workflow Tasks

This workflow has a single task that performs metadata retrieval for the specified sample accession.

??? task "`fastq-dl`: Fetches SRR metadata for sample accession"
    When provided a BioSample accession or SRA experiment ID, `fastq-dl` collects metadata and returns the appropriate SRR accession.

    !!! techdetails "fastq-dl Technical Details"
        |  | Links |
        | --- | --- |
        | Task | [Task on GitHub](https://github.com/theiagen-org/phb-workflows/blob/main/tasks/utilities/data_handling/task_fetch_srr_metadata.wdl) |
        | Software Source Code | [fastq-dl Source](https://github.com/rvalieris/fastq-dl) |
        | Software Documentation | [fastq-dl Documentation](https://github.com/rvalieris/fastq-dl#documentation) |
        | Original Publication | [fastq-dl: A fast and reliable tool for downloading SRA metadata](https://doi.org/10.1186/s12859-021-04346-3) |

### Outputs

| **Variable** | **Type** | **Description**|
|---|---|---|
| srr_accession | String | The SRR accession(s) associated with the input sample accession. Multiple accessions are returned as a single comma-separated string; the value is `No SRR accession found` when none exist. |
| fetch_srr_accession_version | String | The version of the fetch_srr_accession workflow. |
| fetch_srr_accession_analysis_date | String | The date the fetch_srr_accession analysis was run. |
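Since `srr_accession` is a single string rather than an array, downstream scripts need to split it themselves. The following is a minimal sketch of one way to do that in bash; the example value is a placeholder, and the sentinel string is the one written by the task in this PR.

```shell
# Placeholder value standing in for the workflow output.
srr_accession="SRR0000001,SRR0000002"

if [[ "${srr_accession}" == "No SRR accession found" ]]; then
  echo "nothing to process"
else
  # Split on commas into a bash array, then handle each run separately.
  IFS=',' read -r -a runs <<< "${srr_accession}"
  for run in "${runs[@]}"; do
    echo "run: ${run}"
  done
fi
```

Checking for the sentinel first avoids treating the "not found" message as if it were an accession.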

## References

> Valieris, R. et al., "fastq-dl: A fast and reliable tool for downloading SRA metadata." Bioinformatics, 2021.
1 change: 1 addition & 0 deletions docs/workflows_overview/workflows_alphabetically.md
@@ -47,6 +47,7 @@ title: Alphabetical Workflows
| [**TheiaValidate**](../workflows/standalone/theiavalidate.md)| This workflow performs basic comparisons between user-designated columns in two separate tables. | Any taxa | | No | v2.0.0 | [TheiaValidate_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/TheiaValidate_PHB:main?tab=info) |
| [**Transfer_Column_Content**](../workflows/data_export/transfer_column_content.md)| Transfer contents of a specified Terra data table column for many samples ("entities") to a GCP storage bucket location | Any taxa | Set-level | Yes | v1.3.0 | [Transfer_Column_Content_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Transfer_Column_Content_PHB:main?tab=info) |
| [**Usher_PHB**](../workflows/phylogenetic_placement/usher.md)| Use UShER to rapidly and accurately place your samples on any existing phylogenetic tree | Monkeypox virus, SARS-CoV-2, Viral | Sample-level, Set-level | Yes | v2.1.0 | [Usher_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Usher_PHB:main?tab=info) |
| [**Fetch_SRR_Accession**](../workflows/public_data_sharing/fetch_srr_accession.md)| Update SRR metadata in a Terra data table at the sample level | Any taxa | Sample-level | Yes | v2.3.0 | [Fetch_SRR_Accession_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Fetch_SRR_Accession_PHB:main?tab=info) |
| [**VADR_Update**](../workflows/genomic_characterization/vadr_update.md)| Update VADR assignments | HAV, Influenza, Monkeypox virus, RSV-A, RSV-B, SARS-CoV-2, Viral, WNV | Sample-level | Yes | v1.2.1 | [VADR_Update_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/VADR_Update_PHB:main?tab=info) |
| [**Zip_Column_Content**](../workflows/data_export/zip_column_content.md)| Zip contents of a specified Terra data table column for many samples ("entities") | Any taxa | Set-level | Yes | v2.1.0 | [Zip_Column_Content_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Zip_Column_Content_PHB:main?tab=info) |

1 change: 1 addition & 0 deletions docs/workflows_overview/workflows_kingdom.md
@@ -24,6 +24,7 @@ title: Workflows by Kingdom
| [**TheiaMeta**](../workflows/genomic_characterization/theiameta.md) | Genome assembly and QC from metagenomic sequencing | Any taxa | Sample-level | Yes | v2.0.0 | [TheiaMeta_Illumina_PE_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/TheiaMeta_Illumina_PE_PHB:main?tab=info) |
| [**TheiaValidate**](../workflows/standalone/theiavalidate.md)| This workflow performs basic comparisons between user-designated columns in two separate tables. | Any taxa | | No | v2.0.0 | [TheiaValidate_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/TheiaValidate_PHB:main?tab=info) |
| [**Transfer_Column_Content**](../workflows/data_export/transfer_column_content.md)| Transfer contents of a specified Terra data table column for many samples ("entities") to a GCP storage bucket location | Any taxa | Set-level | Yes | v1.3.0 | [Transfer_Column_Content_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Transfer_Column_Content_PHB:main?tab=info) |
| [**Fetch_SRR_Accession**](../workflows/public_data_sharing/fetch_srr_accession.md)| Update SRR metadata in a Terra data table at the sample level | Any taxa | Sample-level | Yes | v2.3.0 | [Fetch_SRR_Accession_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Fetch_SRR_Accession_PHB:main?tab=info) |
| [**Zip_Column_Content**](../workflows/data_export/zip_column_content.md)| Zip contents of a specified Terra data table column for many samples ("entities") | Any taxa | Set-level | Yes | v2.1.0 | [Zip_Column_Content_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Zip_Column_Content_PHB:main?tab=info) |

</div>
1 change: 1 addition & 0 deletions docs/workflows_overview/workflows_type.md
@@ -75,6 +75,7 @@ title: Workflows by Type
| [**Mercury_Prep_N_Batch**](../workflows/public_data_sharing/mercury_prep_n_batch.md)| Prepare metadata and sequence data for submission to NCBI and GISAID | Influenza, Monkeypox virus, SARS-CoV-2, Viral | Set-level | No | v2.2.0 | [Mercury_Prep_N_Batch_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Mercury_Prep_N_Batch_PHB:main?tab=info) |
| [**Terra_2_GISAID**](../workflows/public_data_sharing/terra_2_gisaid.md)| Upload of assembly data to GISAID | SARS-CoV-2, Viral | Set-level | Yes | v1.2.1 | [Terra_2_GISAID_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Terra_2_GISAID_PHB:main?tab=info) |
| [**Terra_2_NCBI**](../workflows/public_data_sharing/terra_2_ncbi.md)| Upload of sequence data to NCBI | Bacteria, Mycotics, Viral | Set-level | No | v2.1.0 | [Terra_2_NCBI_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Terra_2_NCBI_PHB:main?tab=info) |
| [**Fetch_SRR_Accession**](../workflows/public_data_sharing/fetch_srr_accession.md)| Update SRR metadata in a Terra data table at the sample level | Any taxa | Sample-level | Yes | v2.3.0 | [Fetch_SRR_Accession_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Fetch_SRR_Accession_PHB:main?tab=info) |

</div>

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -43,6 +43,7 @@ nav:
      - Samples_to_Ref_Tree: workflows/phylogenetic_placement/samples_to_ref_tree.md
      - Usher_PHB: workflows/phylogenetic_placement/usher.md
    - Public Data Sharing:
      - Fetch_SRR_Accession: workflows/public_data_sharing/fetch_srr_accession.md
      - Mercury_Prep_N_Batch: workflows/public_data_sharing/mercury_prep_n_batch.md
      - Terra_2_GISAID: workflows/public_data_sharing/terra_2_gisaid.md
      - Terra_2_NCBI: workflows/public_data_sharing/terra_2_ncbi.md
62 changes: 62 additions & 0 deletions tasks/utilities/data_handling/task_fetch_srr_accession.wdl
@@ -0,0 +1,62 @@
version 1.0

task fetch_srr_accession {
  input {
    String sample_accession
    String docker = "us-docker.pkg.dev/general-theiagen/biocontainers/fastq-dl:2.0.4--pyhdfd78af_0"
    Int disk_size = 10
    Int cpu = 2
    Int memory = 8
  }
  meta {
    volatile: true
  }
  command <<<
    set -euo pipefail

    # Output the current date and fastq-dl version for debugging
    date -u | tee DATE
    fastq-dl --version | tee VERSION

    echo "Fetching metadata for accession: ~{sample_accession}"

    # Run fastq-dl and capture stderr
    fastq-dl --accession ~{sample_accession} --only-download-metadata -m 2 --verbose 2> stderr.log || true

    # Handle whether the ID/accession is valid and contains SRR metadata based on stderr
    if grep -q "No results found for" stderr.log; then
      echo "No SRR accession found" > srr_accession.txt
      echo "No SRR accession found for accession: ~{sample_accession}"
    elif grep -q "received an empty response" stderr.log; then
      echo "No SRR accession found" > srr_accession.txt
      echo "No SRR accession found for accession: ~{sample_accession}"
    elif grep -q "is not a Study, Sample, Experiment, or Run accession" stderr.log; then
      echo "Invalid accession: ~{sample_accession}" >&2
      exit 1
    elif [[ ! -f fastq-run-info.tsv ]]; then
      echo "No metadata file found for accession: ~{sample_accession}" >&2
      exit 1
    else
      # Extract SRR accessions from the TSV file if it exists
      SRR_accessions=$(awk -F'\t' 'NR>1 {print $1}' fastq-run-info.tsv | paste -sd ',' -)
      if [[ -z "${SRR_accessions}" ]]; then
        echo "No SRR accession found" > srr_accession.txt
      else
        echo "Extracted SRR accessions: ${SRR_accessions}"
        echo "${SRR_accessions}" > srr_accession.txt
      fi
    fi
  >>>
  output {
    String srr_accession = read_string("srr_accession.txt")
    String fastq_dl_version = read_string("VERSION")
  }
  runtime {
    docker: docker
    memory: "~{memory} GB"
    cpu: cpu
    disks: "local-disk " + disk_size + " SSD"
    disk: disk_size + " GB"
    preemptible: 1
  }
}
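The branching on `stderr.log` in the task above can be exercised without calling fastq-dl. The sketch below condenses that logic into a hypothetical function, `classify_stderr`, which is not part of the PR. The matched substrings are copied from the task; the surrounding log lines and the returned labels are made up for illustration.

```shell
# Hypothetical helper mirroring the task's stderr checks.
# Prints one of: no_srr, invalid_accession, check_tsv.
classify_stderr() {
  local log="$1"
  if grep -q -e "No results found for" -e "received an empty response" "${log}"; then
    echo "no_srr"             # valid accession, but nothing in SRA
  elif grep -q "is not a Study, Sample, Experiment, or Run accession" "${log}"; then
    echo "invalid_accession"  # the task exits 1 in this case
  else
    echo "check_tsv"          # fall through to the fastq-run-info.tsv checks
  fi
}

log=$(mktemp)
echo "No results found for SAMN00000000" > "${log}"   # made-up log line
classify_stderr "${log}"
echo "XYZ is not a Study, Sample, Experiment, or Run accession" > "${log}"
classify_stderr "${log}"
```

Keeping "no results" separate from "invalid accession" matches the task's behaviour: the first writes a sentinel value and succeeds, while the second fails the task.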
26 changes: 26 additions & 0 deletions workflows/utilities/data_import/wf_fetch_srr_accession.wdl
@@ -0,0 +1,26 @@
version 1.0

import "../../../tasks/utilities/data_handling/task_fetch_srr_accession.wdl" as srr_task
import "../../../tasks/task_versioning.wdl" as versioning_task

workflow fetch_srr_accession {
  meta {
    description: "This workflow retrieves the Sequence Read Archive (SRA) accession (SRR) associated with a given sample accession. It uses the fastq-dl tool to fetch metadata from SRA and outputs the SRR accession."
  }
  input {
    String sample_accession
  }
  call versioning_task.version_capture {
    input:
  }
  call srr_task.fetch_srr_accession as fetch_srr {
    input:
      sample_accession = sample_accession
  }
  output {
    String srr_accession = fetch_srr.srr_accession
    # Version Captures
    String fetch_srr_accession_version = version_capture.phb_version
    String fetch_srr_accession_analysis_date = version_capture.date
  }
}