diff --git a/.dockstore.yml b/.dockstore.yml index 5306d30ed..146638eb7 100644 --- a/.dockstore.yml +++ b/.dockstore.yml @@ -195,6 +195,11 @@ workflows: primaryDescriptorPath: /workflows/utilities/data_import/wf_terra_2_bq.wdl testParameterFiles: - /tests/inputs/empty.json + - name: Fetch_SRR_Accession_PHB + subclass: WDL + primaryDescriptorPath: /workflows/utilities/data_import/wf_fetch_srr_accession.wdl + testParameterFiles: + - /tests/inputs/empty.json - name: Concatenate_Column_Content_PHB subclass: WDL primaryDescriptorPath: /workflows/utilities/file_handling/wf_concatenate_column.wdl diff --git a/docs/workflows/public_data_sharing/fetch_srr_accession.md b/docs/workflows/public_data_sharing/fetch_srr_accession.md new file mode 100644 index 000000000..aa18c6438 --- /dev/null +++ b/docs/workflows/public_data_sharing/fetch_srr_accession.md @@ -0,0 +1,52 @@ +# Fetch SRR Accession Workflow + +## Quick Facts + +| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** | +|---|---|---|---|---| +| [Public Data Sharing](../../workflows_overview/workflows_type.md/#public-data-sharing) | [Any Taxa](../../workflows_overview/workflows_kingdom.md/#any-taxa) | PHB v2.3.0 | Yes | Sample-level | + +## Fetch SRR Accession + +This workflow retrieves the Sequence Read Archive (SRA) accession (SRR) associated with a given sample accession. The primary inputs are BioSample IDs (e.g., SAMN00000000) or SRA Experiment IDs (e.g., SRX000000), which link to sequencing data in the SRA repository. + +The workflow uses the fastq-dl tool to fetch metadata from SRA, then parses that metadata to extract and output the associated SRR accession(s).
+ +### Inputs + +| **Terra Task Name** | **Variable** | **Type** | **Description**| **Default Value** | **Terra Status** | +| --- | --- | --- | --- | --- | --- | +| fetch_srr_accession | **sample_accession** | String | SRA-compatible accession, such as a **BioSample ID** (e.g., "SAMN00000000") or **SRA Experiment ID** (e.g., "SRX000000"), used to retrieve SRR metadata. | | Required | +| fetch_srr | **cpu** | Int | Number of CPUs allocated for the task. | 2 | Optional | +| fetch_srr | **disk_size** | Int | Disk space in GB allocated for the task. | 10 | Optional | +| fetch_srr | **docker**| String | Docker image for metadata retrieval. | `us-docker.pkg.dev/general-theiagen/biocontainers/fastq-dl:2.0.4--pyhdfd78af_0` | Optional | +| fetch_srr | **memory** | Int | Memory in GB allocated for the task. | 8 | Optional | +| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional | +| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional | + +### Workflow Tasks + +This workflow has a single task that performs metadata retrieval for the specified sample accession. + +??? task "`fastq-dl`: Fetches SRR metadata for sample accession" + When provided a BioSample accession or SRA experiment ID, `fastq-dl` collects metadata and returns the associated SRR accession(s). + + !!! 
techdetails "fastq-dl Technical Details" + | | Links | + | --- | --- | + | Task | [Task on GitHub](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/utilities/data_handling/task_fetch_srr_accession.wdl) | + | Software Source Code | [fastq-dl Source](https://github.com/rpetit3/fastq-dl) | + | Software Documentation | [fastq-dl Documentation](https://github.com/rpetit3/fastq-dl#documentation) | + | Original Publication | [fastq-dl: A fast and reliable tool for downloading SRA metadata](https://doi.org/10.1186/s12859-021-04346-3) | + +### Outputs + +| **Variable** | **Type** | **Description**| +|---|---|---| +| srr_accession| String | The SRR accession(s) associated with the input sample accession.| +| fetch_srr_accession_version | String | The version of the fetch_srr_accession workflow. | +| fetch_srr_accession_analysis_date | String | The date the fetch_srr_accession analysis was run. | + +## References + +> Valieris, R. et al., "fastq-dl: A fast and reliable tool for downloading SRA metadata." Bioinformatics, 2021. diff --git a/docs/workflows_overview/workflows_alphabetically.md b/docs/workflows_overview/workflows_alphabetically.md index 3543d3cb9..c937e815b 100644 --- a/docs/workflows_overview/workflows_alphabetically.md +++ b/docs/workflows_overview/workflows_alphabetically.md @@ -47,6 +47,7 @@ title: Alphabetical Workflows | [**TheiaValidate**](../workflows/standalone/theiavalidate.md)| This workflow performs basic comparisons between user-designated columns in two separate tables. 
| Any taxa | | No | v2.0.0 | [TheiaValidate_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/TheiaValidate_PHB:main?tab=info) | | [**Transfer_Column_Content**](../workflows/data_export/transfer_column_content.md)| Transfer contents of a specified Terra data table column for many samples ("entities") to a GCP storage bucket location | Any taxa | Set-level | Yes | v1.3.0 | [Transfer_Column_Content_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Transfer_Column_Content_PHB:main?tab=info) | | [**Usher_PHB**](../workflows/phylogenetic_placement/usher.md)| Use UShER to rapidly and accurately place your samples on any existing phylogenetic tree | Monkeypox virus, SARS-CoV-2, Viral | Sample-level, Set-level | Yes | v2.1.0 | [Usher_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Usher_PHB:main?tab=info) | +| [**Fetch_SRR_Accession**](../workflows/public_data_sharing/fetch_srr_accession.md)| Update SRR metadata in a Terra data table at the sample level | Any taxa | Sample-level | Yes | v2.3.0 | [Fetch_SRR_Accession_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Fetch_SRR_Accession_PHB:main?tab=info) | | [**VADR_Update**](../workflows/genomic_characterization/vadr_update.md)| Update VADR assignments | HAV, Influenza, Monkeypox virus, RSV-A, RSV-B, SARS-CoV-2, Viral, WNV | Sample-level | Yes | v1.2.1 | [VADR_Update_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/VADR_Update_PHB:main?tab=info) | | [**Zip_Column_Content**](../workflows/data_export/zip_column_content.md)| Zip contents of a specified Terra data table column for many samples ("entities") | Any taxa | Set-level | Yes | v2.1.0 | [Zip_Column_Content_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Zip_Column_Content_PHB:main?tab=info) | diff --git a/docs/workflows_overview/workflows_kingdom.md 
b/docs/workflows_overview/workflows_kingdom.md index c77c7bc3d..d10fa2afd 100644 --- a/docs/workflows_overview/workflows_kingdom.md +++ b/docs/workflows_overview/workflows_kingdom.md @@ -24,6 +24,7 @@ title: Workflows by Kingdom | [**TheiaMeta**](../workflows/genomic_characterization/theiameta.md) | Genome assembly and QC from metagenomic sequencing | Any taxa | Sample-level | Yes | v2.0.0 | [TheiaMeta_Illumina_PE_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/TheiaMeta_Illumina_PE_PHB:main?tab=info) | | [**TheiaValidate**](../workflows/standalone/theiavalidate.md)| This workflow performs basic comparisons between user-designated columns in two separate tables. | Any taxa | | No | v2.0.0 | [TheiaValidate_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/TheiaValidate_PHB:main?tab=info) | | [**Transfer_Column_Content**](../workflows/data_export/transfer_column_content.md)| Transfer contents of a specified Terra data table column for many samples ("entities") to a GCP storage bucket location | Any taxa | Set-level | Yes | v1.3.0 | [Transfer_Column_Content_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Transfer_Column_Content_PHB:main?tab=info) | +| [**Fetch_SRR_Accession**](../workflows/public_data_sharing/fetch_srr_accession.md)| Update SRR metadata in a Terra data table at the sample level | Any taxa | Sample-level | Yes | v2.3.0 | [Fetch_SRR_Accession_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Fetch_SRR_Accession_PHB:main?tab=info) | | [**Zip_Column_Content**](../workflows/data_export/zip_column_content.md)| Zip contents of a specified Terra data table column for many samples ("entities") | Any taxa | Set-level | Yes | v2.1.0 | [Zip_Column_Content_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Zip_Column_Content_PHB:main?tab=info) | diff --git 
a/docs/workflows_overview/workflows_type.md b/docs/workflows_overview/workflows_type.md index 53623d7ee..14f23fd92 100644 --- a/docs/workflows_overview/workflows_type.md +++ b/docs/workflows_overview/workflows_type.md @@ -75,6 +75,7 @@ title: Workflows by Type | [**Mercury_Prep_N_Batch**](../workflows/public_data_sharing/mercury_prep_n_batch.md)| Prepare metadata and sequence data for submission to NCBI and GISAID | Influenza, Monkeypox virus, SARS-CoV-2, Viral | Set-level | No | v2.2.0 | [Mercury_Prep_N_Batch_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Mercury_Prep_N_Batch_PHB:main?tab=info) | | [**Terra_2_GISAID**](../workflows/public_data_sharing/terra_2_gisaid.md)| Upload of assembly data to GISAID | SARS-CoV-2, Viral | Set-level | Yes | v1.2.1 | [Terra_2_GISAID_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Terra_2_GISAID_PHB:main?tab=info) | | [**Terra_2_NCBI**](../workflows/public_data_sharing/terra_2_ncbi.md)| Upload of sequence data to NCBI | Bacteria, Mycotics, Viral | Set-level | No | v2.1.0 | [Terra_2_NCBI_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Terra_2_NCBI_PHB:main?tab=info) | +| [**Fetch_SRR_Accession**](../workflows/public_data_sharing/fetch_srr_accession.md)| Update SRR metadata in a Terra data table at the sample level | Any taxa | Sample-level | Yes | v2.3.0 | [Fetch_SRR_Accession_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Fetch_SRR_Accession_PHB:main?tab=info) | diff --git a/mkdocs.yml b/mkdocs.yml index cc90e4e3d..613f81b15 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -43,6 +43,7 @@ nav: - Samples_to_Ref_Tree: workflows/phylogenetic_placement/samples_to_ref_tree.md - Usher_PHB: workflows/phylogenetic_placement/usher.md - Public Data Sharing: + - Fetch_SRR_Accession: workflows/public_data_sharing/fetch_srr_accession.md - Mercury_Prep_N_Batch: 
workflows/public_data_sharing/mercury_prep_n_batch.md - Terra_2_GISAID: workflows/public_data_sharing/terra_2_gisaid.md - Terra_2_NCBI: workflows/public_data_sharing/terra_2_ncbi.md diff --git a/tasks/utilities/data_handling/task_fetch_srr_accession.wdl b/tasks/utilities/data_handling/task_fetch_srr_accession.wdl new file mode 100644 index 000000000..ab8f98440 --- /dev/null +++ b/tasks/utilities/data_handling/task_fetch_srr_accession.wdl @@ -0,0 +1,62 @@ +version 1.0 + +task fetch_srr_accession { + input { + String sample_accession + String docker = "us-docker.pkg.dev/general-theiagen/biocontainers/fastq-dl:2.0.4--pyhdfd78af_0" + Int disk_size = 10 + Int cpu = 2 + Int memory = 8 + } + meta { + volatile: true + } + command <<< + set -euo pipefail + + # Output the current date and fastq-dl version for debugging + date -u | tee DATE + fastq-dl --version | tee VERSION + + echo "Fetching metadata for accession: ~{sample_accession}" + + # Run fastq-dl and capture stderr + fastq-dl --accession ~{sample_accession} --only-download-metadata -m 2 --verbose 2> stderr.log || true + + # Handle whether the ID/accession is valid and contains SRR metadata based on stderr + if grep -q "No results found for" stderr.log; then + echo "No SRR accession found" > srr_accession.txt + echo "No SRR accession found for accession: ~{sample_accession}" + elif grep -q "received an empty response" stderr.log; then + echo "No SRR accession found" > srr_accession.txt + echo "No SRR accession found for accession: ~{sample_accession}" + elif grep -q "is not a Study, Sample, Experiment, or Run accession" stderr.log; then + echo "Invalid accession: ~{sample_accession}" >&2 + exit 1 + elif [[ ! 
-f fastq-run-info.tsv ]]; then + echo "No metadata file found for accession: ~{sample_accession}" >&2 + exit 1 + else + # Extract SRR accessions from the TSV file if it exists + SRR_accessions=$(awk -F'\t' 'NR>1 {print $1}' fastq-run-info.tsv | paste -sd ',' -) + if [[ -z "${SRR_accessions}" ]]; then + echo "No SRR accession found" > srr_accession.txt + else + echo "Extracted SRR accessions: ${SRR_accessions}" + echo "${SRR_accessions}" > srr_accession.txt + fi + fi + >>> + output { + String srr_accession = read_string("srr_accession.txt") + String fastq_dl_version = read_string("VERSION") + } + runtime { + docker: docker + memory: "~{memory} GB" + cpu: cpu + disks: "local-disk " + disk_size + " SSD" + disk: disk_size + " GB" + preemptible: 1 + } +} diff --git a/workflows/utilities/data_import/wf_fetch_srr_accession.wdl b/workflows/utilities/data_import/wf_fetch_srr_accession.wdl new file mode 100644 index 000000000..e40e54a0f --- /dev/null +++ b/workflows/utilities/data_import/wf_fetch_srr_accession.wdl @@ -0,0 +1,26 @@ +version 1.0 + +import "../../../tasks/utilities/data_handling/task_fetch_srr_accession.wdl" as srr_task +import "../../../tasks/task_versioning.wdl" as versioning_task + +workflow fetch_srr_accession { + meta { + description: "This workflow retrieves the Sequence Read Archive (SRA) accession (SRR) associated with a given sample accession. It uses the fastq-dl tool to fetch metadata from SRA and outputs the SRR accession." + } + input { + String sample_accession + } + call versioning_task.version_capture { + input: + } + call srr_task.fetch_srr_accession as fetch_srr { + input: + sample_accession = sample_accession + } + output { + String srr_accession = fetch_srr.srr_accession + # Version Captures + String fetch_srr_accession_version = version_capture.phb_version + String fetch_srr_accession_analysis_date = version_capture.date + } +}
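
Review note: the SRR extraction step in `task_fetch_srr_accession.wdl` can be checked locally without calling fastq-dl. The sketch below is only an illustration under stated assumptions: it reuses the task's own `awk`/`paste` pipeline, but the `extract_srr` function name, the two-column header, and the sample rows are made up here, and real `fastq-run-info.tsv` output has many more columns.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the PR) wrapping the same pipeline the
# task uses: skip the header row, take column 1, join values with commas.
extract_srr() {
  awk -F'\t' 'NR>1 {print $1}' "$1" | paste -sd ',' -
}

# Made-up metadata file standing in for fastq-run-info.tsv.
tmp_tsv=$(mktemp)
printf 'run_accession\tsample_accession\n' > "$tmp_tsv"
printf 'SRR0000001\tSAMN00000000\n' >> "$tmp_tsv"
printf 'SRR0000002\tSAMN00000000\n' >> "$tmp_tsv"

# Same fallback string the task writes when no run accession is present.
srr_accessions=$(extract_srr "$tmp_tsv")
echo "${srr_accessions:-No SRR accession found}"

rm -f "$tmp_tsv"
```

When a sample maps to several runs, `srr_accession` is a single comma-separated string rather than one value, which downstream consumers of the output column may need to split.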