Merge pull request #6 from delocalizer/htsget

Htsget
AustralianBioCommons · Jul 15, 2024 · 6c91613
2 parents 20d1d2f + 712254f
commit 6c91613

Showing 6 changed files with 63 additions and 7 deletions.
diff --git a/.github/workflows/github-pages.yml b/.github/workflows/github-pages.yml
@@ -1,7 +1,17 @@
 on:
+  # Runs on pushes targeting the default branch
   push:
-    branches:
-      - main  # Here source code branch is `master`, it could be other branch
+    branches: ["main"]
+
+  # Allows you to run this workflow manually from the Actions tab
+  workflow_dispatch:
+
+
+# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
+# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
+concurrency:
+  group: "pages"
+  cancel-in-progress: false
 
 jobs:
   fetch_references_update_tools:

diff --git a/.gitignore b/.gitignore
@@ -5,3 +5,4 @@ _site/
 Gemfile.lock
 _local
 .idea
+.bundle
diff --git a/_bibliography/references.bib b/_bibliography/references.bib
@@ -290,6 +290,7 @@ @inproceedings{basney_cilogon_2019
 	volume = {351},
 	shorttitle = {{CILogon}},
 	url = {https://pos.sissa.it/351/031},
+	doi = {10.22323/1.351.0031},
 	abstract = {CILogon provides a software platform that enables scientists to work together to meet their identity and access management (IAM) needs more effectively so they can allocate more time and effort to their core mission of scientific research. CILogon builds on open source Shibboleth and COmanage software to provide an integrated IAM platform for science, federated worldwide via eduGAIN. CILogon serves the unique needs of research collaborations, namely to dynamically form collaboration groups across organizations and countries, sharing access to data, instruments, compute clusters, and other resources to enable scientific discovery. We operate CILogon via a software-as-a-service model to ease integration with a variety of science applications, while making all CILogon software components publicly available under open source licenses to enable re-use. Since CILogon operations began in 2010, our service has expanded from a federated X.509 certification authority (CA) to an OpenID Connect provider, SAML Attribute Authority, and multi-tenant collaboration platform. In this article, we describe the current CILogon system.},
 	language = {en},
 	urldate = {2023-08-22},
@@ -298,8 +299,6 @@ @inproceedings{basney_cilogon_2019
 	author = {Basney, Jim and Flanagan, Heather and Fleury, Terry and Gaynor, Jeff and Koranda, Scott and Oshrin, Benn},
 	month = nov,
 	year = {2019},
-	doi = {10.22323/1.351.0031},
-	note = {Conference Name: International Symposium on Grids \&amp; Clouds 2019},
 	pages = {031},
 }
 

diff --git a/_data/CONTRIBUTORS.yml b/_data/CONTRIBUTORS.yml
@@ -37,4 +37,11 @@ Irene Hung:
     email: [email protected]
     orcid: 0000-0002-6139-7980
     role: editor
-    affiliation: Australian BioCommons / University of Melbourne
+    affiliation: Australian BioCommons / University of Melbourne
+
+Conrad Leonard:
+    git: delocalizer
+    email: [email protected]
+    orcid: 0000-0002-4131-2065
+    role: editor
+    affiliation: Australian BioCommons / University of Melbourne
diff --git a/images/tools/htsget_figure_1.png b/images/tools/htsget_figure_1.png
diff --git a/pages/technologies_standards/htsget.md b/pages/technologies_standards/htsget.md
@@ -4,6 +4,45 @@ page_id: htsget
 type: technologies_standards
 toc: true
 description: GA4GH standard for streaming of genomic files
-contributors: [Marion Shadbolt]
+contributors: [Marion Shadbolt, Conrad Leonard]
 affiliations: [GA4GH]
----
+---
+
+**htsget** is a data transfer protocol designed to facilitate efficient and secure access to genomic data stored in various formats such as BAM, CRAM, and VCF. The fundamental goal of htsget is to introduce a standardized interface for requesting and delivering genomic data that is not bound by file semantics.
+
+### Overview of htsget
+
+**htsget** (HTTP Sequence GET) is a protocol developed under the Global Alliance for Genomics and Health (GA4GH). It allows clients to request genomic data over HTTP in a manner that is efficient and secure. The protocol is designed to support the retrieval of specific data regions without the need to download entire files, which is crucial for working with large genomic datasets.
+
+### What it is not
+The protocol does not attempt to provide an end-to-end solution for managing genomic data. Issues around the organization of metadata and data discovery are outside the scope of this protocol. The protocol explicitly does not provide a way to discover the identifiers for valid ReadGroupSets; clients obtain these via some out-of-band mechanism.
+
+### Key Features of htsget
+
+1. **Region-Specific Data Retrieval**: htsget allows clients to specify genomic regions of interest. This means users can efficiently retrieve only the parts of the data they need, rather than downloading entire files, significantly reducing data transfer times and bandwidth usage.
+
+2. **Standardized API**: The protocol defines a standard API for requesting data. This standardization ensures interoperability between different systems and tools within the genomics community.
+
+3. **Security**: The protocol supports secure data transfer mechanisms, ensuring that sensitive genomic data is protected during transmission.
+
+### How htsget Works
+
+The key mechanic of the protocol is that the client makes an HTTP(S) GET request to a URL (determined via another discovery service), specifying a preferred format and an optional genomic range (Fig. 1). The server returns a small JSON block with a list of URLs. The client downloads the data from those URLs and concatenates it in the order provided by the server to produce the full result of the query.
+
+{% include image.html file="/tools/htsget_figure_1.png" caption="Figure 1. Schematic of htsget protocol" alt="Schematic of htsget protocol" max-width="100" %}
+
+
+1. **Client Request**: A client (e.g., a researcher or an analysis tool) makes a request to an htsget server, specifying the genomic regions and data formats they need.
+
+2. **Server Response**: The htsget server processes the request and generates URLs for the requested data segments. These URLs are provided to the client, often along with metadata about the data segments.
+
+3. **Data Retrieval**: The client uses the provided URLs to download the specific data segments. This process can be repeated as needed to retrieve additional regions or data formats.
+
+### Example Scenario
+
+Suppose a researcher is interested in a specific gene on chromosome 12. Instead of downloading the entire BAM file containing sequencing data for a whole genome, the researcher can use htsget to request only the data for the specific region of chromosome 12. The htsget server will respond with URLs to the specific data segments, which the researcher can then download and analyze.
+
+
+### References
+* [Original paper](https://doi.org/10.1093/bioinformatics/bty492)
+* [Current version of the spec](https://samtools.github.io/hts-specs/htsget.html)
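
Illustrative note on the `htsget.md` text above (not part of this commit): the request, ticket, and concatenation steps described under "How htsget Works" can be sketched in Python roughly as follows. This is a minimal sketch, not a reference client. The server base URL and the record ID `sample1` are hypothetical, and the `/reads/<id>` path, the `format`/`referenceName`/`start`/`end` query parameters, and the `htsget.urls` response shape are assumed from my reading of the linked spec. Error handling, authentication, and retries are left out.

```python
import base64
from urllib.parse import unquote_to_bytes, urlencode
from urllib.request import Request, urlopen


def build_ticket_url(base_url, record_id, fmt="BAM",
                     reference_name=None, start=None, end=None):
    """Build the GET URL for an htsget reads ticket (assumed endpoint layout)."""
    params = {"format": fmt}
    if reference_name is not None:
        params["referenceName"] = reference_name
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    return f"{base_url.rstrip('/')}/reads/{record_id}?{urlencode(params)}"


def fetch_block(block):
    """Return the bytes for one entry of the ticket's URL list."""
    url = block["url"]
    if url.startswith("data:"):
        # Inline payload: decode it locally instead of making a request.
        header, _, payload = url.partition(",")
        if header.endswith(";base64"):
            return base64.b64decode(payload)
        return unquote_to_bytes(payload)
    # Remote payload: send any headers the server asked for in the ticket.
    request = Request(url, headers=block.get("headers", {}))
    with urlopen(request) as response:
        return response.read()


def assemble(ticket, fetch=fetch_block):
    """Concatenate all blocks in the order the server listed them."""
    return b"".join(fetch(block) for block in ticket["htsget"]["urls"])
```

For the chromosome 12 scenario, a client would call something like `build_ticket_url("https://htsget.example.org", "sample1", reference_name="chr12", start=100, end=200)`, fetch and parse the JSON ticket from that URL, and pass it to `assemble` to obtain the bytes of a BAM file covering the region. Passing a custom `fetch` callable keeps the concatenation logic independent of the transport.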