
Commit

Merge pull request #6 from delocalizer/htsget
Htsget
JoshuaHarris391 authored Jul 15, 2024
2 parents 20d1d2f + 712254f commit 6c91613
Showing 6 changed files with 63 additions and 7 deletions.
14 changes: 12 additions & 2 deletions .github/workflows/github-pages.yml
@@ -1,7 +1,17 @@
on:
# Runs on pushes targeting the default branch
push:
branches:
- main # Here source code branch is `master`, it could be other branch
branches: ["main"]

# Allows you to run this workflow manually from the Actions tab
workflow_dispatch:


# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
concurrency:
group: "pages"
cancel-in-progress: false

jobs:
fetch_references_update_tools:
1 change: 1 addition & 0 deletions .gitignore
@@ -5,3 +5,4 @@ _site/
Gemfile.lock
_local
.idea
.bundle
3 changes: 1 addition & 2 deletions _bibliography/references.bib
@@ -290,6 +290,7 @@ @inproceedings{basney_cilogon_2019
volume = {351},
shorttitle = {{CILogon}},
url = {https://pos.sissa.it/351/031},
doi = {10.22323/1.351.0031},
abstract = {CILogon provides a software platform that enables scientists to work together to meet their identity and access management (IAM) needs more effectively so they can allocate more time and effort to their core mission of scientific research. CILogon builds on open source Shibboleth and COmanage software to provide an integrated IAM platform for science, federated worldwide via eduGAIN. CILogon serves the unique needs of research collaborations, namely to dynamically form collaboration groups across organizations and countries, sharing access to data, instruments, compute clusters, and other resources to enable scientific discovery. We operate CILogon via a software-as-a-service model to ease integration with a variety of science applications, while making all CILogon software components publicly available under open source licenses to enable re-use. Since CILogon operations began in 2010, our service has expanded from a federated X.509 certification authority (CA) to an OpenID Connect provider, SAML Attribute Authority, and multi-tenant collaboration platform. In this article, we describe the current CILogon system.},
language = {en},
urldate = {2023-08-22},
@@ -298,8 +299,6 @@ @inproceedings{basney_cilogon_2019
author = {Basney, Jim and Flanagan, Heather and Fleury, Terry and Gaynor, Jeff and Koranda, Scott and Oshrin, Benn},
month = nov,
year = {2019},
doi = {10.22323/1.351.0031},
note = {Conference Name: International Symposium on Grids \& Clouds 2019},
pages = {031},
}

9 changes: 8 additions & 1 deletion _data/CONTRIBUTORS.yml
@@ -37,4 +37,11 @@ Irene Hung:
email: [email protected]
orcid: 0000-0002-6139-7980
role: editor
affiliation: Australian BioCommons / University of Melbourne
affiliation: Australian BioCommons / University of Melbourne

Conrad Leonard:
git: delocalizer
email: [email protected]
orcid: 0000-0002-4131-2065
role: editor
affiliation: Australian BioCommons / University of Melbourne
Binary file added images/tools/htsget_figure_1.png
43 changes: 41 additions & 2 deletions pages/technologies_standards/htsget.md
@@ -4,6 +4,45 @@ page_id: htsget
type: technologies_standards
toc: true
description: GA4GH standard for streaming of genomic files
contributors: [Marion Shadbolt]
contributors: [Marion Shadbolt, Conrad Leonard]
affiliations: [GA4GH]
---
---

**htsget** is a data transfer protocol designed to facilitate efficient and secure access to genomic data stored in various formats such as BAM, CRAM, and VCF. The fundamental goal of htsget is to introduce a standardized interface for requesting and delivering genomic data that is not bound by file semantics.

### Overview of htsget

**htsget** is a protocol developed under the Global Alliance for Genomics and Health (GA4GH). It lets clients request genomic data over HTTP efficiently and securely, retrieving specific regions without downloading entire files, which is crucial when working with large genomic datasets.

### What it is not
The protocol does not attempt to provide an end-to-end solution for managing genomic data. Metadata organization and data discovery are outside its scope. In particular, the protocol explicitly provides no way to discover the identifiers of valid ReadGroupSets; clients obtain these through some out-of-band mechanism.

### Key Features of htsget

1. **Region-Specific Data Retrieval**: htsget allows clients to specify genomic regions of interest. This means users can efficiently retrieve only the parts of the data they need, rather than downloading entire files, significantly reducing data transfer times and bandwidth usage.

2. **Standardized API**: The protocol defines a standard API for requesting data. This standardization ensures interoperability between different systems and tools within the genomics community.

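3. **Security**: The protocol is served over HTTPS, and the server's response can include per-URL headers (such as authorization tokens) that the client passes along when downloading, so that sensitive genomic data is protected during transmission.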

### How htsget Works

The key mechanic of the protocol is that the client issues an HTTP(S) GET request to a URL (determined via another discovery service), specifying a preferred format and an optional genomic range (Fig. 1). The server returns a small JSON block with a list of URLs. The client downloads the data from those URLs and concatenates it in the order provided by the server to produce the full result of the query.

{% include image.html file="/tools/htsget_figure_1.png" caption="Figure 1. Schematic of htsget protocol" alt="Schematic of htsget protocol" max-width="100" %}


1. **Client Request**: A client (e.g., a researcher or an analysis tool) makes a request to an htsget server, specifying the genomic regions and data formats they need.

2. **Server Response**: The htsget server processes the request and generates URLs for the requested data segments. These URLs are provided to the client, often along with metadata about the data segments.

3. **Data Retrieval**: The client uses the provided URLs to download the specific data segments. This process can be repeated as needed to retrieve additional regions or data formats.
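
The three steps above can be sketched in Python. This is a minimal illustration rather than a reference client: it assumes the server's JSON response (the "ticket") has the shape `{"htsget": {"format": ..., "urls": [...]}}`, where each entry carries a `url` and optional `headers`, and that a `url` may be either an HTTPS link or an inline `data:` URI. The function names `fetch_block` and `assemble` and the example ticket are hypothetical.

```python
import base64
import urllib.request
from urllib.parse import unquote_to_bytes


def fetch_block(entry):
    """Return the bytes for one entry of the ticket's "urls" list."""
    url = entry["url"]
    if url.startswith("data:"):
        # Inline block: the payload follows the first comma of the data URI.
        meta, _, payload = url.partition(",")
        if meta.endswith(";base64"):
            return base64.b64decode(payload)
        return unquote_to_bytes(payload)
    # Remote block: forward any headers the server supplied for this URL.
    request = urllib.request.Request(url, headers=entry.get("headers", {}))
    with urllib.request.urlopen(request) as response:
        return response.read()


def assemble(ticket, fetch=fetch_block):
    """Concatenate all blocks in the order given by the server."""
    return b"".join(fetch(entry) for entry in ticket["htsget"]["urls"])


# Hypothetical ticket with two inline blocks, so the sketch runs offline.
example_ticket = {
    "htsget": {
        "format": "BAM",
        "urls": [
            {"url": "data:application/octet-stream;base64,"
                    + base64.b64encode(b"first-").decode()},
            {"url": "data:application/octet-stream;base64,"
                    + base64.b64encode(b"second").decode()},
        ],
    }
}
result = assemble(example_ticket)
```

Passing the `fetch` function as a parameter keeps the ordering logic separate from the transport, so a real client could swap in a download routine with retries or parallel fetching, as long as the blocks are still joined in ticket order.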

### Example Scenario

Suppose a researcher is interested in a specific gene on chromosome 12. Instead of downloading the entire BAM file containing sequencing data for a whole genome, the researcher can use htsget to request only the data for the specific region of chromosome 12. The htsget server will respond with URLs to the specific data segments, which the researcher can then download and analyze.
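
As a rough sketch of what such a request could look like, the snippet below builds the request URL for a region of chromosome 12. The server address `https://htsget.example.org`, the identifier `sample-001`, and the coordinates are placeholders, not real resources. The query parameters `format`, `referenceName`, `start`, and `end` and the `/reads/<id>` path follow the htsget specification as we understand it, with `start` 0-based inclusive and `end` exclusive. The reference name must match the naming used in the file (for example `12` versus `chr12`).

```python
from urllib.parse import quote, urlencode


def build_reads_url(base_url, read_id, reference_name, start, end, fmt="BAM"):
    """Build an htsget reads request URL for one genomic interval.

    start is 0-based inclusive; end is 0-based exclusive.
    """
    query = urlencode({
        "format": fmt,
        "referenceName": reference_name,
        "start": start,
        "end": end,
    })
    return f"{base_url.rstrip('/')}/reads/{quote(read_id)}?{query}"


# Placeholder server, identifier, and coordinates for illustration only.
url = build_reads_url(
    "https://htsget.example.org", "sample-001", "chr12", 1_000_000, 1_050_000
)
print(url)
```

Sending a GET request to this URL would return the JSON list of URLs described above, which the client then downloads and concatenates.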


### References
* [Original paper](https://doi.org/10.1093/bioinformatics/bty492)
* [Current version of the spec](https://samtools.github.io/hts-specs/htsget.html)
