Improve behavior of remote BAM file streaming credential guessing #45

Open
kvg opened this issue Dec 2, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

Collaborator

kvg commented Dec 2, 2024

For convenience, when running in Terra (app.terra.bio), we automatically try a few things to open a remote BAM file (e.g., renewing Google Cloud authentication tokens, overriding the cURL CA bundle, etc.). This works, but it looks pretty ugly, especially when the accesses are parallelized: every thread has to go through all the warning messages, which generates a ton of stderr noise that might confuse a user. For example:

[2024-12-02 06:22:55] Hidive version 0.1.95
[2024-12-02 06:22:55] Cli { command: Fetch { output: "/dev/stdout", loci: ["chr22:42,096,498-42,174,483|CYP2D6-CYP2D7"], padding: 500, seq_paths: ["gs://fc-1ee08173-e353-4494-ad28-7a3d7bd99734/resources/HPRC_grch38/HG00480.bam"] } }
[2024-12-02 06:22:55] Intermediate data will be stored at "/cromwell_root/tmp.zW7LSA".
[2024-12-02 06:22:55] Fetching data...
[E::easy_errno] Libcurl reported error 60 (SSL peer certificate or SSH remote key was not OK)
[E::hts_open_format] Failed to open file "gs://fc-1ee08173-e353-4494-ad28-7a3d7bd99734/resources/HPRC_grch38/HG00480.bam" : Input/output error
[2024-12-02 06:22:57] Read 'gs://fc-1ee08173-e353-4494-ad28-7a3d7bd99734/resources/HPRC_grch38/HG00480.bam', attempt 2 (reauthorizing to GCS)
[E::easy_errno] Libcurl reported error 60 (SSL peer certificate or SSH remote key was not OK)
[E::hts_open_format] Failed to open file "gs://fc-1ee08173-e353-4494-ad28-7a3d7bd99734/resources/HPRC_grch38/HG00480.bam" : Input/output error
[2024-12-02 06:22:59] Read 'gs://fc-1ee08173-e353-4494-ad28-7a3d7bd99734/resources/HPRC_grch38/HG00480.bam', attempt 3 (overriding cURL CA bundle)
[E::hts_open_format] Failed to open file "gs://fc-1ee08173-e353-4494-ad28-7a3d7bd99734/resources/HPRC_grch38/HG00480.bam" : No such file or directory
[E::hts_open_format] Failed to open file "gs://fc-1ee08173-e353-4494-ad28-7a3d7bd99734/resources/HPRC_grch38/HG00480.bam" : No such file or directory
[2024-12-02 06:22:59] Read 'gs://fc-1ee08173-e353-4494-ad28-7a3d7bd99734/resources/HPRC_grch38/HG00480.bam', attempt 2 (reauthorizing to GCS)
[E::hts_open_format] Failed to open file "gs://fc-1ee08173-e353-4494-ad28-7a3d7bd99734/resources/HPRC_grch38/HG00480.bam" : No such file or directory
[2024-12-02 06:23:01] Read 'gs://fc-1ee08173-e353-4494-ad28-7a3d7bd99734/resources/HPRC_grch38/HG00480.bam', attempt 3 (overriding cURL CA bundle)
[E::hts_open_format] Failed to open file "gs://fc-1ee08173-e353-4494-ad28-7a3d7bd99734/resources/HPRC_grch38/HG00480.bam" : No such file or directory
...

The problem arises in the following code:

pub fn open_bam(seqs_url: &Url) -> Result<IndexedReader> {
    if env::var("GCS_OAUTH_TOKEN").is_err() {
        gcs_authorize_data_access();
    }

    // Try to open the BAM file from the URL, with retries for authorization.
    let bam = match IndexedReader::from_url(seqs_url) {
        Ok(bam) => bam,
        Err(_) => {
            crate::elog!("Read '{}', attempt 2 (reauthorizing to GCS)", seqs_url);

            // If opening fails, try authorizing access to Google Cloud Storage.
            gcs_authorize_data_access();

            // Try opening the BAM file again.
            match IndexedReader::from_url(seqs_url) {
                Ok(bam) => bam,
                Err(_) => {
                    crate::elog!("Read '{}', attempt 3 (overriding cURL CA bundle)", seqs_url);

                    // If it still fails, guess the cURL CA bundle path.
                    local_guess_curl_ca_bundle();

                    // Try one last time to open the BAM file.
                    IndexedReader::from_url(seqs_url)?
                }
            }
        }
    };

    Ok(bam)
}

Each time we fail to access the file, we add another level of credential guessing. This is called within a block of code that retries remote file accesses with an exponential backoff, in case the problem is actually intermittent connectivity to the data rather than credentials or local configuration. It works for now, but it's ugly and brute-force.
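As a rough illustration of how the nested retries could be flattened, the sketch below treats each remediation as an entry in an ordered list and walks it in a loop. It is generic over the opener so that it stands alone; `open_with_fallbacks` and the closures passed to it are hypothetical names, not existing hidive or rust-htslib API. In hidive the opener would presumably wrap `IndexedReader::from_url(seqs_url)`, and the list would hold `gcs_authorize_data_access` and `local_guess_curl_ca_bundle`.

```rust
/// Try `open` once, then once more after each remediation in `fixes`, in order.
///
/// On success, returns the opened value together with the name of the
/// remediation that preceded the successful attempt (`None` if the very first
/// attempt worked). If every attempt fails, returns the last error seen.
fn open_with_fallbacks<T, E>(
    mut open: impl FnMut() -> Result<T, E>,
    fixes: &[(&'static str, &dyn Fn())],
) -> Result<(T, Option<&'static str>), E> {
    // First attempt, with no remediation applied.
    let mut last_err = match open() {
        Ok(value) => return Ok((value, None)),
        Err(e) => e,
    };

    // Apply one remediation at a time and retry after each.
    for (name, fix) in fixes {
        fix();
        match open() {
            Ok(value) => return Ok((value, Some(*name))),
            Err(e) => last_err = e,
        }
    }

    Err(last_err)
}
```

With this shape the caller can emit a single log line after the fact (for example, only when the returned name is `Some(..)`) instead of one line per attempt. It does not by itself silence the messages that htslib writes to stderr.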

We should improve this behavior. The messages are coming from htslib/rust-htslib, not hidive. So we could possibly capture and suppress stderr from htslib/rust-htslib. Or better yet, we could determine what credential renewals or environment configuration we need to make before trying to open the file, rather than relying on our current trial-and-error strategy.
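A minimal sketch of the "decide up front" idea follows. It inspects the URL scheme and the current environment and returns the list of steps to apply before the first open attempt. Everything here is an assumption made for illustration: `Prep` and `plan_prep` are hypothetical names, the set of remote schemes is a guess, and the sketch assumes that pointing cURL at an existing system CA bundle is harmless when none is configured.

```rust
use std::path::Path;

/// A configuration step to apply before the first open attempt (hypothetical).
#[derive(Debug, PartialEq)]
enum Prep {
    /// Obtain or renew a Google Cloud Storage token.
    AuthorizeGcs,
    /// Point cURL at the CA bundle found at this path.
    SetCaBundle(String),
}

/// Decide which steps a URL needs, given the relevant environment values.
///
/// `gcs_token` and `ca_bundle` are the current values of the token and CA
/// bundle settings (passed in rather than read here, so the logic is testable).
/// `ca_candidates` is a caller-supplied list of bundle paths to probe.
fn plan_prep(
    url: &str,
    gcs_token: Option<&str>,
    ca_bundle: Option<&str>,
    ca_candidates: &[&str],
) -> Vec<Prep> {
    let mut steps = Vec::new();

    // Only gs:// URLs need a GCS token; a missing or empty token means we must authorize.
    if url.starts_with("gs://") && gcs_token.map_or(true, str::is_empty) {
        steps.push(Prep::AuthorizeGcs);
    }

    // Local paths need no TLS setup. The scheme list is an assumption.
    let remote = ["gs://", "s3://", "https://"]
        .iter()
        .any(|scheme| url.starts_with(scheme));

    // A configured bundle only counts if the file actually exists.
    let bundle_ok = ca_bundle.map_or(false, |p| Path::new(p).is_file());

    if remote && !bundle_ok {
        // Pick the first candidate that exists on this machine, if any.
        if let Some(found) = ca_candidates.iter().copied().find(|p| Path::new(p).is_file()) {
            steps.push(Prep::SetCaBundle(found.to_string()));
        }
    }

    steps
}
```

A caller might pass `std::env::var("GCS_OAUTH_TOKEN").ok().as_deref()` for the token and, assuming `CURL_CA_BUNDLE` is the variable that `local_guess_curl_ca_bundle` sets, the value of that variable for the bundle. The resulting steps could then be applied once per process rather than once per thread, which would also cut down on the repeated messages.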

@kvg kvg added the enhancement New feature or request label Dec 2, 2024