Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Critical: Implement support for Terra Google Requester Pays #42

Closed
mbaumann-broad opened this issue Nov 7, 2023 · 5 comments · Fixed by #41
Closed

Critical: Implement support for Terra Google Requester Pays #42

mbaumann-broad opened this issue Nov 7, 2023 · 5 comments · Fixed by #41
Labels
1 First scheduled issue(s) high priority Should be completed ASAP

Comments

@mbaumann-broad
Copy link
Collaborator

mbaumann-broad commented Nov 7, 2023

Description

A key reason for the creation of the AnVIL drs_downloader is to ensure that AnVIL data downloads (out of the cloud, to local/institutional systems) are paid for by the user/downloader. Google Cloud Platform supports requester pays functionality, and the Terra Data Repository (TDR) has been enhanced to support the use of requester pays with AnVIL DRS signed URL data access.
Note: Azure cloud does not (yet) support requester pays functionality. Broad wants Microsoft to add support for that.

A brief list of the key changes is as follows:

  • For the drs_downloader for Terra, add a command-line option for the Terra Workspace Google project ID to be charged for the data access. For example:
    drs_downloader terra --user-project my-project-id
  • If the input DRS URIs include AnVIL DRS URIs, require that that command-line option above be provided
  • When requesting the signed URL for an AnVIL DRS URI (only), include the value of the command-line option (e.g., my-project-id) as the value for the header x-user-project. For example, add the header:
    x-user-project: terra-3b325c63

Implementation

Key detailed technical information required to implement this feature is provided in the following previously shared documents:

Acceptance Testing

Acceptance testing must include all of the following:

  • If and only if AnVIL DRS URIs are provided as input, then the --user-project command-line option and value are required. In this case, if the option is not provided, a clear error message is provided to the user regarding the error and how to correct it. This check should be performed early in the execution of the drs_downloader, before actual downloading begins.
  • When downloading (specifically when requesting the signed URL for) AnVIL DRS URIs (only), the x-user-project header with the associated Google project id value is included in the request to DRSHub. The correct operation may be verified by checking the returned signed URL for the query parameter userProject having the value provided on the command line.
  • Charges for AnVIL data downloaded can be verified by checking Google Billing information for the expected charges (I can help with this verification if necessary)
  • Errors resulting from the user providing a Google project ID that is not a Terra Workspace Google project ID are handled gracefully, with a clear error message to the user, including information about what is wrong and how to correct it.
@matthewpeterkort
Copy link
Collaborator

"The correct operation may be verified by checking the returned signed URL for the query parameter userProject having the value provided on the command line." Do you want this to be added into the codebase? Is this a way to validate the project id?

On branch feature/testing-dev to see new requestor pays code:

x-user-project headers are passed here:
https://github.com/anvilproject/drs_downloader/blob/feature/testing-dev/drs_downloader/clients/terra.py#L142

If there is a way to validate the project id using the terra DRSHUB backend, then many of the issues here and discussed below can be resolved. Currently the project_id is "verified" when DRSHUB throws an error when attempting to download data without a valid project id. Not an ideal solution.

New tests that address new commits can be found at:
tests/integration/test_payers.py, specifically:

test_terra_bad_project_id

(venv) peterkor@RNB11238 drs_downloader % drs_download terra -d DATA -m tests/fixtures/terra-data.tsv --user-project terra-abcdefgh
2023-11-07T11:59:11 PST Downloading to: /Users/peterkor/Desktop/downloadger/drs_downloader/DATA
2023-11-07T11:59:11 PST gcloud token successfully fetched
retrieving get_object information: 9it [00:00, 123361.88it/s]
2023-11-07T11:59:18 PST Total download size is 10 MB
2023-11-07T11:59:18 PST Estimated download cost is $0.1
retrieving sign_url information: 9it [00:00, 209715.20it/s]
2023-11-07T11:59:20 PST 0/9 files have downloaded successfully
2023-11-07T11:59:20 PST ('CCDG_13607_Project_CCDG_13607_B01_GRM_WGS.cram.2019-02-06_Sample_NA19062_analysis_NA19062.final.cram.crai', 'ERROR', 1267131, 2, ['User project specified in --user-project option is invalid'])
2023-11-07T11:59:20 PST  ... 

shows the scenario where an entire manifest of anVIL drs_uris are given with a invalid project id:
Every object hits:
https://github.com/anvilproject/drs_downloader/blob/feature/testing-dev/drs_downloader/clients/terra.py#L79
when attempting to begin downloading the file and then gracefully exits, since Terra DRSHUB told the client that the project-id is no good.

test_terra_bad_project_id_mixed_data
Shows a very similar example except this time non anVIL provider pays data is also being downloaded in conjunction with the AnVIL data so the provider pays data is downloaded and the requestor pays data hits the same error state described above.

Would you rather that no download begins if 1 anVIL download object is attempting to download without a valid project id?
If so, there will probably need to be some way to validate all requestor pays project ids before starting downloads. Not sure how to do that right now.

test_no_project_id_specified_mixed_data
This third test tests when user attempts to download mixed data and provides no project id at all.
Similar result, data that doesn't require a project id gets downloaded other data exits gracefully. Same questions apply^^

@lbeckman314 lbeckman314 added high priority Should be completed ASAP 1 First scheduled issue(s) labels Nov 7, 2023
@mbaumann-broad
Copy link
Collaborator Author

Regarding:

"The correct operation may be verified by checking the returned signed URL for the query parameter userProject having the value provided on the command line."
Do you want this to be added into the codebase?
Is this a way to validate the project id?

Good question. Ideally, incorporating this check into the codebase would not be necessary. Yet, the signed URL is not exposed outside of the drs_downloader to test it externally.

It is very important to verify somehow.

Logging signed URLs is a bad practice for security reasons.

A Python assert could be used, yet I think it is best to leave assertions enabled in production releases to facilitate problem diagnosis if needed.

I am open to creative ideas for how to test this, yet time box the effort invested.

If nothing else, the check could be incorporated into the codebase. It is important and quick to test.

@mbaumann-broad
Copy link
Collaborator Author

mbaumann-broad commented Nov 8, 2023

Regarding:

If there is a way to validate the project id using the terra DRSHUB backend, then many of the issues here and discussed below can be resolved. Currently, the project_id is "verified" when DRSHUB throws an error when attempting to download data without a valid project id. Not an ideal solution.

Good question. I expect there is some worthwhile validation that can be done early on. This will require some investigation, so I opened the following issue to look into it: #45

@matthewpeterkort
Copy link
Collaborator

Good question. Ideally, incorporating this check into the code base would not be necessary. Yet, the signed URL is not exposed outside of the drs_downloader to test it externally.

Currently the signed url is being logged to the drs_downloader.log file for debugging purposes. Should this be removed?

Could look into adding signed url checking/validation but is there any situation where an unruly/false signed url can be generated by DRSHUB without hitting some sort of exception message beforehand?

@matthewpeterkort
Copy link
Collaborator

Regarding:

If there is a way to validate the project id using the terra DRSHUB backend, then many of the issues here and discussed below can be resolved. Currently, the project_id is "verified" when DRSHUB throws an error when attempting to download data without a valid project id. Not an ideal solution.

Good question. I expect there is some worthwhile validation that can be done early on. This will require some investigation, so I opened the following issue to look into it: #45

Thank you for looking into this

@lbeckman314 lbeckman314 linked a pull request Nov 15, 2023 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1 First scheduled issue(s) high priority Should be completed ASAP
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants