-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds Feature/manager #15
Conversation
69c76c5
to
828de8c
Compare
So from what I can understand from this, this PR is meant to lay out a structure to the code so that it is easier modularization for moving between Terra, Gen3, and whatever other client. This new structure needs to be fleshed out and completed so that has more than just mock functionality |
# This is the 1st commit message: replaced pandas with csv, added optimizer, added download sort by file size, cleaned up math.ciel, removed some TQDMs that were not neccesary, restructured signing and downloading into a batching format # This is the commit message #2: Liam's working branch, machine transfer
6000e16
to
71508d9
Compare
…e size, cleaned up math.ciel, removed some TQDMs that were not neccesary, restructured signing and downloading into a batching format Update requirements
71508d9
to
1d903b7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few initial comments ... many more to come.
drs_downloader/manager.py
Outdated
drs_object.file_parts = paths | ||
|
||
i = 1 | ||
filename = f"{drs_object.name}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the DRS specification, the name
field is not required and may be null.
TNU uses the value of the name
field if it is not null, and if it is null, it uses the last element of the key path (which is the actual filename) as can be seen here:
https://github.com/DataBiosphere/terra-notebook-utils/blob/b53bb8656d502ecbdbfe9c5edde3fa25bd90bbf8/terra_notebook_utils/drs.py#L248
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will address today in a change to this PR, including test.
drs_downloader/cli.py
Outdated
|
||
@cli.command() | ||
@click.option("--silent", "-s", is_flag=True, show_default=True, default=False, help="Display nothing.") | ||
@click.option("--destination_dir", "-d", show_default=True, default='/tmp/testing', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the separator in option identifiers should be changed from _
to -
.
I didn't see the specified one way or the other in the specification I looked at (may have overlooked it).
Yet, it is the common convention for all the CLI commands I recall using, for example, the GNU/Linux cp
command (here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will address today in a change to this PR, including test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drs_downloader/cli.py
Outdated
help="Destination directory.") | ||
@click.option("--manifest_path", "-m", show_default=True, | ||
help="Path to manifest tsv.") | ||
@click.option('--drs_header', default='ga4gh_drs_uri', show_default=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose changing --drs_header
to --drs-column-name
as I think it is clearer and easier to understand ("header" has multiple meanings when working with HTTP/DRS).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will address today in a change to this PR, including test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
. venv/bin/activate | ||
pip install -r requirements.txt -r requirements-dev.txt | ||
``` | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a step is missing here.
After running the commands above then running
pytest --cov=tests
pytest reported many errors.
I then ran the following from the top-level repo directory:
pip install -e .
then pytest ran much better.
I am not sure if pip install -e .
is the right/best command to run here, yet it seems like there is a missing step in the insturctions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will address today in a change to this PR, (change to README)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drs_downloader/cli.py
Outdated
logger.error((drs_object.name, 'ERROR', drs_object.size, len(drs_object.file_parts), drs_object.errors)) | ||
at_least_one_error = True | ||
if at_least_one_error: | ||
exit(99) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend returning the value 1
in the case of general errors.
Although any positive exit code indicates errors, in my experience, the value 1
is most commonly used to indicate general failure.
Additional positive values may be used to indicate specific types of errors, yet IMO that is not needed for this tool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will address today in a change to this PR, including test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Team, you have done a lot of great work on the drs_downloader
!
Please address the comments and requested changes.
Many are straightforward and easily addressed, and some will require discussion/collaboration.
Thank you!
drs_downloader/cli.py
Outdated
" or the URI header matches the uri header name in the TSV file that was specified") | ||
|
||
for url in uris: | ||
if '/' in url: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose using a stronger test, for example:
if url.startswith('drs://'):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will address today in a change to this PR, including test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drs_downloader/cli.py
Outdated
@cli.command() | ||
@click.option("--silent", "-s", is_flag=True, show_default=True, default=False, help="Display nothing.") | ||
@click.option("--destination_dir", "-d", show_default=True, | ||
default="/tmp/testing", help="Destination directory.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"/tmp/testing" is not an appropriate default dir.
The current working directory is probably and appropriate default.
The download directory should also be output for users to see before the download begins.
This comment applies to the other places the default destination appears, with the possible exception of the mock client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will address today in a change to this PR, including test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
def _perform_downloads(destination_dir, drs_client, ids_from_manifest, silent): | ||
"""Common helper method to run downloads.""" | ||
# verify parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Provide user output with the path to the destination directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbaumann-broad : just double checking on this one. Do you mean logging an INFO message with the full path name of the destination file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bwalsh
If the current user output looks like this:
$ drs_downloader terra -m tests/fixtures/terra-data.tsv -d DATA
100%|████████████████████████████████| 10/10 [00:00<00:00, 56148.65it/s]
2022-11-21 16:56:49,595 ('HG03873.final.cram.crai', 'OK', 1351946, 1)
2022-11-21 16:56:49,595 ('HG04209.final.cram.crai', 'OK', 1338980, 1)
2022-11-21 16:56:49,595 ('HG02142.final.cram.crai', 'OK', 1405543, 1)
...
It would be helpful to output the full path to the destination directory early on.
Something like this:
$ drs_downloader terra -m tests/fixtures/terra-data.tsv -d DATA
Downloading to: /home/mbaumann/anvil/DATA
100%|████████████████████████████████| 10/10 [00:00<00:00, 56148.65it/s]
2022-11-21 16:56:49,595 ('HG03873.final.cram.crai', 'OK', 1351946, 1)
2022-11-21 16:56:49,595 ('HG04209.final.cram.crai', 'OK', 1338980, 1)
2022-11-21 16:56:49,595 ('HG02142.final.cram.crai', 'OK', 1405543, 1)
...
This helps confirm to users that the data is being downloaded to their intended destination, and (hopefully) help ensure enough space will be available on the storage volume for the download to complete successfully, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbeckman314, two corrections needed:
- The intention was for the "Downloading to: ..." message to be displayed where the user can see it, not written to a log file.
- The full/absolute path should be displayed, not the just a relative path as the user may have provided as the value of the
-d
option
Please correct these.
drs_downloader/clients/terra.py
Outdated
checksums=[Checksum(checksum=md5_, type='md5')], | ||
id=object_id, | ||
name=name_, | ||
access_methods=[AccessMethod(access_url="", type='gs')] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The signed URL will not necessarily be for "gs"/Google.
For AnVIL data it will be Google in early 2023, then transitioning to Azure.
Today, Kids First data and NCI CRDC Proteomics Data Commons (PDC) data, which can be accessed via Terra DRS, is on AWS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbaumann-broad : understand and agree. We do need to discuss / test how / where the downloader will make this decision. Creating issue. #18
drs_downloader/manager.py
Outdated
|
||
drs_objects = [] | ||
|
||
total_batches = len(object_ids) / self.max_simultaneous_object_retrievers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code pattern to determine the number of batches based on a number of elements and batch size appears in multiple places in the drs_downloader code.
I suggest making a utility function that does that, and using it everywhere the total number of matches needs to be determined.
Or, just implement using a single line using math.ceil as is done elsewhere like this:
total=math.ceil(len(parts) / self.max_simultaneous_part_handlers)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will address with a test in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drs_downloader/clients/terra.py
Outdated
""" | ||
data = { | ||
"url": object_id, | ||
"fields": ["fileName", "size", "hashes", "accessUrl"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The drs_downloader code appears to get the "accessUrl"(/signed URL) again before actually downloading the data. If that is the case, the "accessUrl" should not be requested here in get_object
.
This is because the other fields ("fileName", "size", "hashes") can be obtained from the first request to the DRS server, yet getting the "accessUrl" requires a second request to the DRS server and performing the relatively expensive operation of signing the URL.
If the "accessUrl" obtained here is never actually used, then you should just request the other fields here and not "accessUrl".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will address with a test in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise Exception( | ||
f"A valid URL was not returned from the server. Please check the access for {account}\n{resp}") | ||
url_ = resp['accessUrl']['url'] | ||
drs_object.access_methods = [AccessMethod(access_url=url_, type='gs')] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As also noted in a subsequent comment ...
The signed URL will not necessarily be for "gs"/Google.
For AnVIL data it will be Google in early 2023, then transitioning to Azure.
Today, Kids First data and NCI CRDC Proteomics Data Commons (PDC) data, which can be accessed via Terra DRS, is on AWS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating issue #18
drs_downloader/manager.py
Outdated
|
||
elif any(drs_object.size > (1 * GB) for drs_object in drs_objects): | ||
self.max_simultaneous_part_handlers = 10 | ||
self.part_size = 10 * MB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The part_size for large files should be at least 64 MB and probably 128 MB (or even much larger).
Getm is throughly tested with large files and uses a default chunk size of 128 MB.
Also, FYI, for getm, the chunk size is the size of each read from the stream, it does not perform a new HTTP request for each chunk as drs_downloader does for each part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will address with a test in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drs_downloader/manager.py
Outdated
parts.append((start, size, )) | ||
|
||
if len(parts) > 1000: | ||
logger.error(f'tasks > 1000 {drs_object.name} has over 1000 parts, consider optimization. ({len(parts)})') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if the part size for large files is increased to 128 MB (as commented elsewhere), this code would log an error for any file over 128 GB. A substantial number of NIH genomic data files are in the 500 GB to 750 GB range, and downloading these should not result in an error.
I am inclined to drs_downloader should be tested on files up to 1 TB (not regularly, yet at enough to verify that it works successfully).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbaumann-broad Agreed re. testing over 1TB. Question re. this log message though. Should we not log anything? Log a warning instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, logging as a warning is fine/good.
In generally, logging as warnings conditions that are unexpected/unusual yet not necessarily problematic makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
await f for f in asyncio.as_completed(tasks) | ||
] | ||
|
||
# second, download the parts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for really large files (many tens or hundreds of GB), the signed URL will expire before all of the parts have at least started downloading. We can test this to verify my understanding based on reading the code, yet I am pretty sure that is true.
The TDR DRS signed URLs expire in 15 minutes, and Gen3 DRS signed URLs expire in 60 minutes, yet I am pretty sure that for file sizes in the range of hundreds of GB the signed URL expiration would occur even with the 60-minute timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbaumann-broad Do we know what the http status will be in this scenario? Address in separate PR #19
914c8a4
to
338943f
Compare
Update tests based on Michael's review Fix build file
34f905f
to
25dfcb0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments on recent updates.
I'm still reviewing ...
|
||
def _perform_downloads(destination_dir, drs_client, ids_from_manifest, silent): | ||
"""Common helper method to run downloads.""" | ||
# verify parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbeckman314, two corrections needed:
- The intention was for the "Downloading to: ..." message to be displayed where the user can see it, not written to a log file.
- The full/absolute path should be displayed, not the just a relative path as the user may have provided as the value of the
-d
option
Please correct these.
drs_downloader/manager.py
Outdated
stdout_handler = logging.StreamHandler(sys.stdout) | ||
stdout_handler.setLevel(logging.DEBUG) | ||
|
||
file_handler = logging.FileHandler('logs.log') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use a more descriptive name for the log file, such as <program_name>.log
For example: drs_downloader.log
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed (drs_downloader.log)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
README.md
Outdated
@@ -108,7 +108,7 @@ Usage: drs_download terra [OPTIONS] | |||
|
|||
Options: | |||
-s, --silent Display nothing. | |||
-d, --destination_dir TEXT Destination directory. [default: /tmp/testing] | |||
-d, --destination_dir TEXT Destination directory. [default: os.getcwd()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many users will not know what os.getcwd()
means.
Instead, please use:
[default: current directory]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drs_downloader/clients/terra.py
Outdated
checksums=[Checksum(checksum=md5_, type='md5')], | ||
id=object_id, | ||
name=name_, | ||
access_methods=[AccessMethod(access_url="", type='gs')] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the following line still necessary here, given the "accessUrl" is no longer being retrieved at this point?
access_methods=[AccessMethod(access_url="", type='gs')]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I finished looking over the recent updates and left some comments regarding these.
({len(parts)})') | ||
|
||
paths = [] | ||
# TODO - tqdm ugly here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the "tqdm ugly here?" refer to all the flashing displayed in the user's terminal as the parts are downloaded?
If so, that is indeed ugly - especially for small part sizes!
As it stands now, the behavior for small files and small part sizes is unbearable.
Perhaps write the part download status information to the log instead?
Keeping this may make more sense if the minimum part size was larger (e.g, 128-512mb).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just created an issue for this #21
drs_downloader/manager.py
Outdated
parts.append((start, size, )) | ||
|
||
if len(parts) > 1000: | ||
logger.error(f'tasks > 1000 {drs_object.name} has over 1000 parts, consider optimization. ({len(parts)})') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this still being logged at error level? We agreed this should be logged as a warning:
logger.error(f'tasks > 1000 {drs_object.name} has over 1000 parts, consider optimization. ({len(parts)})')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR:
model
clients
andmanager
mock
client that tests the manager, etc without requiring manifest or server