Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds Feature/manager #15

Merged
merged 25 commits into from
Dec 21, 2022
Merged

Adds Feature/manager #15

merged 25 commits into from
Dec 21, 2022

Conversation

bwalsh
Copy link
Collaborator

@bwalsh bwalsh commented Nov 11, 2022

This PR:

  • refactors methods into a set of classes - see model clients and manager
  • adds simple cli commands to ensure that test fixtures run without error
  • includes a mock client that tests the manager, etc without requiring manifest or server

image

Base automatically changed from feature/flake8 to main November 11, 2022 20:21
@matthewpeterkort
Copy link
Collaborator

So from what I can understand from this, this PR is meant to lay out a structure to the code so that it is easier modularization for moving between Terra, Gen3, and whatever other client. This new structure needs to be fleshed out and completed so that has more than just mock functionality

drs_downloader/manager.py Outdated Show resolved Hide resolved
drs_downloader/manager.py Outdated Show resolved Hide resolved
drs_downloader/manager.py Outdated Show resolved Hide resolved
.github/workflows/python-app.yml Outdated Show resolved Hide resolved
# This is the 1st commit message:

replaced pandas with csv, added optimizer, added download sort by file size, cleaned up math.ciel, removed some TQDMs that were not neccesary, restructured signing and downloading into a batching format

# This is the commit message #2:

Liam's working branch, machine transfer
…e size, cleaned up math.ciel, removed some TQDMs that were not neccesary, restructured signing and downloading into a batching format

Update requirements
drs_downloader/cli.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@mbaumann-broad mbaumann-broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few initial comments ... many more to come.

drs_object.file_parts = paths

i = 1
filename = f"{drs_object.name}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the DRS specification, the name field is not required and may be null.

TNU uses the value of the name field if it is not null, and if it is null, it uses the last element of the key path (which is the actual filename) as can be seen here:
https://github.com/DataBiosphere/terra-notebook-utils/blob/b53bb8656d502ecbdbfe9c5edde3fa25bd90bbf8/terra_notebook_utils/drs.py#L248

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will address today in a change to this PR, including test.


@cli.command()
@click.option("--silent", "-s", is_flag=True, show_default=True, default=False, help="Display nothing.")
@click.option("--destination_dir", "-d", show_default=True, default='/tmp/testing',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the separator in option identifiers should be changed from _ to -.

I didn't see the specified one way or the other in the specification I looked at (may have overlooked it).

Yet, it is the common convention for all the CLI commands I recall using, for example, the GNU/Linux cp command (here).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will address today in a change to this PR, including test.

Copy link
Collaborator

@lbeckman314 lbeckman314 Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

help="Destination directory.")
@click.option("--manifest_path", "-m", show_default=True,
help="Path to manifest tsv.")
@click.option('--drs_header', default='ga4gh_drs_uri', show_default=True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose changing --drs_header to --drs-column-name as I think it is clearer and easier to understand ("header" has multiple meanings when working with HTTP/DRS).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will address today in a change to this PR, including test.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

. venv/bin/activate
pip install -r requirements.txt -r requirements-dev.txt
```

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a step is missing here.
After running the commands above then running

pytest --cov=tests

pytest reported many errors.

I then ran the following from the top-level repo directory:

pip install -e .

then pytest ran much better.

I am not sure if pip install -e . is the right/best command to run here, yet it seems like there is a missing step in the insturctions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will address today in a change to this PR, (change to README)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logger.error((drs_object.name, 'ERROR', drs_object.size, len(drs_object.file_parts), drs_object.errors))
at_least_one_error = True
if at_least_one_error:
exit(99)
Copy link
Collaborator

@mbaumann-broad mbaumann-broad Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend returning the value 1 in the case of general errors.
Although any positive exit code indicates errors, in my experience, the value 1 is most commonly used to indicate general failure.

Additional positive values may be used to indicate specific types of errors, yet IMO that is not needed for this tool.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will address today in a change to this PR, including test.

Copy link
Collaborator

@lbeckman314 lbeckman314 Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@mbaumann-broad mbaumann-broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Team, you have done a lot of great work on the drs_downloader!

Please address the comments and requested changes.
Many are straightforward and easily addressed, and some will require discussion/collaboration.

Thank you!

" or the URI header matches the uri header name in the TSV file that was specified")

for url in uris:
if '/' in url:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose using a stronger test, for example:
if url.startswith('drs://'):

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will address today in a change to this PR, including test.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cli.command()
@click.option("--silent", "-s", is_flag=True, show_default=True, default=False, help="Display nothing.")
@click.option("--destination_dir", "-d", show_default=True,
default="/tmp/testing", help="Destination directory.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"/tmp/testing" is not an appropriate default dir.
The current working directory is probably and appropriate default.
The download directory should also be output for users to see before the download begins.

This comment applies to the other places the default destination appears, with the possible exception of the mock client.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will address today in a change to this PR, including test.

Copy link
Collaborator

@lbeckman314 lbeckman314 Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


def _perform_downloads(destination_dir, drs_client, ids_from_manifest, silent):
"""Common helper method to run downloads."""
# verify parameters
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provide user output with the path to the destination directory.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbaumann-broad : just double checking on this one. Do you mean logging an INFO message with the full path name of the destination file?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwalsh
If the current user output looks like this:

$ drs_downloader terra -m tests/fixtures/terra-data.tsv -d DATA
100%|████████████████████████████████| 10/10 [00:00<00:00, 56148.65it/s]

2022-11-21 16:56:49,595 ('HG03873.final.cram.crai', 'OK', 1351946, 1)
2022-11-21 16:56:49,595 ('HG04209.final.cram.crai', 'OK', 1338980, 1)
2022-11-21 16:56:49,595 ('HG02142.final.cram.crai', 'OK', 1405543, 1)
...

It would be helpful to output the full path to the destination directory early on.
Something like this:

$ drs_downloader terra -m tests/fixtures/terra-data.tsv -d DATA
Downloading to: /home/mbaumann/anvil/DATA
100%|████████████████████████████████| 10/10 [00:00<00:00, 56148.65it/s]

2022-11-21 16:56:49,595 ('HG03873.final.cram.crai', 'OK', 1351946, 1)
2022-11-21 16:56:49,595 ('HG04209.final.cram.crai', 'OK', 1338980, 1)
2022-11-21 16:56:49,595 ('HG02142.final.cram.crai', 'OK', 1405543, 1)
...

This helps confirm to users that the data is being downloaded to their intended destination, and (hopefully) help ensure enough space will be available on the storage volume for the download to complete successfully, etc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbeckman314, two corrections needed:

  • The intention was for the "Downloading to: ..." message to be displayed where the user can see it, not written to a log file.
  • The full/absolute path should be displayed, not the just a relative path as the user may have provided as the value of the -d option

Please correct these.

checksums=[Checksum(checksum=md5_, type='md5')],
id=object_id,
name=name_,
access_methods=[AccessMethod(access_url="", type='gs')]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The signed URL will not necessarily be for "gs"/Google.
For AnVIL data it will be Google in early 2023, then transitioning to Azure.
Today, Kids First data and NCI CRDC Proteomics Data Commons (PDC) data, which can be accessed via Terra DRS, is on AWS.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbaumann-broad : understand and agree. We do need to discuss / test how / where the downloader will make this decision. Creating issue. #18


drs_objects = []

total_batches = len(object_ids) / self.max_simultaneous_object_retrievers
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code pattern to determine the number of batches based on a number of elements and batch size appears in multiple places in the drs_downloader code.
I suggest making a utility function that does that, and using it everywhere the total number of matches needs to be determined.
Or, just implement using a single line using math.ceil as is done elsewhere like this:

total=math.ceil(len(parts) / self.max_simultaneous_part_handlers)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will address with a test in this PR

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"""
data = {
"url": object_id,
"fields": ["fileName", "size", "hashes", "accessUrl"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The drs_downloader code appears to get the "accessUrl"(/signed URL) again before actually downloading the data. If that is the case, the "accessUrl" should not be requested here in get_object.
This is because the other fields ("fileName", "size", "hashes") can be obtained from the first request to the DRS server, yet getting the "accessUrl" requires a second request to the DRS server and performing the relatively expensive operation of signing the URL.
If the "accessUrl" obtained here is never actually used, then you should just request the other fields here and not "accessUrl".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will address with a test in this PR

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise Exception(
f"A valid URL was not returned from the server. Please check the access for {account}\n{resp}")
url_ = resp['accessUrl']['url']
drs_object.access_methods = [AccessMethod(access_url=url_, type='gs')]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As also noted in a subsequent comment ...
The signed URL will not necessarily be for "gs"/Google.
For AnVIL data it will be Google in early 2023, then transitioning to Azure.
Today, Kids First data and NCI CRDC Proteomics Data Commons (PDC) data, which can be accessed via Terra DRS, is on AWS.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating issue #18


elif any(drs_object.size > (1 * GB) for drs_object in drs_objects):
self.max_simultaneous_part_handlers = 10
self.part_size = 10 * MB
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The part_size for large files should be at least 64 MB and probably 128 MB (or even much larger).
Getm is throughly tested with large files and uses a default chunk size of 128 MB.
Also, FYI, for getm, the chunk size is the size of each read from the stream, it does not perform a new HTTP request for each chunk as drs_downloader does for each part.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will address with a test in this PR

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parts.append((start, size, ))

if len(parts) > 1000:
logger.error(f'tasks > 1000 {drs_object.name} has over 1000 parts, consider optimization. ({len(parts)})')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if the part size for large files is increased to 128 MB (as commented elsewhere), this code would log an error for any file over 128 GB. A substantial number of NIH genomic data files are in the 500 GB to 750 GB range, and downloading these should not result in an error.
I am inclined to drs_downloader should be tested on files up to 1 TB (not regularly, yet at enough to verify that it works successfully).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbaumann-broad Agreed re. testing over 1TB. Question re. this log message though. Should we not log anything? Log a warning instead?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, logging as a warning is fine/good.
In generally, logging as warnings conditions that are unexpected/unusual yet not necessarily problematic makes sense.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

await f for f in asyncio.as_completed(tasks)
]

# second, download the parts
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for really large files (many tens or hundreds of GB), the signed URL will expire before all of the parts have at least started downloading. We can test this to verify my understanding based on reading the code, yet I am pretty sure that is true.

The TDR DRS signed URLs expire in 15 minutes, and Gen3 DRS signed URLs expire in 60 minutes, yet I am pretty sure that for file sizes in the range of hundreds of GB the signed URL expiration would occur even with the 60-minute timeout.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbaumann-broad Do we know what the http status will be in this scenario? Address in separate PR #19

@lbeckman314 lbeckman314 self-requested a review December 19, 2022 19:02
Update tests based on Michael's review

Fix build file
Copy link
Collaborator

@mbaumann-broad mbaumann-broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on recent updates.
I'm still reviewing ...


def _perform_downloads(destination_dir, drs_client, ids_from_manifest, silent):
"""Common helper method to run downloads."""
# verify parameters
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbeckman314, two corrections needed:

  • The intention was for the "Downloading to: ..." message to be displayed where the user can see it, not written to a log file.
  • The full/absolute path should be displayed, not the just a relative path as the user may have provided as the value of the -d option

Please correct these.

stdout_handler = logging.StreamHandler(sys.stdout)
stdout_handler.setLevel(logging.DEBUG)

file_handler = logging.FileHandler('logs.log')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use a more descriptive name for the log file, such as <program_name>.log
For example: drs_downloader.log

Copy link
Collaborator

@lbeckman314 lbeckman314 Dec 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed (drs_downloader.log)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed and tested ("Downloading to...")

README.md Outdated
@@ -108,7 +108,7 @@ Usage: drs_download terra [OPTIONS]

Options:
-s, --silent Display nothing.
-d, --destination_dir TEXT Destination directory. [default: /tmp/testing]
-d, --destination_dir TEXT Destination directory. [default: os.getcwd()]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many users will not know what os.getcwd() means.
Instead, please use:
[default: current directory]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checksums=[Checksum(checksum=md5_, type='md5')],
id=object_id,
name=name_,
access_methods=[AccessMethod(access_url="", type='gs')]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the following line still necessary here, given the "accessUrl" is no longer being retrieved at this point?

access_methods=[AccessMethod(access_url="", type='gs')]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@mbaumann-broad mbaumann-broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finished looking over the recent updates and left some comments regarding these.

({len(parts)})')

paths = []
# TODO - tqdm ugly here?
Copy link
Collaborator

@mbaumann-broad mbaumann-broad Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the "tqdm ugly here?" refer to all the flashing displayed in the user's terminal as the parts are downloaded?
If so, that is indeed ugly - especially for small part sizes!
As it stands now, the behavior for small files and small part sizes is unbearable.
Perhaps write the part download status information to the log instead?

Keeping this may make more sense if the minimum part size was larger (e.g, 128-512mb).

Copy link
Collaborator

@lbeckman314 lbeckman314 Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just created an issue for this #21

parts.append((start, size, ))

if len(parts) > 1000:
logger.error(f'tasks > 1000 {drs_object.name} has over 1000 parts, consider optimization. ({len(parts)})')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this still being logged at error level? We agreed this should be logged as a warning:

logger.error(f'tasks > 1000 {drs_object.name} has over 1000 parts, consider optimization. ({len(parts)})')

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbeckman314 lbeckman314 merged commit 9b997d3 into main Dec 21, 2022
@lbeckman314 lbeckman314 deleted the feature/manager branch December 21, 2022 19:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants