Adds Feature/manager #15

bwalsh · 2022-11-11T02:16:41Z

This PR:

refactors methods into a set of classes - see model clients and manager
adds simple cli commands to ensure that test fixtures run without error
includes a mock client that tests the manager, etc without requiring manifest or server

drs_downloader/cli.py

matthewpeterkort · 2022-11-11T23:07:10Z

From what I can understand, this PR is meant to lay out a structure for the code so that it is easier to modularize when moving between Terra, Gen3, and whatever other client. This new structure needs to be fleshed out and completed so that it has more than just mock functionality.

drs_downloader/cli.py

drs_downloader/manager.py

drs_downloader/cli.py

drs_downloader/manager.py

drs_downloader/clients/terra.py

drs_downloader/manager.py

.github/workflows/python-app.yml

# This is the 1st commit message:
replaced pandas with csv, added optimizer, added download sort by file size, cleaned up math.ceil, removed some TQDMs that were not necessary, restructured signing and downloading into a batching format
# This is the commit message #2:
Liam's working branch, machine transfer

…e size, cleaned up math.ceil, removed some TQDMs that were not necessary, restructured signing and downloading into a batching format Update requirements

drs_downloader/cli.py

mbaumann-broad

A few initial comments ... many more to come.

mbaumann-broad · 2022-12-16T01:22:55Z

drs_downloader/manager.py

+        drs_object.file_parts = paths
+
+        i = 1
+        filename = f"{drs_object.name}"


According to the DRS specification, the name field is not required and may be null.

TNU uses the value of the name field if it is not null, and if it is null, it uses the last element of the key path (which is the actual filename) as can be seen here:
https://github.com/DataBiosphere/terra-notebook-utils/blob/b53bb8656d502ecbdbfe9c5edde3fa25bd90bbf8/terra_notebook_utils/drs.py#L248

Agreed. Will address today in a change to this PR, including test.
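For illustration, the fallback described above could look roughly like the sketch below. The function name `resolve_filename` and its `access_url` parameter are hypothetical and not taken from drs_downloader or TNU; the sketch assumes the actual file name is the last element of the URL path.

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse


def resolve_filename(name: Optional[str], access_url: str) -> str:
    """Return `name` when it is set, else the last path element of `access_url`.

    Hypothetical helper: the DRS `name` field is optional, so a fallback
    is needed whenever it is null or empty.
    """
    if name:
        return name
    # urlparse drops the query string, so signing parameters are ignored.
    filename = posixpath.basename(urlparse(access_url).path)
    if not filename:
        raise ValueError(f"cannot derive a file name from {access_url!r}")
    return filename
```

With this, `f"{drs_object.name}"` would no longer produce the literal string "None" for objects whose name is null.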

mbaumann-broad · 2022-12-16T01:54:14Z

drs_downloader/cli.py

+
+@cli.command()
+@click.option("--silent", "-s", is_flag=True, show_default=True, default=False, help="Display nothing.")
+@click.option("--destination_dir", "-d", show_default=True, default='/tmp/testing',


I think the separator in option identifiers should be changed from _ to -.

I didn't see this specified one way or the other in the specification I looked at (may have overlooked it).

Yet, it is the common convention for all the CLI commands I recall using, for example, the GNU/Linux cp command.

Agreed. Will address today in a change to this PR, including test.

Fixed and tested

mbaumann-broad · 2022-12-16T01:56:20Z

drs_downloader/cli.py

+              help="Destination directory.")
+@click.option("--manifest_path", "-m", show_default=True,
+              help="Path to manifest tsv.")
+@click.option('--drs_header', default='ga4gh_drs_uri', show_default=True,


I propose changing --drs_header to --drs-column-name as I think it is clearer and easier to understand ("header" has multiple meanings when working with HTTP/DRS).

Agreed. Will address today in a change to this PR, including test.

Fixed and tested

mbaumann-broad · 2022-12-16T19:41:11Z

README.md

+. venv/bin/activate
+pip install -r requirements.txt -r requirements-dev.txt
+```
+


I think a step is missing here.
After running the commands above then running

pytest --cov=tests

pytest reported many errors.

I then ran the following from the top-level repo directory:

pip install -e .

then pytest ran much better.

I am not sure if pip install -e . is the right/best command to run here, yet it seems like there is a missing step in the instructions.

Agreed. Will address today in a change to this PR, (change to README)

mbaumann-broad · 2022-12-16T19:44:03Z

drs_downloader/cli.py

+            logger.error((drs_object.name, 'ERROR', drs_object.size, len(drs_object.file_parts), drs_object.errors))
+            at_least_one_error = True
+    if at_least_one_error:
+        exit(99)


I recommend returning the value 1 in the case of general errors.
Although any positive exit code indicates errors, in my experience, the value 1 is most commonly used to indicate general failure.

Additional positive values may be used to indicate specific types of errors, yet IMO that is not needed for this tool.

Agreed. Will address today in a change to this PR, including test.

Fixed and tested
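A minimal sketch of the exit-code convention discussed here, assuming each DRS object carries a list of error strings as `drs_object.errors` does in the diff. `exit_code_for` is a hypothetical helper name, not something in the PR.

```python
from typing import Iterable, List


def exit_code_for(errors_per_object: Iterable[List[str]]) -> int:
    """Return 0 when no object recorded errors, else 1 (general failure)."""
    return 1 if any(errors for errors in errors_per_object) else 0
```

The command would then finish with something like `sys.exit(exit_code_for(o.errors for o in drs_objects))` in place of `exit(99)`.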

mbaumann-broad

Team, you have done a lot of great work on the drs_downloader!

Please address the comments and requested changes.
Many are straightforward and easily addressed, and some will require discussion/collaboration.

Thank you!

mbaumann-broad · 2022-12-18T17:01:33Z

drs_downloader/cli.py

+                "   or the URI header matches the uri header name in the TSV file that was specified")
+
+        for url in uris:
+            if '/' in url:


I propose using a stronger test, for example:
if url.startswith('drs://'):

Agreed. Will address today in a change to this PR, including test.
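As a sketch of the stronger test, the helpers below split manifest values into those that look like DRS URIs and those that do not. The names `is_drs_uri` and `partition_uris` are hypothetical; the check only verifies the scheme prefix, not that the identifier resolves.

```python
from typing import Iterable, List, Tuple

DRS_SCHEME = "drs://"


def is_drs_uri(value: str) -> bool:
    """True when `value` starts with drs:// and has text after the scheme."""
    value = value.strip()
    return value.startswith(DRS_SCHEME) and len(value) > len(DRS_SCHEME)


def partition_uris(values: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split values into (valid, invalid) lists, preserving order."""
    valid: List[str] = []
    invalid: List[str] = []
    for value in values:
        (valid if is_drs_uri(value) else invalid).append(value)
    return valid, invalid
```

Unlike `'/' in url`, this rejects values such as a local path that merely contain a slash.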

mbaumann-broad · 2022-12-18T17:11:30Z

drs_downloader/cli.py

+@cli.command()
+@click.option("--silent", "-s", is_flag=True, show_default=True, default=False, help="Display nothing.")
+@click.option("--destination_dir", "-d", show_default=True,
+              default="/tmp/testing", help="Destination directory.")


"/tmp/testing" is not an appropriate default dir.
The current working directory is probably an appropriate default.
The download directory should also be output for users to see before the download begins.

This comment applies to the other places the default destination appears, with the possible exception of the mock client.

Agreed. Will address today in a change to this PR, including test.

Fixed and tested

mbaumann-broad · 2022-12-18T17:34:38Z

drs_downloader/cli.py

+
+def _perform_downloads(destination_dir, drs_client, ids_from_manifest,  silent):
+    """Common helper method to run downloads."""
+    # verify parameters


Provide user output with the path to the destination directory.

@mbaumann-broad : just double checking on this one. Do you mean logging an INFO message with the full path name of the destination file?

@bwalsh
If the current user output looks like this:

$ drs_downloader terra -m tests/fixtures/terra-data.tsv -d DATA
100%|████████████████████████████████| 10/10 [00:00<00:00, 56148.65it/s]
2022-11-21 16:56:49,595 ('HG03873.final.cram.crai', 'OK', 1351946, 1)
2022-11-21 16:56:49,595 ('HG04209.final.cram.crai', 'OK', 1338980, 1)
2022-11-21 16:56:49,595 ('HG02142.final.cram.crai', 'OK', 1405543, 1)
...

It would be helpful to output the full path to the destination directory early on.
Something like this:

$ drs_downloader terra -m tests/fixtures/terra-data.tsv -d DATA
Downloading to: /home/mbaumann/anvil/DATA
100%|████████████████████████████████| 10/10 [00:00<00:00, 56148.65it/s]
2022-11-21 16:56:49,595 ('HG03873.final.cram.crai', 'OK', 1351946, 1)
2022-11-21 16:56:49,595 ('HG04209.final.cram.crai', 'OK', 1338980, 1)
2022-11-21 16:56:49,595 ('HG02142.final.cram.crai', 'OK', 1405543, 1)
...

This helps confirm to users that the data is being downloaded to their intended destination, and (hopefully) helps ensure that enough space will be available on the storage volume for the download to complete successfully.

Fixed and tested

@lbeckman314, two corrections needed:

The intention was for the "Downloading to: ..." message to be displayed where the user can see it, not written to a log file.

The full/absolute path should be displayed, not just the relative path that the user may have provided as the value of the -d option.

Please correct these.
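The two corrections could be sketched as follows. `announce_destination` is a hypothetical helper, not the function in the PR; it resolves whatever the user passed to `-d` into an absolute path and prints it to stdout rather than sending it to a logger.

```python
import sys
from pathlib import Path
from typing import Optional, TextIO


def announce_destination(destination_dir: str, stream: Optional[TextIO] = None) -> Path:
    """Print the absolute download directory for the user and return it."""
    out = stream if stream is not None else sys.stdout
    # resolve() turns a relative value such as "DATA" into an absolute path.
    absolute = Path(destination_dir).expanduser().resolve()
    print(f"Downloading to: {absolute}", file=out)
    return absolute
```

The `stream` parameter exists only so the output can be captured in a test; by default the message goes to the terminal.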

mbaumann-broad · 2022-12-18T18:03:58Z

drs_downloader/clients/terra.py

+                    checksums=[Checksum(checksum=md5_, type='md5')],
+                    id=object_id,
+                    name=name_,
+                    access_methods=[AccessMethod(access_url="", type='gs')]


The signed URL will not necessarily be for "gs"/Google.
For AnVIL data it will be Google in early 2023, then transitioning to Azure.
Today, Kids First data and NCI CRDC Proteomics Data Commons (PDC) data, which can be accessed via Terra DRS, is on AWS.

@mbaumann-broad : understand and agree. We do need to discuss / test how / where the downloader will make this decision. Creating issue. #18
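One possible direction for issue #18, offered only as a sketch: derive the access type from the host of the signed URL instead of hard-coding 'gs'. The function name `guess_access_type` and the host patterns are assumptions for illustration, and mapping anything unrecognized (including Azure Blob Storage hosts) to 'https' is a design choice of this sketch, not something decided in this thread.

```python
from urllib.parse import urlparse


def guess_access_type(signed_url: str) -> str:
    """Guess an access-method type from the signed URL's host name."""
    host = (urlparse(signed_url).hostname or "").lower()
    if host == "storage.googleapis.com" or host.endswith(".storage.googleapis.com"):
        return "gs"
    if host.endswith(".amazonaws.com"):
        return "s3"
    # Fallback for Azure and any other HTTPS-served signed URL.
    return "https"
```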

mbaumann-broad · 2022-12-18T18:37:14Z

drs_downloader/manager.py

+
+        drs_objects = []
+
+        total_batches = len(object_ids) / self.max_simultaneous_object_retrievers


This code pattern to determine the number of batches based on a number of elements and batch size appears in multiple places in the drs_downloader code.
I suggest making a utility function that does that, and using it everywhere the total number of batches needs to be determined.
Or, just implement using a single line using math.ceil as is done elsewhere like this:

total=math.ceil(len(parts) / self.max_simultaneous_part_handlers)

Agreed. Will address with a test in this PR
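A sketch of such a utility, with hypothetical names `total_batches` and `batched`. Plain division, as in the diff, yields a float such as 3.33 for 10 items in batches of 3, whereas `math.ceil` gives the 4 batches actually needed.

```python
import math
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def total_batches(item_count: int, batch_size: int) -> int:
    """Number of batches needed to cover `item_count` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(item_count / batch_size)


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Both the progress-bar total and the loop over batches can then share one definition of the batch count.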

mbaumann-broad · 2022-12-18T19:03:38Z

drs_downloader/clients/terra.py

+        """
+        data = {
+            "url": object_id,
+            "fields": ["fileName", "size", "hashes", "accessUrl"]


The drs_downloader code appears to get the "accessUrl"(/signed URL) again before actually downloading the data. If that is the case, the "accessUrl" should not be requested here in get_object.
This is because the other fields ("fileName", "size", "hashes") can be obtained from the first request to the DRS server, yet getting the "accessUrl" requires a second request to the DRS server and performing the relatively expensive operation of signing the URL.
If the "accessUrl" obtained here is never actually used, then you should just request the other fields here and not "accessUrl".

Agreed. Will address with a test in this PR
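To make the suggestion concrete, here is a sketch of building the request body with and without "accessUrl". The "url" and "fields" keys come from the diff above; the helper name `build_request_body` and its flag are hypothetical.

```python
from typing import Dict, List, Union

# Fields that, per the comment above, do not require signing a URL.
METADATA_FIELDS = ["fileName", "size", "hashes"]


def build_request_body(
    object_id: str, include_access_url: bool = False
) -> Dict[str, Union[str, List[str]]]:
    """Build the request body; ask for accessUrl only when it will be used."""
    fields = list(METADATA_FIELDS)
    if include_access_url:
        fields.append("accessUrl")
    return {"url": object_id, "fields": fields}
```

`get_object` would call it with the default, and only the step that is about to download would pass `include_access_url=True`.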

mbaumann-broad · 2022-12-18T19:07:03Z

drs_downloader/clients/terra.py

+                    raise Exception(
+                        f"A valid URL was not returned from the server.  Please check the access for {account}\n{resp}")
+                url_ = resp['accessUrl']['url']
+                drs_object.access_methods = [AccessMethod(access_url=url_, type='gs')]


As also noted in a subsequent comment ...
The signed URL will not necessarily be for "gs"/Google.
For AnVIL data it will be Google in early 2023, then transitioning to Azure.
Today, Kids First data and NCI CRDC Proteomics Data Commons (PDC) data, which can be accessed via Terra DRS, is on AWS.

Creating issue #18

mbaumann-broad · 2022-12-18T19:30:23Z

drs_downloader/manager.py

+
+        elif any(drs_object.size > (1 * GB) for drs_object in drs_objects):
+            self.max_simultaneous_part_handlers = 10
+            self.part_size = 10 * MB


The part_size for large files should be at least 64 MB and probably 128 MB (or even much larger).
Getm is thoroughly tested with large files and uses a default chunk size of 128 MB.
Also, FYI, for getm, the chunk size is the size of each read from the stream; it does not perform a new HTTP request for each chunk as drs_downloader does for each part.

Agreed. Will address with a test in this PR

Fixed and tested
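To show how part size drives the number of requests, here is a sketch that plans inclusive byte ranges for an object. `plan_parts` is a hypothetical function, and MB and GB are assumed to be binary units (1024-based), which may differ from the constants in manager.py.

```python
from typing import List, Tuple

MB = 1024 * 1024
GB = 1024 * MB


def plan_parts(object_size: int, part_size: int = 128 * MB) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) byte ranges that cover the object."""
    if part_size < 1:
        raise ValueError("part_size must be at least 1")
    parts = []
    start = 0
    while start < object_size:
        # The last part may be shorter than part_size.
        end = min(start + part_size, object_size) - 1
        parts.append((start, end))
        start = end + 1
    return parts
```

Under these assumptions a 1 GB object needs 103 requests with 10 MB parts and 8 requests with 128 MB parts.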

mbaumann-broad · 2022-12-18T19:34:52Z

drs_downloader/manager.py

+            parts.append((start, size, ))
+
+        if len(parts) > 1000:
+            logger.error(f'tasks > 1000 {drs_object.name} has over 1000 parts, consider optimization. ({len(parts)})')


Even if the part size for large files is increased to 128 MB (as commented elsewhere), this code would log an error for any file over 128 GB. A substantial number of NIH genomic data files are in the 500 GB to 750 GB range, and downloading these should not result in an error.
I am inclined to think drs_downloader should be tested on files up to 1 TB (not regularly, yet often enough to verify that it works successfully).

@mbaumann-broad Agreed re. testing over 1TB. Question re. this log message though. Should we not log anything? Log a warning instead?

Yes, logging as a warning is fine/good.
In general, it makes sense to log as warnings any conditions that are unexpected/unusual yet not necessarily problematic.
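A sketch of the agreed change, using a hypothetical helper `warn_if_many_parts` and logger name. The threshold of 1000 comes from the diff; everything else is illustrative.

```python
import logging

logger = logging.getLogger("drs_downloader.example")  # hypothetical logger name

MAX_PARTS_BEFORE_WARNING = 1000


def warn_if_many_parts(object_name: str, part_count: int) -> bool:
    """Log a warning (not an error) when an object has an unusual number of parts."""
    if part_count > MAX_PARTS_BEFORE_WARNING:
        logger.warning(
            "%s has %d parts (more than %d); consider a larger part size.",
            object_name, part_count, MAX_PARTS_BEFORE_WARNING,
        )
        return True
    return False
```

The download still proceeds; the return value only reports whether the warning fired.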

mbaumann-broad · 2022-12-18T20:09:38Z

drs_downloader/manager.py

+            await f for f in asyncio.as_completed(tasks)
+        ]
+
+        # second, download the parts


I think for really large files (many tens or hundreds of GB), the signed URL will expire before all of the parts have at least started downloading. We can test this to verify my understanding based on reading the code, yet I am pretty sure that is true.

The TDR DRS signed URLs expire in 15 minutes, and Gen3 DRS signed URLs expire in 60 minutes, yet I am pretty sure that for file sizes in the range of hundreds of GB the signed URL expiration would occur even with the 60-minute timeout.

@mbaumann-broad Do we know what the http status will be in this scenario? Address in separate PR #19
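One way the expiration concern might be handled, sketched under stated assumptions and not what this PR implements: keep the signed URL together with the time it was obtained, and re-sign shortly before it expires. `SignedUrlCache`, its parameters, and the `sign` callable are all hypothetical; the 15-minute and 60-minute lifetimes mentioned above would be passed in as `ttl_seconds`.

```python
import time
from typing import Callable, Optional


class SignedUrlCache:
    """Hand out a signed URL, re-signing once it is close to expiring."""

    def __init__(
        self,
        sign: Callable[[], str],
        ttl_seconds: float,
        safety_margin_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._sign = sign
        self._ttl = ttl_seconds
        self._margin = safety_margin_seconds
        self._clock = clock
        self._url: Optional[str] = None
        self._signed_at = 0.0

    def get(self) -> str:
        now = self._clock()
        stale = self._url is None or (now - self._signed_at) >= (self._ttl - self._margin)
        if stale:
            # Re-sign before the old URL can expire mid-download.
            self._url = self._sign()
            self._signed_at = now
        return self._url
```

Each part download would call `get()` just before issuing its request, so parts that start late in a long transfer use a fresh URL.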

- #15

Update tests based on Michael's review Fix build file

mbaumann-broad

Some comments on recent updates.
I'm still reviewing ...

mbaumann-broad · 2022-12-20T20:09:07Z

drs_downloader/cli.py

+
+def _perform_downloads(destination_dir, drs_client, ids_from_manifest,  silent):
+    """Common helper method to run downloads."""
+    # verify parameters


@lbeckman314, two corrections needed:

The intention was for the "Downloading to: ..." message to be displayed where the user can see it, not written to a log file.

The full/absolute path should be displayed, not just the relative path that the user may have provided as the value of the -d option.

Please correct these.

mbaumann-broad · 2022-12-20T20:19:34Z

drs_downloader/manager.py

+stdout_handler = logging.StreamHandler(sys.stdout)
+stdout_handler.setLevel(logging.DEBUG)
+
+file_handler = logging.FileHandler('logs.log')


Please use a more descriptive name for the log file, such as <program_name>.log
For example: drs_downloader.log

Fixed (drs_downloader.log)

Fixed and tested ("Downloading to...")

mbaumann-broad · 2022-12-20T20:21:02Z

README.md

@@ -108,7 +108,7 @@ Usage: drs_download terra [OPTIONS]

 Options:
  -s, --silent                Display nothing.
-  -d, --destination_dir TEXT  Destination directory.  [default: /tmp/testing]
+  -d, --destination_dir TEXT  Destination directory.  [default: os.getcwd()]


Many users will not know what os.getcwd() means.
Instead, please use:
[default: current directory]

mbaumann-broad · 2022-12-20T20:24:48Z

drs_downloader/clients/terra.py

+                    checksums=[Checksum(checksum=md5_, type='md5')],
+                    id=object_id,
+                    name=name_,
+                    access_methods=[AccessMethod(access_url="", type='gs')]


Is the following line still necessary here, given the "accessUrl" is no longer being retrieved at this point?

access_methods=[AccessMethod(access_url="", type='gs')]

mbaumann-broad

I finished looking over the recent updates and left some comments regarding these.

mbaumann-broad · 2022-12-20T20:32:16Z

drs_downloader/manager.py

+                ({len(parts)})')
+
+        paths = []
+        # TODO - tqdm ugly here?


Does the "tqdm ugly here?" refer to all the flashing displayed in the user's terminal as the parts are downloaded?
If so, that is indeed ugly - especially for small part sizes!
As it stands now, the behavior for small files and small part sizes is unbearable.
Perhaps write the part download status information to the log instead?

Keeping this may make more sense if the minimum part size were larger (e.g., 128-512 MB).

Just created an issue for this #21

mbaumann-broad · 2022-12-20T20:50:30Z

drs_downloader/manager.py

+            parts.append((start, size, ))
+
+        if len(parts) > 1000:
+            logger.error(f'tasks > 1000 {drs_object.name} has over 1000 parts, consider optimization. ({len(parts)})')


Why is this still being logged at error level? We agreed this should be logged as a warning:

logger.error(f'tasks > 1000 {drs_object.name} has over 1000 parts, consider optimization. ({len(parts)})')

[skip ci]

bwalsh added 2 commits November 10, 2022 18:13

Adds manager

abdb193

Install the dev dependencies, test mock, skip terra test.

828de8c

bwalsh force-pushed the feature/manager branch from 69c76c5 to 828de8c Compare November 11, 2022 02:31

bwalsh requested review from lbeckman314 and matthewpeterkort November 11, 2022 02:34

Base automatically changed from feature/flake8 to main November 11, 2022 20:21

matthewpeterkort reviewed Nov 11, 2022

drs_downloader/cli.py (outdated, resolved)

matthewpeterkort reviewed Nov 11, 2022

drs_downloader/cli.py (outdated, resolved)

bwalsh added 2 commits November 16, 2022 11:28

Moves constant to proper location

2f2f371

Optimizes memory use

efef7b8

bwalsh commented Nov 18, 2022

drs_downloader/manager.py (outdated, resolved)

bwalsh commented Nov 18, 2022

drs_downloader/manager.py (outdated, resolved)

bwalsh commented Nov 18, 2022

drs_downloader/manager.py (outdated, resolved)

matthewpeterkort reviewed Nov 18, 2022

drs_downloader/cli.py (resolved)

matthewpeterkort reviewed Nov 18, 2022

drs_downloader/manager.py (outdated, resolved)

matthewpeterkort reviewed Nov 18, 2022

drs_downloader/clients/terra.py (resolved)

matthewpeterkort reviewed Nov 18, 2022

drs_downloader/manager.py (resolved)

lbeckman314 reviewed Nov 22, 2022

.github/workflows/python-app.yml (outdated, resolved)

lbeckman314 approved these changes Nov 22, 2022

.github/workflows/python-app.yml (outdated, resolved)

lbeckman314 force-pushed the feature/manager branch from 6000e16 to 71508d9 Compare November 23, 2022 01:39

replaced pandas with csv, added optimizer, added download sort by fil…

1d903b7

…e size, cleaned up math.ceil, removed some TQDMs that were not necessary, restructured signing and downloading into a batching format Update requirements

lbeckman314 force-pushed the feature/manager branch from 71508d9 to 1d903b7 Compare November 23, 2022 02:05

bwalsh commented Dec 13, 2022

drs_downloader/cli.py (outdated, resolved)

bwalsh added 13 commits December 13, 2022 15:28

Cleans up code

111dcc9

Adds integration tests

1ae93c1

Adds gen3

db3949c

Adds testplan outline

aad0fa5

flake8

5bc84e9

Minor cleanup

e44745b

Adds check for correct object size; skip if errors.

4662a12

Adds failure tests to mock

526d3aa

Capture exceptions in drs_object.errors, close session

8d19c36

Uses logger instead of print

02d9135

Fix docstring

e4866ad

Flake8

40db04f

Speed up by checking only our code

4213252

mbaumann-broad requested changes Dec 16, 2022

mbaumann-broad requested changes Dec 18, 2022

lbeckman314 self-requested a review December 19, 2022 19:02

Addressed Michael's review comments resolved basic problems

c12b5bc

- #15

lbeckman314 force-pushed the feature/manager branch from 914c8a4 to 338943f Compare December 20, 2022 00:00

added optimizer part size test and terra default directory download test

25dfcb0

Update tests based on Michael's review Fix build file

lbeckman314 force-pushed the feature/manager branch from 34f905f to 25dfcb0 Compare December 20, 2022 00:23

lbeckman314 approved these changes Dec 20, 2022

mbaumann-broad requested changes Dec 20, 2022

lbeckman314 mentioned this pull request Dec 20, 2022

TQDM flashing behavior for small parts #21

Closed

lbeckman314 added 2 commits December 21, 2022 11:48

Output download destination to stdout

6410e24

Fix README example

eae8fec

[skip ci]

lbeckman314 merged commit 9b997d3 into main Dec 21, 2022

lbeckman314 deleted the feature/manager branch December 21, 2022 19:57

