Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download weight file from URL #349

Merged
merged 19 commits into from
Aug 2, 2024
Merged

Download weight file from URL #349

merged 19 commits into from
Aug 2, 2024

Conversation

Lilferrit
Copy link
Contributor

In order to address #108, the --model parameter now supports URLs. If the --model parameter is a valid URL the Casanovo CLI will first look in the cache for the weight file (cache resolution is done via hashing the URL), and if the weight file isn't cached the CLI will attempt to download the weight file and cache it. The CLI will raise an error if the --model parameter is set but it's not a valid URL or file in the local filesystem.

However this implementation does have an issue where if the weight file at a URL is updated, but the CLI has previously downloaded weights from that URL, the new weights won't be downloaded because there will still be a cache hit. To fix this I could introduce a --force option that skips the cache and always attempts to download and cache weights given a URL.

@Lilferrit Lilferrit linked an issue Jun 28, 2024 that may be closed by this pull request
Copy link

codecov bot commented Jun 28, 2024

Codecov Report

Attention: Patch coverage is 91.80328% with 5 lines in your changes missing coverage. Please review.

Project coverage is 94.03%. Comparing base (276a50e) to head (c2bc971).

Files Patch % Lines
casanovo/casanovo.py 91.80% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #349      +/-   ##
==========================================
+ Coverage   89.98%   94.03%   +4.04%     
==========================================
  Files          12       12              
  Lines         979     1022      +43     
==========================================
+ Hits          881      961      +80     
+ Misses         98       61      -37     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bittremieux bittremieux changed the base branch from main to dev June 28, 2024 08:13
@Lilferrit Lilferrit marked this pull request as ready for review July 2, 2024 21:06
Copy link
Collaborator

@bittremieux bittremieux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small comments.

Two key things are still missing as well:

  1. Documentation of this functionality on readthedocs (maybe on the file formats page).
  2. Unit tests for the download behavior.

casanovo/casanovo.py Outdated Show resolved Hide resolved
casanovo/casanovo.py Outdated Show resolved Hide resolved
casanovo/casanovo.py Outdated Show resolved Hide resolved
casanovo/casanovo.py Outdated Show resolved Hide resolved
casanovo/casanovo.py Outdated Show resolved Hide resolved
casanovo/casanovo.py Outdated Show resolved Hide resolved
casanovo/casanovo.py Outdated Show resolved Hide resolved
casanovo/casanovo.py Outdated Show resolved Hide resolved
casanovo/casanovo.py Outdated Show resolved Hide resolved
casanovo/casanovo.py Outdated Show resolved Hide resolved
@bittremieux
Copy link
Collaborator

However this implementation does have an issue where if the weight file at a URL is updated, but the CLI has previously downloaded weights from that URL, the new weights won't be downloaded because there will still be a cache hit.

You could also check the "last-modified" information for the URL and verify whether this is more recent than the download time or not.

@Lilferrit
Copy link
Contributor Author

I adjusted the hashing implementation such that now files downloaded from a user specified URL will now be downloaded to <cache_dir>/<url_hash>/<file_name>.ckpt (where filename is the actual filename in the url). Files downloaded automatically by _get_model_weights are still downloaded directly to the cache directory, the hash sub-dirs are only created when a custom url is specified.

@Lilferrit
Copy link
Contributor Author

A few small comments.

Two key things are still missing as well:

  1. Documentation of this functionality on readthedocs (maybe on the file formats page).
  2. Unit tests for the download behavior.

In regards to the readthedocs documentation it looks like weight files aren't mentioned at all in the "Input Files" section on the file formats page. Is this something we want to add?

@bittremieux
Copy link
Collaborator

Yes, this new (and also old) behavior of weights file downloading should still be added to the documentation.

Copy link
Collaborator

@bittremieux bittremieux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's starting to look quite nice for a complex piece of code. A few more comments.

file_response = requests.head(file_url)
if file_response.ok:
if "Last-Modified" in file_response.headers:
url_last_modified = email.utils.parsedate_to_datetime(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: I'm reluctant to include a library with rather different functionality for just a utility function (even though it's part of the standard libraries). Do we need extra functionality that strptime doesn't offer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, I changed it to use strptime instead.

import logging
import urllib
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: I don't think this is necessary (rather the relevant import is urllib.parse a few lines further down).

if _is_valid_url(model):
model = _get_weights_from_url(model, cache_dir)
elif not Path(model).is_file():
raise ValueError(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: Add a similar logging statement.

@@ -11,6 +13,14 @@ When you're ready to use Casanovo for *de novo* peptide sequencing, you can inpu
All three of the above file formats can be used as input to Casanovo for *de novo* peptide sequencing.
As the official PSI standard format containing the complete information from a mass spectrometry run, mzML should typically be preferred.

### Model Weights

In addition to MS/MS spectra, Casanova also optionally accepts a model weights (.ckpt) input file when running in training, sequencing or evaluating mode.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "Casanovo"

docs/file_formats.md Outdated Show resolved Hide resolved
tests/unit_tests/test_unit.py Show resolved Hide resolved
casanovo/casanovo.py Show resolved Hide resolved
casanovo/casanovo.py Show resolved Hide resolved
@Lilferrit Lilferrit requested a review from bittremieux July 30, 2024 16:04
Copy link
Collaborator

@bittremieux bittremieux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice implementation and great work on the unit tests. 👏

@Lilferrit Lilferrit merged commit 9e3b630 into dev Aug 2, 2024
@Lilferrit Lilferrit deleted the download-weights branch August 2, 2024 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow specifying different model weights from GitHub Assets
2 participants