-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't crash when multiple beams have identical peptide scores #306
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## dev #306 +/- ##
=======================================
Coverage 89.77% 89.77%
=======================================
Files 12 12
Lines 929 929
=======================================
Hits 834 834
Misses 95 95 ☔ View full report in Codecov by Sentry. |
melihyilmaz
reviewed
Feb 26, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the removal of an existing unit test, looks good to me.
melihyilmaz
approved these changes
Feb 27, 2024
bittremieux
added a commit
that referenced
this pull request
May 14, 2024
* Remove `train_from_scratch` config option (#275) Instead of having to specify `train_from_scratch` in the config file, training will proceed from an existing model weights file if this is given as an argument to `casanovo train`. Fixes #263. * Stabilize torch.topk() behavior (#290) * Add epsilon to index zero * Fix typo * Use base PyTorch for repeating along the vocabulary size * Combine masking steps * Lint with updated black version * Lint test files * Add topk unit test * Fix lint * Add fixme comment for future * Update changelog * Generate new screengrabs with rich-codex --------- Co-authored-by: Wout Bittremieux <[email protected]> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * Rename max_iters to cosine_schedule_period_iters (#300) * Rename max_iters to cosine_schedule_period_iters * Add deprecated config option unit test * Fix missed rename * Proper linting * Remove unnecessary logging * Test that checkpoints with deprecated config options can be loaded * Minor change * Add test for fine-tuning with deprecated config options * Remove deprecated hyperparameters during model loading * Include deprecated hyperparameter warning * Test whether the warning is issued * Verify that the deprecated option is removed * Fix comments * Avoid defining deprecated options twice * Remap previous renamed config option `every_n_train_steps` * Update changelog --------- Co-authored-by: melihyilmaz <[email protected]> * Add FAQ entry about antibody sequencing * Don't crash when multiple beams have identical peptide scores (#306) * Test different beams with identical scores * Randomly break ties for beams with identical peptide score * Update changelog * Don't remove unit test * Allow csv to handle all newlines (#316) * Add 9-species model weights link to FAQ (#303) * Add model weights link * Generate new screengrabs with rich-codex * Clarify that these weights should only be used for benchmarking --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Wout Bittremieux <[email protected]> * Add FAQ entry about antibody sequencing (#304) * Add FAQ entry about antibody sequencing * Generate new screengrabs with rich-codex --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Melih Yilmaz <[email protected]> * Allow csv to handle all newlines The `csv` module tries to handle newlines itself. On Windows, this leads to line endings of `\r\r\n` instead of `\r\n`. Setting `newline=''` produces the intended output on both platforms. * Update CHANGELOG.md * Fix linting issue * Delete docs/images/help.svg --------- Co-authored-by: Melih Yilmaz <[email protected]> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Wout Bittremieux <[email protected]> Co-authored-by: William Stafford Noble <[email protected]> Co-authored-by: Wout Bittremieux <[email protected]> * Don't test on macOS versions with MPS (#327) * Prepare for release v4.2.0 * Update CHANGELOG.md (#332) --------- Co-authored-by: Melih Yilmaz <[email protected]> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: melihyilmaz <[email protected]> Co-authored-by: wsnoble <[email protected]> Co-authored-by: Joshua Klein <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #271.
The problem was that if there are different beams, with different predicted amino acid sequences (i.e. tokens at this phase in the code), but an identical peptide score, beam caching fails. This likely occurs when there are multiple predictions that can't be distinguished, and thus only occurs when using multiple beams for ambiguous spectra.
The exact failure was because the information that is cached for each beam is a tuple of (peptide score, array of amino acid scores, array of amino acid tokens). When comparing tuples, first the first element is used, in case of ties the second element is used, and so on. In this case, the second element is an array, which doesn't have an obvious "truthy" value, leading to the observed error. However, we actually don't even want to compare those arrays, but compare beams based on the peptide score only.
This is now addressed by adding a completely random float as the second element in those tuples, before the array of amino acid scores. It's vanishingly unlikely that those random numbers would ever be equal, so in effect this arbitrarily breaks ties in case of equal peptide scores.