
313 save final model #340

Merged: 11 commits into dev on Jun 28, 2024

Conversation

@Lilferrit (Contributor)

To ensure the final model is always saved, the following lines were added to the end of ModelRunner.train:

# Always save final model weights at the end of training
if self.config.model_save_folder_path is not None:
    self.trainer.save_checkpoint(
        os.path.join(
            self.config.model_save_folder_path,
            "train-run-final.ckpt"
        )
    )

This implementation was tested with a small training run in two cases: where val_check_interval is not a factor of the total number of training steps, and where val_check_interval is greater than the total number of training steps. In both cases the final model checkpoint was saved.

@Lilferrit (Contributor Author)

Added the final epoch number to the file name; the implementation is now:

# Always save final model weights at the end of training
if self.config.model_save_folder_path is not None:
    self.trainer.save_checkpoint(
        os.path.join(
            self.config.model_save_folder_path,
            f"train-run-final-{self.trainer.current_epoch}.ckpt"
        )
    )

@wsnoble requested a review from @bittremieux on June 25, 2024.
@bittremieux (Collaborator) left a comment:

Some minor comments to address.

However, can't we get the same result much more easily by always setting enable_checkpointing to True when creating the Trainer and adding a default ModelCheckpoint callback? See the Trainer documentation. That way we can benefit from letting Lightning handle all of this.

Three review comments on casanovo/denovo/model_runner.py (outdated, resolved).
@Lilferrit (Contributor Author)

> Some minor comments to address.
>
> However, can't we get the same result much more easily by always setting enable_checkpointing to True when creating the Trainer and adding a default ModelCheckpoint callback? See the Trainer documentation. That way we can benefit from letting Lightning handle all of this.

Agreed - I've reimplemented this using a ModelCheckpoint instead, and I will push it upstream once I've done some more testing on my end. However, regarding the enable_checkpointing option: it looks like it only adds a default callback if no user-defined callbacks are passed to callbacks, so it would effectively do nothing once the validation ModelCheckpoint is added. To always save the final model, I instead added a new ModelCheckpoint that fires at the end of every training epoch.
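
For illustration, a minimal sketch of that enable_checkpointing behavior, assuming Lightning's documented Trainer semantics (the import path and monitor name follow the code in this PR; everything else is illustrative):

import lightning.pytorch as pl
from lightning.pytorch.callbacks import ModelCheckpoint

# Because a ModelCheckpoint is already present in `callbacks`,
# enable_checkpointing=True does not inject another default callback.
trainer = pl.Trainer(
    enable_checkpointing=True,
    callbacks=[ModelCheckpoint(monitor="valid_CELoss", mode="min")],
)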

@Lilferrit (Contributor Author)

Reimplemented using ModelCheckpoint; the last lines of the ModelRunner constructor are now:

# Configure checkpoints.
self.callbacks = [
    ModelCheckpoint(
        dirpath=config.model_save_folder_path,
        save_on_train_epoch_end=True,
    )
]

if config.save_top_k is not None:
    self.callbacks.append(
        ModelCheckpoint(
            dirpath=config.model_save_folder_path,
            monitor="valid_CELoss",
            mode="min",
            save_top_k=config.save_top_k,
        )
    )

@bittremieux (Collaborator) left a comment:

Great, I think that this is a better solution.

The remaining thing to do is to add unit tests that verify that the final model is saved in different situations (different values of steps and epochs). You could also add tests to check that the periodic checkpoints are created properly.
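
For illustration, such a test might look roughly like the sketch below; the fixture names (tiny_config, mgf_small) and the ModelRunner usage are assumptions based on this discussion, not the merged test code:

from casanovo.denovo.model_runner import ModelRunner

def test_final_checkpoint_saved(tmp_path, tiny_config, mgf_small):
    # Hypothetical fixtures: tiny_config is a minimal Config and mgf_small
    # a tiny peak file. Train briefly, then check a checkpoint was written.
    config = tiny_config
    config.model_save_folder_path = str(tmp_path)
    config.max_epochs = 1
    with ModelRunner(config=config) as runner:
        runner.train([mgf_small], [mgf_small])
    assert any(path.suffix == ".ckpt" for path in tmp_path.iterdir())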

@bittremieux linked an issue on Jun 26, 2024 that may be closed by this pull request.
codecov bot commented on Jun 26, 2024:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.88%. Comparing base (70ea9fc) to head (a743dc5).

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #340      +/-   ##
==========================================
+ Coverage   89.77%   89.88%   +0.10%     
==========================================
  Files          12       12              
  Lines         929      929              
==========================================
+ Hits          834      835       +1     
+ Misses         95       94       -1     


@Lilferrit (Contributor Author)

> Great, I think that this is a better solution.
>
> The remaining thing to do is to add unit tests that verify that the final model is saved in different situations (different values of steps and epochs). You could also add tests to check that the periodic checkpoints are created properly.

Sounds good, I added some unit tests that ensure the last model weights are saved in the scenarios where val_check_interval is greater than, or not a factor of, the number of training steps. Unfortunately, the ModelCheckpoint that saves a checkpoint at the end of every training epoch deletes the previous epoch's checkpoint when a new one is saved (it doesn't touch the validation checkpoints), and CliRunner.invoke is blocking, so I couldn't think of a practical way to test whether the periodic checkpoints are saved properly.
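
For context, that deletion behavior follows from Lightning's defaults; a sketch of the relevant callback (the path is illustrative, not the project's configuration):

from lightning.pytorch.callbacks import ModelCheckpoint

# With monitor=None and the default save_top_k=1, Lightning keeps only the
# most recent epoch-end checkpoint and deletes the previous epoch's file.
# Setting save_top_k=-1 would instead retain every epoch's checkpoint.
epoch_checkpoint = ModelCheckpoint(
    dirpath="checkpoints/",        # illustrative path
    save_on_train_epoch_end=True,  # fire at the end of every training epoch
)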

@bittremieux (Collaborator) left a comment:

A few suggestions for the tests.

Five review comments on tests/conftest.py and one on tests/test_integration.py (outdated, resolved).
@Lilferrit (Contributor Author)

Sounds good, I factored the save-final-model check into a separate unit test, test_save_final_model, in test_runner.py.

@bittremieux (Collaborator)

Great! The final thing to do is update the changelog.

@Lilferrit (Contributor Author)

Sounds great, I added an entry to the changelog.

@Lilferrit merged commit 7372eb0 into dev on Jun 28, 2024.
6 checks passed
@Lilferrit deleted the 313-save-final-model branch on June 28, 2024.