Mitchish mosaic run on its own branch #350
Conversation
--epoch=1 \
--optimizer.learning_rate=0.000023 \
--scheduler.t_warmup=556000 \
--scheduler.t_max=557000 \
--scheduler.alpha_f=0.001 \
--stop_at=557001
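To make the effect of these flags concrete, here is a minimal sketch of the linear-with-warmup schedule they imply: warm up to the peak learning rate over `t_warmup` steps, then decay linearly toward `alpha_f` times the peak by `t_max`. The formula is a common convention, not necessarily the trainer's exact implementation.

```python
# Hypothetical sketch of the schedule implied by the flags above --
# not the actual trainer code.

def linear_with_warmup_lr(step, max_lr=0.000023, t_warmup=556000,
                          t_max=557000, alpha_f=0.001):
    """Linear warmup to max_lr, then linear decay to alpha_f * max_lr."""
    if step < t_warmup:
        return max_lr * step / t_warmup
    frac = min(1.0, (step - t_warmup) / (t_max - t_warmup))
    # interpolate from 1.0 down to alpha_f over the decay window
    return max_lr * ((1.0 - frac) + frac * alpha_f)
```

With these numbers the decay window is only 1,000 steps (556,000 to 557,000), which matches the comment below about this being the final run to decay the LR down to (nearly) 0.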
This was the last thing we ran to decay the LR down to 0. I think it's good to keep for reference but maybe I'll just comment it out instead.
@@ -49,7 +49,7 @@ optimizer:
   metrics_log_interval: 10

 scheduler:
-  name: cosine_with_warmup
+  name: linear_with_warmup
I thought this was linear the whole time?
I guess when we first made this config we were thinking cosine. We've only run it with linear, though.
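For reference, the two schedule names differ only in the decay shape after warmup. This illustrative comparison uses the standard formulas for each; it is a sketch, not the trainer's exact code.

```python
# Compare linear vs. cosine decay multipliers after an identical warmup.
import math

def warmup_then(step, t_warmup, t_max, decay):
    if step < t_warmup:
        return step / t_warmup               # linear warmup to 1.0
    frac = min(1.0, (step - t_warmup) / (t_max - t_warmup))
    return decay(frac)                       # multiplier on the peak LR

linear = lambda f: 1.0 - f
cosine = lambda f: 0.5 * (1.0 + math.cos(math.pi * f))

# Both hit 0.5 at the midpoint of decay, but cosine stays higher
# early in the decay and drops faster late.
for step in (325, 550, 775):                 # 25%, 50%, 75% through decay
    print(round(warmup_then(step, 100, 1000, linear), 3),
          round(warmup_then(step, 100, 1000, cosine), 3))
```

So switching the config name to `linear_with_warmup` makes the config match what was actually run, rather than changing training behavior.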
@@ -430,7 +429,7 @@ def __init__(self, layer_id: int, config: ModelConfig, cache: BufferCache):
     self.__cache = cache
     assert config.d_model % config.n_heads == 0

-    self._activation_checkpoint_fn = pass_through_fn
+    self._activation_checkpoint_fn = None
I find this confusing. Doesn't that mean it will compile only if we don't use checkpointing? As far as I know, compile never likes function pointers?
Or did compile + checkpointing never work anyways?
I'm not sure if we ever got compile to work with checkpointing, but I needed to make this change in order for compile to work without checkpointing.
See also #368.
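For readers following along, here is a minimal sketch of the pattern this diff moves to: store `None` when activation checkpointing is disabled and branch on it explicitly, rather than always calling through a stored pass-through function. The class and method names here are illustrative, not the actual OLMo code.

```python
# Hypothetical sketch: an explicit None check instead of an indirect call
# through a stored function pointer, which graph compilers tend to handle
# better when checkpointing is off.

class Block:
    def __init__(self):
        # None when activation checkpointing is off; a checkpoint wrapper
        # (e.g. torch.utils.checkpoint.checkpoint in the real code) otherwise.
        self._activation_checkpoint_fn = None

    def _heavy(self, x):
        return x * 2  # stand-in for the attention/MLP computation

    def forward(self, x):
        if self._activation_checkpoint_fn is not None:
            return self._activation_checkpoint_fn(self._heavy, x)
        # Plain direct call on the no-checkpointing path: nothing for the
        # compiler to trace through besides the method itself.
        return self._heavy(x)

b = Block()
print(b.forward(3))  # 6
```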
To restart the MosaicML run:

The run is configured to automatically find the latest checkpoint on S3 and restart from there. The --follow flag is not necessary, but it's a good idea to follow the logs until at least the first batch is processed successfully.

Code changes:

It turns out I had to make some minor code changes to how we handle no activation checkpointing, in order for compile to continue to work with the latest changes.