Rewrite bash scripts into Python interfaces #2
base: main
Conversation
parser = argparse.ArgumentParser()
parser.add_argument("--task", type=str, required=True, help="Task name (e.g. tydiqa, mmlu), will be used to store the gradients")
parser.add_argument("--data_dir", type=str, required=True, help="Path to data directory, can also be a full path or a HF repo name")
parser.add_argument("--val_task_load_method", default=None, type=str, required=False, help="The method to load the validation data, can be 'hf', 'local_hf', 'local_json'")
Shouldn't this one be required? Looks like with None the script will fail at the dataset loading stage
You're right, with None it would fail (fixed). I originally made it optional because the hardcoded initial datasets like tydiqa don't need it, but realistically we will only run our own datasets, so it is required now.
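For reference, a minimal sketch of how the argument could look after the fix (restricting the values via choices is an assumption based on the help text, not necessarily what the updated code does):

parser.add_argument(
    "--val_task_load_method",
    type=str,
    required=True,
    choices=["hf", "local_hf", "local_json"],  # assumed from the help text
    help="The method to load the validation data, can be 'hf', 'local_hf', 'local_json'")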
"--model_path", model_checkpoint_path, | ||
"--output_path", output_path, | ||
"--gradient_projection_dimension", str(args.dims), | ||
"--gradient_type", "adam" |
Sanity check question: Adam and AdamW provide the same type of gradient data, am I right?
In this repo specifically, AdamW is used across all computations (even though the variable says just "adam"): it is used both for warming up the model and for computing the gradients of the train data. For the gradients of the eval data, gradient_type is set to "sgd".
Here is the difference in computation:
- SGD
def obtain_gradients(model, batch):
    """ obtain gradients. """
    loss = model(**batch).loss
    loss.backward()
    vectorized_grads = torch.cat(
        [p.grad.view(-1) for p in model.parameters() if p.grad is not None])
    return vectorized_grads
- Adam(W)
def prepare_optimizer_state(model, optimizer_state, device):
    names = [n for n, p in model.named_parameters() if p.requires_grad]
    avg = torch.cat([optimizer_state[n]["exp_avg"].view(-1) for n in names])
    avg_sq = torch.cat([optimizer_state[n]["exp_avg_sq"].view(-1)
                        for n in names])
    avg = avg.to(device)
    avg_sq = avg_sq.to(device)
    return avg, avg_sq

def obtain_gradients_with_adam(model, batch, avg, avg_sq):
    """ obtain gradients with adam optimizer states. """
    beta1 = 0.9
    beta2 = 0.999
    eps = 1e-08

    loss = model(**batch).loss
    loss.backward()

    vectorized_grads = torch.cat(
        [p.grad.view(-1) for n, p in model.named_parameters() if p.grad is not None])

    updated_avg = beta1 * avg + (1 - beta1) * vectorized_grads
    updated_avg_sq = beta2 * avg_sq + (1 - beta2) * vectorized_grads ** 2
    vectorized_grads = updated_avg / torch.sqrt(updated_avg_sq + eps)

    return vectorized_grads
Looking at this, I think in principle this code would work with plain Adam as well, but the gradient values produced by Adam and AdamW would be different.
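For context, here is a rough sketch of how the two paths might be wired together. Only the two obtain_gradients* functions above come from the repo; the driver loop, its name, and its arguments are illustrative assumptions, not the repo's actual code:

import torch

def collect_gradients(model, dataloader, gradient_type, optimizer_state=None, device="cuda"):
    # Hypothetical driver loop around the repo's obtain_gradients* functions.
    all_grads = []
    if gradient_type == "adam":
        # The Adam/AdamW path needs the optimizer's exp_avg / exp_avg_sq states
        avg, avg_sq = prepare_optimizer_state(model, optimizer_state, device)
    for batch in dataloader:
        model.zero_grad()
        if gradient_type == "adam":
            grads = obtain_gradients_with_adam(model, batch, avg, avg_sq)
        else:  # "sgd": raw loss gradients
            grads = obtain_gradients(model, batch)
        all_grads.append(grads.detach().cpu())
    return torch.stack(all_grads)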
]

# Add FSDP config for large models
if model_name_or_path == "meta-llama/Llama-2-13b-hf":
This is the original code from their .sh script, right?
Yes, it seems like models larger than llama-7B would not fit on their GPUs (relatable), so they needed to enable FSDP for the larger experiments.
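For illustration, a sketch of how the FSDP options might be appended to the launch command in the Python interface. The flag names follow the HF Trainer convention (--fsdp / --fsdp_config), and both the cmd variable and the config name are assumptions, not necessarily what this script uses:

# Illustrative only: extend the subprocess argument list with FSDP options
# for the 13B model; flag names and config value are assumptions.
if model_name_or_path == "meta-llama/Llama-2-13b-hf":
    cmd += [
        "--fsdp", "full_shard auto_wrap",
        "--fsdp_config", "llama2_13b_finetune",  # hypothetical FSDP config name
    ]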
Changelog:
Questions for discussion: