Rewrite bash scripts into Python interfaces #2

Open
wants to merge 12 commits into base: main

Conversation


@rinapch commented Dec 9, 2024

Changelog:

  • Rewrite the 4 main .sh scripts needed for experiment reproduction in Python
  • Switch the project to Poetry
  • Remove the hardcoding from the get_dataset function; now you only need to provide the dataset name and the load method (see the sketch below)
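
As an illustration of the new interface, here is a rough sketch; the function signature and branching are my assumptions, but the load-method names mirror the ones used elsewhere in this PR ('hf', 'local_hf', 'local_json'):

import json
from datasets import load_dataset, load_from_disk

# Hypothetical sketch of the de-hardcoded loader; names and branches are illustrative.
def get_dataset(name: str, load_method: str):
    """Load a dataset by name using 'hf', 'local_hf' or 'local_json'."""
    if load_method == "hf":
        return load_dataset(name)        # pull from the Hugging Face Hub
    if load_method == "local_hf":
        return load_from_disk(name)      # a dataset saved with Dataset.save_to_disk
    if load_method == "local_json":
        with open(name) as f:
            return json.load(f)          # plain JSON file on disk
    raise ValueError(f"Unknown load method: {load_method}")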

Questions for discussion:

  • The main role of the 2 bash scripts is just to set up some variable names and run a Python subprocess with the specified variables. We could call the Python entry point directly instead, but with the trade-off that a lot of the variables would have to be specified by hand in the CLI (see the sketch after this list)
  • Should I also convert the training script for LoRA (to get the final model finetuned on the selected data)? I didn't port it since we have our own training script, but it may be useful?
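
For illustration, a minimal sketch of such a thin wrapper; the module path, flag names, and defaults below are placeholders I made up, not the actual ones in this PR:

import argparse
import subprocess
import sys

# Hypothetical thin wrapper: expose only a couple of flags, fill in the rest,
# then shell out to the real entry point exactly like the old .sh script did.
parser = argparse.ArgumentParser()
parser.add_argument("--task", type=str, required=True)
parser.add_argument("--output_path", type=str, default="grads")  # placeholder default
args = parser.parse_args()

cmd = [
    sys.executable, "-m", "less.data_selection.get_info",  # assumed module path
    "--task", args.task,
    "--output_path", args.output_path,
    "--gradient_type", "adam",  # fixed by the wrapper rather than typed by hand
]
subprocess.run(cmd, check=True)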

@rinapch changed the title from "[WIP] Rewrite bash scripts into Python interfaces" to "Rewrite bash scripts into Python interfaces" on Dec 11, 2024
less/data_selection/matching.py (resolved)
parser = argparse.ArgumentParser()
parser.add_argument("--task", type=str, required=True, help="Task name (e.g. tydiqa, mmlu), will be used to store the gradients")
parser.add_argument("--data_dir", type=str, required=True, help="Path to data directory, can also be a full path or a HF repo name")
parser.add_argument("--val_task_load_method", default=None,type=str, required=False, help="The method to load the validation data, can be 'hf', 'local_hf', 'local_json'")
Collaborator

Shouldn't this one be required? Looks like with None the script will fail at the dataset loading stage

Author

You're right, with None it would fail (fixed it). I made it optional before because for the hardcoded initial datasets like tydiqa we don't need to specify it, but realistically we will only run our own datasets, so it's required now
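
For the record, the fixed argument would look roughly like this; the choices list mirrors the help text above, though the exact form in the PR may differ:

parser.add_argument(
    "--val_task_load_method",
    type=str,
    required=True,  # was default=None / required=False, which broke dataset loading
    choices=["hf", "local_hf", "local_json"],
    help="The method to load the validation data",
)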

"--model_path", model_checkpoint_path,
"--output_path", output_path,
"--gradient_projection_dimension", str(args.dims),
"--gradient_type", "adam"
Collaborator

Sanity check question: Adam and AdamW provide the same type of gradient data, am I right?

Author

@rinapch Dec 12, 2024

In this repo in particular, AdamW is used across all computations (despite this variable being just "adam") -- it's used for warming up the model and for computing the gradients of the train data. For the gradients of the eval data, gradient_type is set to "sgd"

Here is the difference in computation:

  • SGD:

def obtain_gradients(model, batch):
    """ obtain gradients. """
    loss = model(**batch).loss
    loss.backward()
    vectorized_grads = torch.cat(
        [p.grad.view(-1) for p in model.parameters() if p.grad is not None])
    return vectorized_grads

  • Adam(W):

def prepare_optimizer_state(model, optimizer_state, device):
    names = [n for n, p in model.named_parameters() if p.requires_grad]
    avg = torch.cat([optimizer_state[n]["exp_avg"].view(-1) for n in names])
    avg_sq = torch.cat([optimizer_state[n]["exp_avg_sq"].view(-1)
                       for n in names])
    avg = avg.to(device)
    avg_sq = avg_sq.to(device)
    return avg, avg_sq


def obtain_gradients_with_adam(model, batch, avg, avg_sq):
    """ obtain gradients with adam optimizer states. """
    beta1 = 0.9
    beta2 = 0.999
    eps = 1e-08

    loss = model(**batch).loss
    loss.backward()

    vectorized_grads = torch.cat(
        [p.grad.view(-1) for n, p in model.named_parameters() if p.grad is not None])

    updated_avg = beta1 * avg + (1 - beta1) * vectorized_grads
    updated_avg_sq = beta2 * avg_sq + (1 - beta2) * vectorized_grads ** 2
    vectorized_grads = updated_avg / torch.sqrt(updated_avg_sq + eps)

    return vectorized_grads

Looking at this, I think in principle this code would work with plain Adam too, but the gradient values produced by Adam and AdamW would be different
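
For context, here is roughly how the two Adam(W) pieces above fit together; the checkpoint file name, the state-dict layout, and the model/batch setup are assumptions on my part, not the repo's actual code:

import torch

# Hypothetical glue code (model and batch are assumed to be set up as usual):
# load the warmed-up checkpoint's Adam(W) moments once, then reuse them to turn
# each raw gradient into the Adam-preconditioned direction.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed layout: a dict keyed by parameter name with "exp_avg" / "exp_avg_sq"
# entries, matching what prepare_optimizer_state() above expects.
optimizer_state = torch.load("optimizer.pt", map_location="cpu")

avg, avg_sq = prepare_optimizer_state(model, optimizer_state, device)

model.zero_grad()  # clear any stale .grad before the backward pass inside the helper
adam_grads = obtain_gradients_with_adam(model, batch, avg, avg_sq)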

]

# Add FSDP config for large models
if model_name_or_path == "meta-llama/Llama-2-13b-hf":
Collaborator

This is the original code from their .sh script, right?

Author

@rinapch Dec 12, 2024

Yes, it seems like models larger than Llama-7B would not fit on their GPUs (relatable), so they needed to enable FSDP for the larger experiments
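
A rough sketch of what that branch might append, assuming the Hugging Face Trainer's --fsdp CLI flag; the exact flags and the cmd variable name are my assumptions, not necessarily what the original script passes:

# Hypothetical: extend the training subprocess command with FSDP flags
# so the 13B model shards across GPUs instead of overflowing a single device.
if model_name_or_path == "meta-llama/Llama-2-13b-hf":
    cmd += [
        "--fsdp", "full_shard auto_wrap",  # HF Trainer's built-in FSDP mode
        "--bf16", "True",                  # assumption: bf16 to further reduce memory
    ]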
