Rewrite bash scripts into Python interfaces #2

Open · wants to merge 12 commits into base: main · Changes from 11 commits
104 changes: 58 additions & 46 deletions README.md
@@ -27,6 +27,8 @@ pip3 install torch==2.1.2 torchvision torchaudio
```
cd LESS
pip install -r requirement.txt
# or
poetry install
```

**Step 3**: Finally, install the `less` package in editable mode to make it accessible for your development environment:
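A minimal sketch of such an editable install, assuming the standard `pip` workflow (run it from the repository root; the exact command in the collapsed portion of the README may differ):

```bash
# from the LESS repository root
pip install -e .
```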
@@ -44,75 +46,85 @@ We follow the [open-instruct](https://github.com/allenai/open-instruct?tab=readm
To enhance downstream performance from data selection, it's crucial to start with a warmup training step. This involves selecting a small portion of your entire dataset to train using the LoRA method. Follow these steps for effective warmup training:

```bash
DATA_DIR=../data
MODEL_PATH=meta-llama/Llama-2-7b-hf
PERCENTAGE=0.05 # percentage of the full data to train on; you can specify the training file you want to use in the script
DATA_SEED=3
JOB_NAME=llama2-7b-p${PERCENTAGE}-lora-seed${DATA_SEED}

./less/scripts/train/warmup_lora_train.sh "$DATA_DIR" "$MODEL_PATH" "$PERCENTAGE" "$DATA_SEED" "$JOB_NAME"
python3 -m less.scripts.train.warmup_lora_train --train_file <str> --model_name_or_path <str>
```
NB: there are more optional arguments that you can use to alter the training process. Please refer to the script for more details.
You can also set `--percentage` to specify the percentage of data to train on (default is 0.05) and `--data_seed` to specify the seed for data selection (default is 3).
The checkpoint will be saved in the `out` directory.
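For illustration, a warmup run could look like the following sketch; the training file and model name are carried over from the bash example above (adjust them to your own data), and `--percentage` / `--data_seed` simply restate the defaults:

```bash
python3 -m less.scripts.train.warmup_lora_train \
    --train_file ../data/train/processed/dolly/dolly_data.jsonl \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --percentage 0.05 \
    --data_seed 3
```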

### Step 2: Building the gradient datastore
Once the initial warmup training stage is completed, we will collect gradients for the entire training dataset. For each checkpoint, our goal is to obtain the gradients of all the training data that we would like to select from. An example script is shown below.

```bash
CKPT=105

TRAINING_DATA_NAME=dolly
TRAINING_DATA_FILE=../data/train/processed/dolly/dolly_data.jsonl # when changing data name, change the data path accordingly
GRADIENT_TYPE="adam"
MODEL_PATH=../out/llama2-7b-p0.05-lora-seed3/checkpoint-${CKPT}
OUTPUT_PATH=../grads/llama2-7b-p0.05-lora-seed3/${TRAINING_DATA_NAME}-ckpt${CKPT}-${GRADIENT_TYPE}
DIMS="8192"

./less/scripts/get_info/grad/get_train_lora_grads.sh "$TRAINING_DATA_FILE" "$MODEL_PATH" "$OUTPUT_PATH" "$DIMS" "$GRADIENT_TYPE"
python3 -m less.scripts.get_info.grad.get_train_lora_grads \
--train_data_name <str> \
--train_file <str> \
--model_path <str> \
--ckpts <str> \
--dims <int>
```
Ideally, you would aim to create a datastore that encompasses the gradients of all the checkpoints and all the training data from which you wish to choose.
`train_data_name` is the name of the training data, which will be used to store the gradients; it should be descriptive enough for you to easily distinguish between different experiments.
`train_file` is the path to the training file.
`model_path` is the path to the model in the `out` directory, e.g. `llama2-7b-p0.05-lora-seed3`.
`ckpts` is the list of checkpoints to compute gradients for, e.g. `105 211 317 420`. The paper recommends using all four checkpoints.
`dims` is the dimension of projection, default is 8192.

The gradients will be saved in the `grads` directory.
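For illustration, a sketch of computing Adam gradients for the Dolly data over all four warmup checkpoints (names and paths reuse the values from the bash example above):

```bash
python3 -m less.scripts.get_info.grad.get_train_lora_grads \
    --train_data_name dolly \
    --train_file ../data/train/processed/dolly/dolly_data.jsonl \
    --model_path llama2-7b-p0.05-lora-seed3 \
    --ckpts "105 211 317 420" \
    --dims 8192
```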

### Step 3: Selecting data for a task
To select data for a particular downstream task, it's necessary to first prepare data specific to that task, using the same instruction-tuning prompt format as was employed during training. We have set up data loading modules for three evaluation datasets featured in our work: BBH, TydiQA, and MMLU. If you're interested in data selection for additional tasks, you can expand the [`less/data_selection/get_validation_dataset.py`](less/data_selection/get_validation_dataset.py) script to accommodate those tasks. Similar to obtaining gradients for training data, run the following script. The primary difference is that this process will yield SGD gradients for the validation data, following the formulation of the influence estimation.

NB: for your custom datasets, you can provide a full path to the data or a HF repo name in `DATA_DIR`. Don't forget to adjust `less/data_selection/get_validation_dataset.py` to add your task name to the appropriate load method.
You should compute the gradients of the validation data for all the checkpoints you used to build the gradient datastore in the previous step.

```bash
python3 -m less.scripts.get_info.grad.get_eval_lora_grads \
--task <str> \
--data_dir <str> \
--val_task_load_method <str> \
--model_path <str> \
--ckpts <str> \
--dims <int>
```
`task` is the name of the task, which will be used to store the gradients.
`data_dir` is the path to the data directory. If you are using one of the predefined datasets ("bbh", "tydiqa", "mmlu"), this should point to the data directory. If you are using your own custom dataset, this should be a full path to a JSONL file or a HF repo name. We also expect every custom dataset to have a `content` column; if that's not the case, you can change the tokenization function in the `less/data_selection/get_validation_dataset.py` script to encode the data.
`val_task_load_method` is the method used to load the validation data; it can be `hf`, `local_hf`, or `local_json`. You should specify this if you are using your own custom dataset. The default is `None`, in which case it is assumed that you are using the predefined datasets.
`model_path` is the path to the model in the `out` directory, e.g. `llama2-7b-p0.05-lora-seed3`.
`ckpts` is the list of checkpoints to compute gradients for, e.g. `'105 211 317 420'`.
`dims` is the dimension of projection, default is 8192.

```bash
CKPT=105
TASK=tydiqa
MODEL_PATH=../out/llama2-7b-p0.05-lora-seed3/checkpoint-${CKPT}
OUTPUT_PATH=../grads/llama2-7b-p0.05-lora-seed3/${TASK}-ckpt${CKPT}-sgd # for validation data, we always use sgd
DATA_DIR=../data
DIMS="4096 8192" # We use 8192 as our default projection dimension

./less/scripts/get_info/grad/get_eval_lora_grads.sh "$TASK" "$DATA_DIR" "$MODEL_PATH" $OUTPUT_PATH "$DIMS"
```

The gradients will be saved in the `grads` directory.
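With the Python interface, a sketch of the same step for a hypothetical custom validation set; `my_task` and the JSONL path are placeholders, while the model name and checkpoint list reuse the values from the examples above:

```bash
python3 -m less.scripts.get_info.grad.get_eval_lora_grads \
    --task my_task \
    --data_dir path/to/my_task_val.jsonl \
    --val_task_load_method local_json \
    --model_path llama2-7b-p0.05-lora-seed3 \
    --ckpts "105 211 317 420" \
    --dims 8192
```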
After obtaining the gradients for the validation data, we can then select data for the task. The following script will calculate the influence score for each training data point, and select the top-k data points with the highest influence score.

```bash
DIM=8192 # decide which dimension to use
GRADIENT_PATH=../grads/llama2-7b-p0.05-lora-seed3/{}-ckpt{}-adam/dim${DIM}
TRAIN_FILE_NAMES="flan_v2 cot dolly oasst1"
CKPTS="105 211 317 420" # checkpoing index
CHECKPOINT_WEIGHTS="1.6877e-05 1.2859e-05 7.7030e-06 2.5616e-06" # average lr of the epoch

VALIDATION_GRADIENT_PATH=../grads/llama2-7b-p0.05-lora-seed3/{}-ckpt{}-sgd/dim${DIM}
TARGET_TASK_NAMES="tydiqa"
TARGET_TASK_FILES="..."
SELECTED_DATA_OUTPUT_PATH="../selected_data"
MODEL_PATH=../out/llama2-7b-p0.05-lora-seed3/checkpoint-${CKPT}

./less/scripts/data_selection/matching.sh "$GRADIENT_PATH" "$TRAIN_FILE_NAMES" "$CKPTS" "$CHECKPOINT_WEIGHTS" "$VALIDATION_GRADIENT_PATH" "$TARGET_TASK_NAMES" "$TARGET_TASK_FILES" "$SELECTED_DATA_OUTPUT_PATH" "$MODEL_PATH"
python3 -m less.data_selection.matching \
--train_file_names <str> \
--ckpts <str> \
--dims <int> \
--checkpoint_weights <str> \
--target_task_names <str> \
--target_task_files <str> \
--val_task_load_method <str> \
--model_path <str>
```
`train_file_names` is a list of training data names that you used to store the gradients.
`ckpts` is a list of checkpoints, e.g. `'105 211 317 420'`.
`dims` is the dimension of projection, default is 8192.
`checkpoint_weights` is a list of the average learning rate for each epoch (check in Wandb), e.g. `'1.6877e-05 1.2859e-05 7.7030e-06 2.5616e-06'`.
`target_task_names` is a list of target task names that you used to store the gradients.
`target_task_files` can be a full path or a HF repo name, don't forget to specify the `val_task_load_method` accordingly.
`model_path` is the path to the model in the `out` directory, e.g. `llama2-7b-p0.05-lora-seed3`.
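Putting the pieces together, a sketch of the matching step for the same hypothetical custom task (`my_task` and its JSONL path are placeholders; the training-data names, checkpoints, and weights mirror the bash example above):

```bash
python3 -m less.data_selection.matching \
    --train_file_names flan_v2 cot dolly oasst1 \
    --ckpts "105 211 317 420" \
    --dims 8192 \
    --checkpoint_weights "1.6877e-05 1.2859e-05 7.7030e-06 2.5616e-06" \
    --target_task_names my_task \
    --target_task_files path/to/my_task_val.jsonl \
    --val_task_load_method local_json \
    --model_path llama2-7b-p0.05-lora-seed3
```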

The influence score for each training data point will be saved in the `OUTPUT_PATH` directory. You can use the following script to select the top-k data points with the highest influence score.

```bash
python3 -m less.data_selection.write_selected_data \
--target_task_names ${TARGET_TASK_NAMES} \
--train_file_names ${TRAIN_FILE_NAMES} \
--train_files ../data/train/processed/dolly/dolly_data.jsonl ../data/train/processed/oasst1/oasst1_data.jsonl \
--output_path $SELECTED_DATA_OUTPUT_PATH \
--percentage 0.05
--target_task_names <str> \
--train_file_names <str> \
--train_files <str> \
--output_path <str> \
--percentage <float>
```
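For example, selecting the top 5% of the Dolly and OASST1 pools for the same hypothetical task could look like this sketch (file paths reuse the values from the earlier bash lines; `my_task` remains a placeholder):

```bash
python3 -m less.data_selection.write_selected_data \
    --target_task_names my_task \
    --train_file_names dolly oasst1 \
    --train_files ../data/train/processed/dolly/dolly_data.jsonl ../data/train/processed/oasst1/oasst1_data.jsonl \
    --output_path ../selected_data \
    --percentage 0.05
```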

### Step 4: Train with your selected data
3 changes: 3 additions & 0 deletions less/data_selection/get_info.py
@@ -113,6 +113,8 @@ def load_adam_state(model, optimizer_state_path):
default="tulu", help="The chat format")
parser.add_argument("--use_chat_format", type=bool,
default=True, help="Whether to use chat format")
parser.add_argument("--val_task_load_method", type=str,
default=None, help="The method to load the validation data, can be 'hf', 'local_hf', 'local_json'")
parser.add_argument("--max_length", type=int, default=2048,
help="The maximum length")
parser.add_argument("--zh", default=False, action="store_true",
@@ -169,6 +171,7 @@ def load_adam_state(model, optimizer_state_path):
if args.task is not None:
dataset = get_dataset(args.task,
data_dir=args.data_dir,
val_task_load_method=args.val_task_load_method,
tokenizer=tokenizer,
chat_format=args.chat_format,
use_chat_format=args.use_chat_format,
11 changes: 2 additions & 9 deletions less/data_selection/get_validation_dataset.py
@@ -447,22 +447,15 @@ def get_dataset(task, **kwargs):
if tokenizer.pad_token is None:
tokenizer.add_special_tokens({"pad_token": "<pad>"})
kwargs["tokenizer"] = tokenizer
else:
raise ValueError("No tokenizer found")

if task == "bbh":
return get_bbh_dataset(**kwargs)
elif task == "tydiqa":
return get_tydiqa_dataset(**kwargs)
elif task == "mmlu":
return get_mmlu_dataset(**kwargs)

elif task in ["kstack_clean"]:
return get_custom_dataset(**kwargs, load_method="hf")
elif task in ["kstack"]:
return get_custom_dataset(**kwargs, load_method="local_json")
elif task in ["golden_repos", "lca_no_context"]:
return get_custom_dataset(**kwargs, load_method="local_hf")
elif "val_task_load_method" in kwargs and kwargs["val_task_load_method"] is not None:
return get_custom_dataset(**kwargs, load_method=kwargs["val_task_load_method"])
else:
raise ValueError("Invalid task name")

46 changes: 20 additions & 26 deletions less/data_selection/matching.py
@@ -7,25 +7,17 @@

argparser = argparse.ArgumentParser(
description='Script for selecting the data for training')
argparser.add_argument('--gradient_path', type=str, default="{} ckpt{}",
help='The path to the gradient file')
argparser.add_argument('--train_file_names', type=str, nargs='+',
help='The name of the training file')
argparser.add_argument('--ckpts', type=int, nargs='+',
help="Checkpoint numbers.")
argparser.add_argument('--checkpoint_weights', type=float, nargs='+',
help="checkpoint weights")
argparser.add_argument('--ckpts', type=str, help="Checkpoints, e.g. '105 211 317 420'")
argparser.add_argument('--dims', default=8192, type=int, required=False, help='Dimension used for grads')
argparser.add_argument('--checkpoint_weights', type=str, help="Average lr of the epoch (check in wandb)")
argparser.add_argument('--target_task_names', type=str,
nargs='+', help="The name of the target tasks")
nargs='+', help="The name of the target task(s)")
argparser.add_argument('--target_task_files', type=str, nargs='+',
help='The name of the validation file')
argparser.add_argument('--validation_gradient_path', type=str,
default="{} ckpt{}", help='The path to the validation gradient file')
argparser.add_argument('--output_path', type=str, default="selected_data",
help='The path to the output')
argparser.add_argument('--model_path', type=str, default="Model path (to initialize a tokenizer)",
help='The path to the output')

help='Can be a full path or a HF repo name')
argparser.add_argument('--val_task_load_method', type=str, default=None, help='The method to load the validation data, can be "hf", "local_hf", "local_json"')
argparser.add_argument('--model_path', type=str, required=True, help='Model path, e.g. llama2-7b-p0.05-lora-seed3')

args = argparser.parse_args()

@@ -44,23 +36,26 @@ def calculate_influence_score(training_info: torch.Tensor, validation_info: torc
training_info, validation_info.transpose(0, 1))
return influence_scores

checkpoints = [int(i) for i in args.ckpts.split(" ")]

checkpoint_weights = [float(i) for i in args.checkpoint_weights.split(" ")]
# renormalize the checkpoint weights
if sum(args.checkpoint_weights) != 1:
s = sum(args.checkpoint_weights)
args.checkpoint_weights = [i/s for i in args.checkpoint_weights]
if sum(checkpoint_weights) != 1:
s = sum(checkpoint_weights)
args.checkpoint_weights = [i/s for i in checkpoint_weights]

# calculate the influence score for each validation task
for target_task_name, target_task_file in zip(args.target_task_names, args.target_task_files):
val_dataset = get_dataset(task=target_task_name, data_dir=target_task_file, model_path=args.model_path,)
model_path = os.path.join("../out", args.model_path, f"checkpoint-{checkpoints[0]}")
val_dataset = get_dataset(task=target_task_name, data_dir=target_task_file,
model_path=model_path, val_task_load_method=args.val_task_load_method)
num_val_examples = len(val_dataset)
for train_file_name in args.train_file_names:
influence_score = 0
for i, ckpt in enumerate(args.ckpts):
for i, ckpt in enumerate(checkpoints):
# validation_path = args.validation_gradient_path.format(
# target_task_name, ckpt)
validation_path = args.validation_gradient_path.format(
target_task_name, ckpt)
validation_path = os.path.join("../grads", args.model_path, f"{target_task_name}-ckpt{ckpt}-sgd/dim{args.dims}")
if os.path.isdir(validation_path):
validation_path = os.path.join(validation_path, "all_orig.pt")
validation_info = torch.load(validation_path)
@@ -69,7 +64,7 @@ def calculate_influence_score(training_info: torch.Tensor, validation_info: torc
validation_info = torch.tensor(validation_info)
validation_info = validation_info.to(device).half()
# gradient_path = args.gradient_path.format(train_file_name, ckpt)
gradient_path = args.gradient_path.format(train_file_name, ckpt)
gradient_path = os.path.join("../grads", args.model_path, f"{train_file_name}-ckpt{ckpt}-adam/dim{args.dims}")
if os.path.isdir(gradient_path):
gradient_path = os.path.join(gradient_path, "all_orig.pt")
training_info = torch.load(gradient_path)
@@ -84,10 +79,9 @@ def calculate_influence_score(training_info: torch.Tensor, validation_info: torc
influence_score = influence_score.cpu().reshape(
influence_score.shape[0], num_val_examples, -1
).mean(-1).max(-1)[0]
output_dir = os.path.join(args.output_path, target_task_name)
output_dir = os.path.join("../selected_data", target_task_name)
if not os.path.exists(output_dir):
os.makedirs(output_dir)
output_file = os.path.join(
args.output_path, target_task_name, f"{train_file_name}_influence_score.pt")
output_file = os.path.join(output_dir, f"{train_file_name}_influence_score.pt")
torch.save(influence_score, output_file)
print("Saved influence score to {}".format(output_file))
2 changes: 1 addition & 1 deletion less/data_selection/write_selected_data.py
@@ -14,7 +14,7 @@ def parse_args():
argparser.add_argument('--target_task_names', type=str,
nargs='+', help='The name of the target task')
argparser.add_argument('--output_path', type=str,
default="selected_data", help='The path to the output')
default="../selected_data", help='The path to the output')
argparser.add_argument('--max_samples', type=int,
default=None, help='The maximum number of samples')
argparser.add_argument('--percentage', type=float, default=None,
42 changes: 42 additions & 0 deletions less/scripts/get_info/grad/get_eval_lora_grads.py
@@ -0,0 +1,42 @@
import os
import argparse
import subprocess

def main():
parser = argparse.ArgumentParser()
parser.add_argument("--task", type=str, required=True, help="Task name (e.g. tydiqa, mmlu), will be used to store the gradients")
parser.add_argument("--data_dir", type=str, required=True, help="Path to data directory, can also be a full path or a HF repo name")
parser.add_argument("--val_task_load_method", default=None,type=str, required=False, help="The method to load the validation data, can be 'hf', 'local_hf', 'local_json'")
**Collaborator:** Shouldn't this one be required? Looks like with `None` the script will fail at the dataset loading stage.

**Author:** You're right, with `None` it would fail (fixed it). I made it optional before because for hardcoded initial datasets like tydiqa we don't need to specify it, but realistically we will only run our own datasets, so it's required now.

parser.add_argument("--model_path", type=str, required=True, help="Path to model in the `out` directory, e.g. 'llama2-7b-p0.05-lora-seed3'")
parser.add_argument("--ckpts", type=str, required=True, help="List of checkpoints to compute gradients for, e.g. '105 211 317 420'")
parser.add_argument("--dims", default=8192, type=int, required=False, help="Dimension of projection")
args = parser.parse_args()

# Split checkpoint string into list of ints
ckpts = [int(x) for x in args.ckpts.split()]

# Process each checkpoint
for ckpt in ckpts:
# Create output directory if it doesn't exist
model = os.path.join("../out", args.model_path, f"checkpoint-{ckpt}")

# Construct output path with checkpoint
output_path = os.path.join("../grads", args.model_path, f"{args.task}-ckpt{ckpt}-sgd")

# Build command
cmd = [
"python3", "-m", "less.data_selection.get_info",
"--task", args.task,
"--info_type", "grads",
"--model_path", model,
"--output_path", output_path,
"--gradient_projection_dimension", str(args.dims),
"--gradient_type", "sgd",
"--data_dir", args.data_dir,
"--val_task_load_method", args.val_task_load_method
]

subprocess.run(cmd)

if __name__ == "__main__":
main()