Rewrite bash scripts into Python interfaces #2

Open · wants to merge 12 commits into `main`

Changes from 1 commit
README formatting
rinapch committed Dec 9, 2024
commit 7e3534e5295124b501f568f641e83e5fb4f77309
49 changes: 41 additions & 8 deletions README.md
@@ -56,7 +56,12 @@ The checkpoint will be saved in the `out` directory.
Once the initial warmup training stage is completed, we will collect gradients for the entire training dataset. For each checkpoint, our goal is to obtain the gradients of all the training data that we would like to select from. An example script is shown below.

```bash
python3 -m less/scripts/get_info/grad/get_train_lora_grads --train_data_name <str> --train_file <str> --model_path <str> --ckpts <str> --dims <int>
python3 -m less.scripts.get_info.grad.get_train_lora_grads \
--train_data_name <str> \
--train_file <str> \
--model_path <str> \
--ckpts <str> \
--dims <int>
```
Ideally, you would aim to create a datastore that covers the gradients of all the checkpoints and all the training data you wish to select from.
`train_data_name` is the name of the training data, which will be used to store the gradients; it should be descriptive enough for you to easily distinguish between different experiments.
@@ -65,21 +70,31 @@ Ideally, you would aim to create a datastore that encompasses a gradient of all
`ckpts` is the list of checkpoints to compute gradients for, e.g. `105 211 317 420`. The paper recommends using all four checkpoints.
`dims` is the dimension of projection, default is 8192.

The gradients will be saved in the `grads` directory.
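For illustration, a hypothetical invocation might look like the following. The data name and training file are placeholders (here, the Dolly subset used elsewhere in these examples), the model directory follows the `out` layout from the warmup step, and the exact argument format (e.g. whether the checkpoint list is quoted) may need adjusting to the script's interface:

```bash
python3 -m less.scripts.get_info.grad.get_train_lora_grads \
--train_data_name dolly \
--train_file ../data/train/processed/dolly/dolly_data.jsonl \
--model_path llama2-7b-p0.05-lora-seed3 \
--ckpts '105 211 317 420' \
--dims 8192
```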

### Step 3: Selecting data for a task
To select data for a particular downstream task, it's necessary to first prepare data specific to that task, using the same instruction-tuning prompt format as was employed during training. We have set up data loading modules for three evaluation datasets featured in our work: BBH, TydiQA, and MMLU. If you're interested in data selection for additional tasks, you can expand the [`less/data_selection/get_validation_dataset.py`](less/data_selection/get_validation_dataset.py) script to accommodate those tasks. Similar to obtaining gradients for training data, run the following script. The primary difference is that this process will yield SGD gradients for the validation data, following the formulation of the influence estimation.

You should obtain the gradients of the validation data for all the checkpoints you used for building the gradient datastore in the previous step.

```bash
python3 -m less/scripts/get_info/grad/get_eval_lora_grads --task <str> --data_dir <str> --val_task_load_method <str> --model_path <str> --ckpts <str> --dims <int>
python3 -m less.scripts.get_info.grad.get_eval_lora_grads \
--task <str> \
--data_dir <str> \
--val_task_load_method <str> \
--model_path <str> \
--ckpts <str> \
--dims <int>
```
`task` is the name of the task, which will be used to store the gradients.
`data_dir` is the path to the data directory. If you are using one of the predefined datasets (`bbh`, `tydiqa`, `mmlu`), it should point to the directory containing that dataset. If you are using your own custom dataset, it should be a full path to a JSONL file or a HF repo name.
`val_task_load_method` is the method used to load the validation data; it can be `hf`, `local_hf`, or `local_json`. You should specify this if you are using your own custom dataset. The default is `None`, in which case it is assumed that you are using one of the predefined datasets.
`model_path` is the path to the model in the `out` directory, e.g. `llama2-7b-p0.05-lora-seed3`.
`ckpts` is the list of checkpoints to compute gradients for, e.g. `105 211 317 420`.
`ckpts` is the list of checkpoints to compute gradients for, e.g. `'105 211 317 420'`.
`dims` is the dimension of projection, default is 8192.

The gradients will be saved in the `grads` directory.
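As an illustration, a hypothetical invocation for the predefined BBH task might look like this; the data directory is only a placeholder for wherever your BBH data lives, and the argument format (e.g. the quoted checkpoint list) may need adjusting to the script's interface:

```bash
python3 -m less.scripts.get_info.grad.get_eval_lora_grads \
--task bbh \
--data_dir ../data/eval/bbh \
--model_path llama2-7b-p0.05-lora-seed3 \
--ckpts '105 211 317 420' \
--dims 8192
```

Since `bbh` is one of the predefined datasets, `val_task_load_method` is left at its default of `None` here.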

After obtaining the gradients for the validation data, we can then select data for the task. The following script will calculate the influence score for each training data point, and select the top-k data points with the highest influence score.

```bash
@@ -96,17 +111,35 @@ SELECTED_DATA_OUTPUT_PATH="../selected_data"
MODEL_PATH=../out/llama2-7b-p0.05-lora-seed3/checkpoint-${CKPT}

./less/scripts/data_selection/matching.sh "$GRADIENT_PATH" "$TRAIN_FILE_NAMES" "$CKPTS" "$CHECKPOINT_WEIGHTS" "$VALIDATION_GRADIENT_PATH" "$TARGET_TASK_NAMES" "$TARGET_TASK_FILES" "$SELECTED_DATA_OUTPUT_PATH" "$MODEL_PATH"

python3 -m less.data_selection.matching \
--train_file_names <str> \
--ckpts <str> \
--dims <int> \
--checkpoint_weights <str> \
--target_task_names <str> \
--target_task_files <str> \
--val_task_load_method <str> \
--model_path <str>

```
`train_file_names` is a list of the training data names that you used to store the gradients.
`ckpts` is a list of checkpoints, e.g. `105 211 317 420` (space-separated values; the script parses them with `nargs='+'`).
`dims` is the dimension of the projection, default is 8192.
`checkpoint_weights` is a list of the average learning rate of each epoch (check in Wandb), e.g. `1.6877e-05 1.2859e-05 7.7030e-06 2.5616e-06`.
`target_task_names` is a list of the target task names that you used to store the gradients.
`target_task_files` can be a full path or a HF repo name; don't forget to specify `val_task_load_method` accordingly.
`model_path` is the path to the model in the `out` directory, e.g. `llama2-7b-p0.05-lora-seed3`.
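Putting this together, a hypothetical run that scores the Dolly and OASST1 gradient stores against BBH could look like the following; the data names and the task file path are placeholders, while the checkpoint weights reuse the example values above:

```bash
python3 -m less.data_selection.matching \
--train_file_names dolly oasst1 \
--ckpts 105 211 317 420 \
--dims 8192 \
--checkpoint_weights 1.6877e-05 1.2859e-05 7.7030e-06 2.5616e-06 \
--target_task_names bbh \
--target_task_files ../data/eval/bbh \
--model_path llama2-7b-p0.05-lora-seed3
```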

The influence score for each training data point will be saved in the `../selected_data` directory. You can use the following script to select the top-k data points with the highest influence score.

```bash
python3 -m less.data_selection.write_selected_data \
--target_task_names ${TARGET_TASK_NAMES} \
--train_file_names ${TRAIN_FILE_NAMES} \
--train_files ../data/train/processed/dolly/dolly_data.jsonl ../data/train/processed/oasst1/oasst1_data.jsonl \
--output_path $SELECTED_DATA_OUTPUT_PATH \
--percentage 0.05
--target_task_names <str> \
--train_file_names <str> \
--train_files <str> \
--output_path <str> \
--percentage <float>
```
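For example, to keep the top 5% of the Dolly and OASST1 training data for BBH, reusing the paths shown above, you could run:

```bash
python3 -m less.data_selection.write_selected_data \
--target_task_names bbh \
--train_file_names dolly oasst1 \
--train_files ../data/train/processed/dolly/dolly_data.jsonl ../data/train/processed/oasst1/oasst1_data.jsonl \
--output_path ../selected_data \
--percentage 0.05
```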

### Step 4: Train with your selected data
41 changes: 16 additions & 25 deletions less/data_selection/matching.py
@@ -7,25 +7,17 @@

argparser = argparse.ArgumentParser(
description='Script for selecting the data for training')
argparser.add_argument('--gradient_path', type=str, default="{} ckpt{}",
help='The path to the gradient file')
argparser.add_argument('--train_file_names', type=str, nargs='+',
argparser.add_argument('--train_file_names', type=str, nargs='+',
help='The name of the training file')
argparser.add_argument('--ckpts', type=int, nargs='+',
help="Checkpoint numbers.")
argparser.add_argument('--checkpoint_weights', type=float, nargs='+',
help="checkpoint weights")
argparser.add_argument('--target_task_names', type=str,
nargs='+', help="The name of the target tasks")
argparser.add_argument('--target_task_files', type=str, nargs='+',
help='The name of the validation file')
argparser.add_argument('--validation_gradient_path', type=str,
default="{} ckpt{}", help='The path to the validation gradient file')
argparser.add_argument('--output_path', type=str, default="selected_data",
help='The path to the output')
argparser.add_argument('--model_path', type=str, default="Model path (to initialize a tokenizer)",
help='The path to the output')

argparser.add_argument('--ckpts', type=int, nargs='+', help="Checkpoints, e.g. 105 211 317 420")
argparser.add_argument('--dims', default=8192, type=int, required=False, help='Dimension used for grads')
argparser.add_argument('--checkpoint_weights', type=float, nargs='+', help="Average lr of the epoch (check in wandb)")
argparser.add_argument('--target_task_names', type=str,
nargs='+', help="The names of the target tasks")
argparser.add_argument('--target_task_files', type=str, nargs='+',
help='Can be a full path or a HF repo name')
argparser.add_argument('--val_task_load_method', type=str, default=None, help='The method to load the validation data, can be "hf", "local_hf", "local_json"')
argparser.add_argument('--model_path', type=str, required=True, help='Model path, e.g. llama2-7b-p0.05-lora-seed3')

args = argparser.parse_args()

@@ -52,15 +44,15 @@ def calculate_influence_score(training_info: torch.Tensor, validation_info: torc

# calculate the influence score for each validation task
for target_task_name, target_task_file in zip(args.target_task_names, args.target_task_files):
val_dataset = get_dataset(task=target_task_name, data_dir=target_task_file, model_path=args.model_path,)
val_dataset = get_dataset(task=target_task_name, data_dir=target_task_file,
model_path=args.model_path, val_task_load_method=args.val_task_load_method)
num_val_examples = len(val_dataset)
for train_file_name in args.train_file_names:
influence_score = 0
for i, ckpt in enumerate(args.ckpts):
# validation_path = args.validation_gradient_path.format(
# target_task_name, ckpt)
validation_path = args.validation_gradient_path.format(
target_task_name, ckpt)
validation_path = os.path.join("../grads", args.model_path, f"{target_task_name}_ckpt{ckpt}_sgd/dim{args.dims}")
if os.path.isdir(validation_path):
validation_path = os.path.join(validation_path, "all_orig.pt")
validation_info = torch.load(validation_path)
@@ -69,7 +61,7 @@ def calculate_influence_score(training_info: torch.Tensor, validation_info: torc
validation_info = torch.tensor(validation_info)
validation_info = validation_info.to(device).half()
# gradient_path = args.gradient_path.format(train_file_name, ckpt)
gradient_path = args.gradient_path.format(train_file_name, ckpt)
gradient_path = os.path.join("../grads", args.model_path, f"{train_file_name}_ckpt{ckpt}_adam/dim{args.dims}")
if os.path.isdir(gradient_path):
gradient_path = os.path.join(gradient_path, "all_orig.pt")
training_info = torch.load(gradient_path)
@@ -84,10 +76,9 @@ def calculate_influence_score(training_info: torch.Tensor, validation_info: torc
influence_score = influence_score.cpu().reshape(
influence_score.shape[0], num_val_examples, -1
).mean(-1).max(-1)[0]
output_dir = os.path.join(args.output_path, target_task_name)
output_dir = os.path.join("../selected_data", target_task_name)
if not os.path.exists(output_dir):
os.makedirs(output_dir)
output_file = os.path.join(
args.output_path, target_task_name, f"{train_file_name}_influence_score.pt")
output_file = os.path.join(output_dir, f"{train_file_name}_influence_score.pt")
torch.save(influence_score, output_file)
print("Saved influence score to {}".format(output_file))