Rewrite bash scripts into Python interfaces #2

Open · wants to merge 12 commits into base: main · Changes from 11 commits
104 changes: 58 additions & 46 deletions README.md
@@ -27,6 +27,8 @@ pip3 install torch==2.1.2 torchvision torchaudio
```
cd LESS
pip install -r requirement.txt
# or
poetry install
```

**Step 3**: Finally, install the `less` package in editable mode to make it accessible for your development environment:
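A minimal sketch of such an editable install, assuming the standard `pip` workflow (run it from the repository root; the exact command in the collapsed portion of the README may differ):

```bash
# from the LESS repository root
pip install -e .
```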
@@ -44,75 +46,85 @@ We follow the [open-instruct](https://github.com/allenai/open-instruct?tab=readm
To enhance downstream performance from data selection, it's crucial to start with a warmup training step. This involves selecting a small portion of your entire dataset to train using the LoRA method. Follow these steps for effective warmup training:

```bash
DATA_DIR=../data
MODEL_PATH=meta-llama/Llama-2-7b-hf
PERCENTAGE=0.05 # percentage of the full data to train on; you can specify the training file you want to use in the script
DATA_SEED=3
JOB_NAME=llama2-7b-p${PERCENTAGE}-lora-seed${DATA_SEED}

./less/scripts/train/warmup_lora_train.sh "$DATA_DIR" "$MODEL_PATH" "$PERCENTAGE" "$DATA_SEED" "$JOB_NAME"
python3 -m less.scripts.train.warmup_lora_train --train_file <str> --model_name_or_path <str>
```
NB: there are more optional arguments that you can use to alter the training process. Please refer to the script for more details.
You can also set `--percentage` to specify the percentage of data to train on (default is 0.05) and `--data_seed` to specify the seed for data selection (default is 3).
The checkpoint will be saved in the `out` directory.
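For illustration, a warmup run could look like the following sketch; the training file and model name are carried over from the bash example above (adjust them to your own data), and `--percentage` / `--data_seed` simply restate the defaults:

```bash
python3 -m less.scripts.train.warmup_lora_train \
    --train_file ../data/train/processed/dolly/dolly_data.jsonl \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --percentage 0.05 \
    --data_seed 3
```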

### Step 2: Building the gradient datastore
Once the initial warmup training stage is completed, we will collect gradients for the entire training dataset. For each checkpoint, our goal is to obtain the gradients of all the training data that we would like to select from. An example script is shown below.

```bash
CKPT=105

TRAINING_DATA_NAME=dolly
TRAINING_DATA_FILE=../data/train/processed/dolly/dolly_data.jsonl # when changing data name, change the data path accordingly
GRADIENT_TYPE="adam"
MODEL_PATH=../out/llama2-7b-p0.05-lora-seed3/checkpoint-${CKPT}
OUTPUT_PATH=../grads/llama2-7b-p0.05-lora-seed3/${TRAINING_DATA_NAME}-ckpt${CKPT}-${GRADIENT_TYPE}
DIMS="8192"

./less/scripts/get_info/grad/get_train_lora_grads.sh "$TRAINING_DATA_FILE" "$MODEL_PATH" "$OUTPUT_PATH" "$DIMS" "$GRADIENT_TYPE"
python3 -m less.scripts.get_info.grad.get_train_lora_grads \
--train_data_name <str> \
--train_file <str> \
--model_path <str> \
--ckpts <str> \
--dims <int>
```
Ideally, you would aim to create a datastore that encompasses the gradients of all the checkpoints and all the training data from which you wish to choose.
`train_data_name` is the name of the training data, which will be used to store the gradients; it should be descriptive enough for you to easily distinguish between different experiments.
`train_file` is the path to the training file.
`model_path` is the path to the model in the `out` directory, e.g. `llama2-7b-p0.05-lora-seed3`.
`ckpts` is the list of checkpoints to compute gradients for, e.g. `105 211 317 420`. The paper recommends using all four checkpoints.
`dims` is the dimension of projection, default is 8192.

The gradients will be saved in the `grads` directory.
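For illustration, a sketch of computing Adam gradients for the Dolly data over all four warmup checkpoints (names and paths reuse the values from the bash example above):

```bash
python3 -m less.scripts.get_info.grad.get_train_lora_grads \
    --train_data_name dolly \
    --train_file ../data/train/processed/dolly/dolly_data.jsonl \
    --model_path llama2-7b-p0.05-lora-seed3 \
    --ckpts "105 211 317 420" \
    --dims 8192
```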

### Step 3: Selecting data for a task
To select data for a particular downstream task, it's necessary to first prepare data specific to that task, using the same instruction-tuning prompt format as was employed during training. We have set up data loading modules for three evaluation datasets featured in our work: BBH, TydiQA, and MMLU. If you're interested in data selection for additional tasks, you can expand the [`less/data_selection/get_validation_dataset.py`](less/data_selection/get_validation_dataset.py) script to accommodate those tasks. Similar to obtaining gradients for training data, run the following script. The primary difference is that this process will yield SGD gradients for the validation data, following the formulation of the influence estimation.

NB: for your custom datasets, you can provide a full path to the data or a HF repo name in `DATA_DIR`. Don't forget to adjust `less/data_selection/get_validation_dataset.py` to add your task name to the appropriate load method.
You should compute the gradients of the validation data for all the checkpoints you used to build the gradient datastore in the previous step.

```bash
python3 -m less.scripts.get_info.grad.get_eval_lora_grads \
--task <str> \
--data_dir <str> \
--val_task_load_method <str> \
--model_path <str> \
--ckpts <str> \
--dims <int>
```
`task` is the name of the task, which will be used to store the gradients.
`data_dir` is the path to the data directory. If you are using one of the predefined datasets ("bbh", "tydiqa", "mmlu"), this should point to the data directory. If you are using your own custom dataset, this should be a full path to a JSONL file or a HF repo name. We also expect every custom dataset to have a `content` column; if that's not the case, you can change the tokenization function in the `less/data_selection/get_validation_dataset.py` script to encode the data.
`val_task_load_method` is the method used to load the validation data; it can be `hf`, `local_hf`, or `local_json`. You should specify this if you are using your own custom dataset. The default is `None`, in which case it is assumed that you are using the predefined datasets.
`model_path` is the path to the model in the `out` directory, e.g. `llama2-7b-p0.05-lora-seed3`.
`ckpts` is the list of checkpoints to compute gradients for, e.g. `'105 211 317 420'`.
`dims` is the dimension of projection, default is 8192.

```bash
CKPT=105
TASK=tydiqa
MODEL_PATH=../out/llama2-7b-p0.05-lora-seed3/checkpoint-${CKPT}
OUTPUT_PATH=../grads/llama2-7b-p0.05-lora-seed3/${TASK}-ckpt${CKPT}-sgd # for validation data, we always use sgd
DATA_DIR=../data
DIMS="4096 8192" # We use 8192 as our default projection dimension

./less/scripts/get_info/grad/get_eval_lora_grads.sh "$TASK" "$DATA_DIR" "$MODEL_PATH" $OUTPUT_PATH "$DIMS"
```

The gradients will be saved in the `grads` directory.
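With the Python interface, a sketch of the same step for a hypothetical custom validation set; `my_task` and the JSONL path are placeholders, while the model name and checkpoint list reuse the values from the examples above:

```bash
python3 -m less.scripts.get_info.grad.get_eval_lora_grads \
    --task my_task \
    --data_dir path/to/my_task_val.jsonl \
    --val_task_load_method local_json \
    --model_path llama2-7b-p0.05-lora-seed3 \
    --ckpts "105 211 317 420" \
    --dims 8192
```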
After obtaining the gradients for the validation data, we can then select data for the task. The following script will calculate the influence score for each training data point, and select the top-k data points with the highest influence score.

```bash
DIM=8192 # decide which dimension to use
GRADIENT_PATH=../grads/llama2-7b-p0.05-lora-seed3/{}-ckpt{}-adam/dim${DIM}
TRAIN_FILE_NAMES="flan_v2 cot dolly oasst1"
CKPTS="105 211 317 420" # checkpoing index
CHECKPOINT_WEIGHTS="1.6877e-05 1.2859e-05 7.7030e-06 2.5616e-06" # average lr of the epoch

VALIDATION_GRADIENT_PATH=../grads/llama2-7b-p0.05-lora-seed3/{}-ckpt{}-sgd/dim${DIM}
TARGET_TASK_NAMES="tydiqa"
TARGET_TASK_FILES="..."
SELECTED_DATA_OUTPUT_PATH="../selected_data"
MODEL_PATH=../out/llama2-7b-p0.05-lora-seed3/checkpoint-${CKPT}

./less/scripts/data_selection/matching.sh "$GRADIENT_PATH" "$TRAIN_FILE_NAMES" "$CKPTS" "$CHECKPOINT_WEIGHTS" "$VALIDATION_GRADIENT_PATH" "$TARGET_TASK_NAMES" "$TARGET_TASK_FILES" "$SELECTED_DATA_OUTPUT_PATH" "$MODEL_PATH"
python3 -m less.data_selection.matching \
--train_file_names <str> \
--ckpts <str> \
--dims <int> \
--checkpoint_weights <str> \
--target_task_names <str> \
--target_task_files <str> \
--val_task_load_method <str> \
--model_path <str>
```
`train_file_names` is a list of training data names that you used to store the gradients.
`ckpts` is a list of checkpoints, e.g. `'105 211 317 420'`.
`dims` is the dimension of projection, default is 8192.
`checkpoint_weights` is a list of the average learning rate for each epoch (check in Wandb), e.g. `'1.6877e-05 1.2859e-05 7.7030e-06 2.5616e-06'`.
`target_task_names` is a list of target task names that you used to store the gradients.
`target_task_files` can be a full path or a HF repo name, don't forget to specify the `val_task_load_method` accordingly.
`model_path` is the path to the model in the `out` directory, e.g. `llama2-7b-p0.05-lora-seed3`.
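Putting the pieces together, a sketch of the matching step for the same hypothetical custom task (`my_task` and its JSONL path are placeholders; the training-data names, checkpoints, and weights mirror the bash example above):

```bash
python3 -m less.data_selection.matching \
    --train_file_names flan_v2 cot dolly oasst1 \
    --ckpts "105 211 317 420" \
    --dims 8192 \
    --checkpoint_weights "1.6877e-05 1.2859e-05 7.7030e-06 2.5616e-06" \
    --target_task_names my_task \
    --target_task_files path/to/my_task_val.jsonl \
    --val_task_load_method local_json \
    --model_path llama2-7b-p0.05-lora-seed3
```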

The influence score for each training data point will be saved in the `OUTPUT_PATH` directory. You can use the following script to select the top-k data points with the highest influence score.

```bash
python3 -m less.data_selection.write_selected_data \
--target_task_names ${TARGET_TASK_NAMES} \
--train_file_names ${TRAIN_FILE_NAMES} \
--train_files ../data/train/processed/dolly/dolly_data.jsonl ../data/train/processed/oasst1/oasst1_data.jsonl \
--output_path $SELECTED_DATA_OUTPUT_PATH \
--percentage 0.05
--target_task_names <str> \
--train_file_names <str> \
--train_files <str> \
--output_path <str> \
--percentage <float>
```
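For example, selecting the top 5% of the Dolly and OASST1 pools for the same hypothetical task could look like this sketch (file paths reuse the values from the earlier bash lines; `my_task` remains a placeholder):

```bash
python3 -m less.data_selection.write_selected_data \
    --target_task_names my_task \
    --train_file_names dolly oasst1 \
    --train_files ../data/train/processed/dolly/dolly_data.jsonl ../data/train/processed/oasst1/oasst1_data.jsonl \
    --output_path ../selected_data \
    --percentage 0.05
```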

### Step 4: Train with your selected data
3 changes: 3 additions & 0 deletions less/data_selection/get_info.py
@@ -113,6 +113,8 @@ def load_adam_state(model, optimizer_state_path):
default="tulu", help="The chat format")
parser.add_argument("--use_chat_format", type=bool,
default=True, help="Whether to use chat format")
parser.add_argument("--val_task_load_method", type=str,
default=None, help="The method to load the validation data, can be 'hf', 'local_hf', 'local_json'")
parser.add_argument("--max_length", type=int, default=2048,
help="The maximum length")
parser.add_argument("--zh", default=False, action="store_true",
@@ -169,6 +171,7 @@ def load_adam_state(model, optimizer_state_path):
if args.task is not None:
dataset = get_dataset(args.task,
data_dir=args.data_dir,
val_task_load_method=args.val_task_load_method,
tokenizer=tokenizer,
chat_format=args.chat_format,
use_chat_format=args.use_chat_format,
11 changes: 2 additions & 9 deletions less/data_selection/get_validation_dataset.py
@@ -447,22 +447,15 @@ def get_dataset(task, **kwargs):
if tokenizer.pad_token is None:
tokenizer.add_special_tokens({"pad_token": "<pad>"})
kwargs["tokenizer"] = tokenizer
else:
raise ValueError("No tokenizer found")

if task == "bbh":
return get_bbh_dataset(**kwargs)
elif task == "tydiqa":
return get_tydiqa_dataset(**kwargs)
elif task == "mmlu":
return get_mmlu_dataset(**kwargs)

elif task in ["kstack_clean"]:
return get_custom_dataset(**kwargs, load_method="hf")
elif task in ["kstack"]:
return get_custom_dataset(**kwargs, load_method="local_json")
elif task in ["golden_repos", "lca_no_context"]:
return get_custom_dataset(**kwargs, load_method="local_hf")
elif "val_task_load_method" in kwargs and kwargs["val_task_load_method"] is not None:
return get_custom_dataset(**kwargs, load_method=kwargs["val_task_load_method"])
else:
raise ValueError("Invalid task name")

46 changes: 20 additions & 26 deletions less/data_selection/matching.py
@@ -7,25 +7,17 @@

argparser = argparse.ArgumentParser(
description='Script for selecting the data for training')
argparser.add_argument('--gradient_path', type=str, default="{} ckpt{}",
help='The path to the gradient file')
argparser.add_argument('--train_file_names', type=str, nargs='+',
help='The name of the training file')
argparser.add_argument('--ckpts', type=int, nargs='+',
help="Checkpoint numbers.")
argparser.add_argument('--checkpoint_weights', type=float, nargs='+',
help="checkpoint weights")
argparser.add_argument('--ckpts', type=str, help="Checkpoints, e.g. '105 211 317 420'")
argparser.add_argument('--dims', default=8192, type=int, required=False, help='Dimension used for grads')
argparser.add_argument('--checkpoint_weights', type=str, help="Average lr of the epoch (check in wandb)")
argparser.add_argument('--target_task_names', type=str,
nargs='+', help="The name of the target tasks")
nargs='+', help="The name of the target task(s)")
argparser.add_argument('--target_task_files', type=str, nargs='+',
help='The name of the validation file')
argparser.add_argument('--validation_gradient_path', type=str,
default="{} ckpt{}", help='The path to the validation gradient file')
argparser.add_argument('--output_path', type=str, default="selected_data",
help='The path to the output')
argparser.add_argument('--model_path', type=str, default="Model path (to initialize a tokenizer)",
help='The path to the output')

help='Can be a full path or a HF repo name')
argparser.add_argument('--val_task_load_method', type=str, default=None, help='The method to load the validation data, can be "hf", "local_hf", "local_json"')
argparser.add_argument('--model_path', type=str, required=True, help='Model path, e.g. llama2-7b-p0.05-lora-seed3')

args = argparser.parse_args()

@@ -44,23 +36,26 @@ def calculate_influence_score(training_info: torch.Tensor, validation_info: torc
training_info, validation_info.transpose(0, 1))
return influence_scores

checkpoints = [int(i) for i in args.ckpts.split(" ")]

checkpoint_weights = [float(i) for i in args.checkpoint_weights.split(" ")]
# renormalize the checkpoint weights
if sum(args.checkpoint_weights) != 1:
s = sum(args.checkpoint_weights)
args.checkpoint_weights = [i/s for i in args.checkpoint_weights]
if sum(checkpoint_weights) != 1:
s = sum(checkpoint_weights)
args.checkpoint_weights = [i/s for i in checkpoint_weights]

# calculate the influence score for each validation task
for target_task_name, target_task_file in zip(args.target_task_names, args.target_task_files):
val_dataset = get_dataset(task=target_task_name, data_dir=target_task_file, model_path=args.model_path,)
model_path = os.path.join("../out", args.model_path, f"checkpoint-{checkpoints[0]}")
val_dataset = get_dataset(task=target_task_name, data_dir=target_task_file,
model_path=model_path, val_task_load_method=args.val_task_load_method)
num_val_examples = len(val_dataset)
for train_file_name in args.train_file_names:
influence_score = 0
for i, ckpt in enumerate(args.ckpts):
for i, ckpt in enumerate(checkpoints):
# validation_path = args.validation_gradient_path.format(
# target_task_name, ckpt)
validation_path = args.validation_gradient_path.format(
target_task_name, ckpt)
validation_path = os.path.join("../grads", args.model_path, f"{target_task_name}-ckpt{ckpt}-sgd/dim{args.dims}")
if os.path.isdir(validation_path):
validation_path = os.path.join(validation_path, "all_orig.pt")
validation_info = torch.load(validation_path)
@@ -69,7 +64,7 @@ def calculate_influence_score(training_info: torch.Tensor, validation_info: torc
validation_info = torch.tensor(validation_info)
validation_info = validation_info.to(device).half()
# gradient_path = args.gradient_path.format(train_file_name, ckpt)
gradient_path = args.gradient_path.format(train_file_name, ckpt)
gradient_path = os.path.join("../grads", args.model_path, f"{train_file_name}-ckpt{ckpt}-adam/dim{args.dims}")
if os.path.isdir(gradient_path):
gradient_path = os.path.join(gradient_path, "all_orig.pt")
training_info = torch.load(gradient_path)
@@ -84,10 +79,9 @@ def calculate_influence_score(training_info: torch.Tensor, validation_info: torc
influence_score = influence_score.cpu().reshape(
influence_score.shape[0], num_val_examples, -1
).mean(-1).max(-1)[0]
output_dir = os.path.join(args.output_path, target_task_name)
output_dir = os.path.join("../selected_data", target_task_name)
if not os.path.exists(output_dir):
os.makedirs(output_dir)
output_file = os.path.join(
args.output_path, target_task_name, f"{train_file_name}_influence_score.pt")
output_file = os.path.join(output_dir, f"{train_file_name}_influence_score.pt")
torch.save(influence_score, output_file)
print("Saved influence score to {}".format(output_file))
2 changes: 1 addition & 1 deletion less/data_selection/write_selected_data.py
@@ -14,7 +14,7 @@ def parse_args():
argparser.add_argument('--target_task_names', type=str,
nargs='+', help='The name of the target task')
argparser.add_argument('--output_path', type=str,
default="selected_data", help='The path to the output')
default="../selected_data", help='The path to the output')
argparser.add_argument('--max_samples', type=int,
default=None, help='The maximum number of samples')
argparser.add_argument('--percentage', type=float, default=None,
42 changes: 42 additions & 0 deletions less/scripts/get_info/grad/get_eval_lora_grads.py
@@ -0,0 +1,42 @@
import os
import argparse
import subprocess

def main():
parser = argparse.ArgumentParser()
parser.add_argument("--task", type=str, required=True, help="Task name (e.g. tydiqa, mmlu), will be used to store the gradients")
parser.add_argument("--data_dir", type=str, required=True, help="Path to data directory, can also be a full path or a HF repo name")
parser.add_argument("--val_task_load_method", default=None,type=str, required=False, help="The method to load the validation data, can be 'hf', 'local_hf', 'local_json'")
**Collaborator:** Shouldn't this one be required? Looks like with `None` the script will fail at the dataset loading stage.

**Author:** You're right, with `None` it would fail (fixed it). I made it optional before because for hardcoded initial datasets like tydiqa we don't need to specify it, but realistically we will only run our own datasets, so it's required now.

parser.add_argument("--model_path", type=str, required=True, help="Path to model in the `out` directory, e.g. 'llama2-7b-p0.05-lora-seed3'")
parser.add_argument("--ckpts", type=str, required=True, help="List of checkpoints to compute gradients for, e.g. '105 211 317 420'")
parser.add_argument("--dims", default=8192, type=int, required=False, help="Dimension of projection")
args = parser.parse_args()

# Split checkpoint string into list of ints
ckpts = [int(x) for x in args.ckpts.split()]

# Process each checkpoint
for ckpt in ckpts:
# Create output directory if it doesn't exist
model = os.path.join("../out", args.model_path, f"checkpoint-{ckpt}")

# Construct output path with checkpoint
output_path = os.path.join("../grads", args.model_path, f"{args.task}-ckpt{ckpt}-sgd")

# Build command
cmd = [
"python3", "-m", "less.data_selection.get_info",
"--task", args.task,
"--info_type", "grads",
"--model_path", model,
"--output_path", output_path,
"--gradient_projection_dimension", str(args.dims),
"--gradient_type", "sgd",
"--data_dir", args.data_dir,
"--val_task_load_method", args.val_task_load_method
]

subprocess.run(cmd)

if __name__ == "__main__":
main()