Rewrite bash scripts into Python interfaces #2

Open · wants to merge 12 commits into `main`

Changes from 1 commit
README formatting
rinapch committed Dec 9, 2024
commit 7e3534e5295124b501f568f641e83e5fb4f77309
49 changes: 41 additions & 8 deletions README.md
@@ -56,7 +56,12 @@ The checkpoint will be saved in the `out` directory.
Once the initial warmup training stage is completed, we will collect gradients for the entire training dataset. For each checkpoint, our goal is to obtain the gradients of all the training data that we would like to select from. An example script is shown below.

```bash
python3 -m less/scripts/get_info/grad/get_train_lora_grads --train_data_name <str> --train_file <str> --model_path <str> --ckpts <str> --dims <int>
python3 -m less.scripts.get_info.grad.get_train_lora_grads \
--train_data_name <str> \
--train_file <str> \
--model_path <str> \
--ckpts <str> \
--dims <int>
```
Ideally, you would aim to create a datastore that covers the gradients of all the checkpoints and all the training data you wish to select from.
`train_data_name` is the name of the training data, which will be used to store the gradients; it should be descriptive enough for you to easily distinguish between different experiments.
@@ -65,21 +70,31 @@ Ideally, you would aim to create a datastore that encompasses a gradient of all
`ckpts` is the list of checkpoints to compute gradients for, e.g. `105 211 317 420`. The paper recommends using all four checkpoints.
`dims` is the dimension of projection, default is 8192.

The gradients will be saved in the `grads` directory.
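For illustration, a hypothetical invocation might look like the following. The data name and training file are placeholders (here, the Dolly subset used elsewhere in these examples), the model directory follows the `out` layout from the warmup step, and the exact argument format (e.g. whether the checkpoint list is quoted) may need adjusting to the script's interface:

```bash
python3 -m less.scripts.get_info.grad.get_train_lora_grads \
--train_data_name dolly \
--train_file ../data/train/processed/dolly/dolly_data.jsonl \
--model_path llama2-7b-p0.05-lora-seed3 \
--ckpts '105 211 317 420' \
--dims 8192
```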

### Step 3: Selecting data for a task
To select data for a particular downstream task, it's necessary to first prepare data specific to that task, using the same instruction-tuning prompt format as was employed during training. We have set up data loading modules for three evaluation datasets featured in our work: BBH, TydiQA, and MMLU. If you're interested in data selection for additional tasks, you can expand the [`less/data_selection/get_validation_dataset.py`](less/data_selection/get_validation_dataset.py) script to accommodate those tasks. Similar to obtaining gradients for training data, run the following script. The primary difference is that this process will yield SGD gradients for the validation data, following the formulation of the influence estimation.

You should obtain the gradients of the validation data for all the checkpoints you used for building the gradient datastore in the previous step.

```bash
python3 -m less/scripts/get_info/grad/get_eval_lora_grads --task <str> --data_dir <str> --val_task_load_method <str> --model_path <str> --ckpts <str> --dims <int>
python3 -m less.scripts.get_info.grad.get_eval_lora_grads \
--task <str> \
--data_dir <str> \
--val_task_load_method <str> \
--model_path <str> \
--ckpts <str> \
--dims <int>
```
`task` is the name of the task, which will be used to store the gradients.
`data_dir` is the path to the data directory. If you are using one of the predefined datasets (`bbh`, `tydiqa`, `mmlu`), it should point to the directory containing that dataset. If you are using your own custom dataset, it should be a full path to a JSONL file or a HF repo name.
`val_task_load_method` is the method used to load the validation data; it can be `hf`, `local_hf`, or `local_json`. You should specify this if you are using your own custom dataset. The default is `None`, in which case it is assumed that you are using one of the predefined datasets.
`model_path` is the path to the model in the `out` directory, e.g. `llama2-7b-p0.05-lora-seed3`.
`ckpts` is the list of checkpoints to compute gradients for, e.g. `105 211 317 420`.
`ckpts` is the list of checkpoints to compute gradients for, e.g. `'105 211 317 420'`.
`dims` is the dimension of projection, default is 8192.

The gradients will be saved in the `grads` directory.
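As an illustration, a hypothetical invocation for the predefined BBH task might look like this; the data directory is only a placeholder for wherever your BBH data lives, and the argument format (e.g. the quoted checkpoint list) may need adjusting to the script's interface:

```bash
python3 -m less.scripts.get_info.grad.get_eval_lora_grads \
--task bbh \
--data_dir ../data/eval/bbh \
--model_path llama2-7b-p0.05-lora-seed3 \
--ckpts '105 211 317 420' \
--dims 8192
```

Since `bbh` is one of the predefined datasets, `val_task_load_method` is left at its default of `None` here.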

After obtaining the gradients for the validation data, we can then select data for the task. The following script will calculate the influence score for each training data point, and select the top-k data points with the highest influence score.

```bash
@@ -96,17 +111,35 @@ SELECTED_DATA_OUTPUT_PATH="../selected_data"
MODEL_PATH=../out/llama2-7b-p0.05-lora-seed3/checkpoint-${CKPT}

./less/scripts/data_selection/matching.sh "$GRADIENT_PATH" "$TRAIN_FILE_NAMES" "$CKPTS" "$CHECKPOINT_WEIGHTS" "$VALIDATION_GRADIENT_PATH" "$TARGET_TASK_NAMES" "$TARGET_TASK_FILES" "$SELECTED_DATA_OUTPUT_PATH" "$MODEL_PATH"

python3 -m less.data_selection.matching \
--train_file_names <str> \
--ckpts <str> \
--dims <int> \
--checkpoint_weights <str> \
--target_task_names <str> \
--target_task_files <str> \
--val_task_load_method <str> \
--model_path <str>

```
`train_file_names` is a list of the training data names that you used to store the gradients.
`ckpts` is a list of checkpoints, e.g. `105 211 317 420` (space-separated values; the script parses them with `nargs='+'`).
`dims` is the dimension of the projection, default is 8192.
`checkpoint_weights` is a list of the average learning rate of each epoch (check in Wandb), e.g. `1.6877e-05 1.2859e-05 7.7030e-06 2.5616e-06`.
`target_task_names` is a list of the target task names that you used to store the gradients.
`target_task_files` can be a full path or a HF repo name; don't forget to specify `val_task_load_method` accordingly.
`model_path` is the path to the model in the `out` directory, e.g. `llama2-7b-p0.05-lora-seed3`.
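Putting this together, a hypothetical run that scores the Dolly and OASST1 gradient stores against BBH could look like the following; the data names and the task file path are placeholders, while the checkpoint weights reuse the example values above:

```bash
python3 -m less.data_selection.matching \
--train_file_names dolly oasst1 \
--ckpts 105 211 317 420 \
--dims 8192 \
--checkpoint_weights 1.6877e-05 1.2859e-05 7.7030e-06 2.5616e-06 \
--target_task_names bbh \
--target_task_files ../data/eval/bbh \
--model_path llama2-7b-p0.05-lora-seed3
```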

The influence score for each training data point will be saved in the `../selected_data` directory. You can use the following script to select the top-k data points with the highest influence score.

```bash
python3 -m less.data_selection.write_selected_data \
--target_task_names ${TARGET_TASK_NAMES} \
--train_file_names ${TRAIN_FILE_NAMES} \
--train_files ../data/train/processed/dolly/dolly_data.jsonl ../data/train/processed/oasst1/oasst1_data.jsonl \
--output_path $SELECTED_DATA_OUTPUT_PATH \
--percentage 0.05
--target_task_names <str> \
--train_file_names <str> \
--train_files <str> \
--output_path <str> \
--percentage <float>
```
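For example, to keep the top 5% of the Dolly and OASST1 training data for BBH, reusing the paths shown above, you could run:

```bash
python3 -m less.data_selection.write_selected_data \
--target_task_names bbh \
--train_file_names dolly oasst1 \
--train_files ../data/train/processed/dolly/dolly_data.jsonl ../data/train/processed/oasst1/oasst1_data.jsonl \
--output_path ../selected_data \
--percentage 0.05
```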

### Step 4: Train with your selected data
41 changes: 16 additions & 25 deletions less/data_selection/matching.py
@@ -7,25 +7,17 @@

argparser = argparse.ArgumentParser(
description='Script for selecting the data for training')
argparser.add_argument('--gradient_path', type=str, default="{} ckpt{}",
help='The path to the gradient file')
argparser.add_argument('--train_file_names', type=str, nargs='+',
argparser.add_argument('--train_file_names', type=str, nargs='+',
help='The name of the training file')
argparser.add_argument('--ckpts', type=int, nargs='+',
help="Checkpoint numbers.")
argparser.add_argument('--checkpoint_weights', type=float, nargs='+',
help="checkpoint weights")
argparser.add_argument('--target_task_names', type=str,
nargs='+', help="The name of the target tasks")
argparser.add_argument('--target_task_files', type=str, nargs='+',
help='The name of the validation file')
argparser.add_argument('--validation_gradient_path', type=str,
default="{} ckpt{}", help='The path to the validation gradient file')
argparser.add_argument('--output_path', type=str, default="selected_data",
help='The path to the output')
argparser.add_argument('--model_path', type=str, default="Model path (to initialize a tokenizer)",
help='The path to the output')

argparser.add_argument('--ckpts', type=int, nargs='+', help="Checkpoints, e.g. 105 211 317 420")
argparser.add_argument('--dims', default=8192, type=int, required=False, help='Dimension used for grads')
argparser.add_argument('--checkpoint_weights', type=float, nargs='+', help="Average lr of the epoch (check in wandb)")
argparser.add_argument('--target_task_names', type=str,
nargs='+', help="The names of the target tasks")
argparser.add_argument('--target_task_files', type=str, nargs='+',
help='Can be a full path or a HF repo name')
argparser.add_argument('--val_task_load_method', type=str, default=None, help='The method to load the validation data, can be "hf", "local_hf", "local_json"')
argparser.add_argument('--model_path', type=str, required=True, help='Model path, e.g. llama2-7b-p0.05-lora-seed3')

args = argparser.parse_args()

@@ -52,15 +44,15 @@ def calculate_influence_score(training_info: torch.Tensor, validation_info: torc

# calculate the influence score for each validation task
for target_task_name, target_task_file in zip(args.target_task_names, args.target_task_files):
val_dataset = get_dataset(task=target_task_name, data_dir=target_task_file, model_path=args.model_path,)
val_dataset = get_dataset(task=target_task_name, data_dir=target_task_file,
model_path=args.model_path, val_task_load_method=args.val_task_load_method)
num_val_examples = len(val_dataset)
for train_file_name in args.train_file_names:
influence_score = 0
for i, ckpt in enumerate(args.ckpts):
# validation_path = args.validation_gradient_path.format(
# target_task_name, ckpt)
validation_path = args.validation_gradient_path.format(
target_task_name, ckpt)
validation_path = os.path.join("../grads", args.model_path, f"{target_task_name}_ckpt{ckpt}_sgd/dim{args.dims}")
if os.path.isdir(validation_path):
validation_path = os.path.join(validation_path, "all_orig.pt")
validation_info = torch.load(validation_path)
@@ -69,7 +61,7 @@ def calculate_influence_score(training_info: torch.Tensor, validation_info: torc
validation_info = torch.tensor(validation_info)
validation_info = validation_info.to(device).half()
# gradient_path = args.gradient_path.format(train_file_name, ckpt)
gradient_path = args.gradient_path.format(train_file_name, ckpt)
gradient_path = os.path.join("../grads", args.model_path, f"{train_file_name}_ckpt{ckpt}_adam/dim{args.dims}")
if os.path.isdir(gradient_path):
gradient_path = os.path.join(gradient_path, "all_orig.pt")
training_info = torch.load(gradient_path)
@@ -84,10 +76,9 @@ def calculate_influence_score(training_info: torch.Tensor, validation_info: torc
influence_score = influence_score.cpu().reshape(
influence_score.shape[0], num_val_examples, -1
).mean(-1).max(-1)[0]
output_dir = os.path.join(args.output_path, target_task_name)
output_dir = os.path.join("../selected_data", target_task_name)
if not os.path.exists(output_dir):
os.makedirs(output_dir)
output_file = os.path.join(
args.output_path, target_task_name, f"{train_file_name}_influence_score.pt")
output_file = os.path.join(output_dir, f"{train_file_name}_influence_score.pt")
torch.save(influence_score, output_file)
print("Saved influence score to {}".format(output_file))