DPO training script:

```python
import os
from dataclasses import dataclass, field
from typing import Dict, Optional

import torch
import pandas as pd
from accelerate import Accelerator
from datasets import Dataset, load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, set_seed, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

from utils import dpo_dataset


# Define and parse arguments.
@dataclass
class ScriptArguments:
    """
    The arguments for the DPO training script.
    """

    # data parameters
    beta: Optional[float] = field(default=0.1, metadata={"help": "the beta parameter for DPO loss"})

    # training parameters
    model_name_or_path: Optional[str] = field(
        default="/root/autodl-tmp/patentClassDPOTrain/DPO_output3/finetuend-Qwen",
        metadata={"help": "the location of the SFT model name or path"},
    )
    ref_model_name_or_path: Optional[str] = field(
        default="/root/autodl-tmp/patentClassDPOTrain/output3/finetuend-Qwen",
        metadata={"help": "the location of the SFT model name or path"},
    )
    dataset_path: Optional[str] = field(default="/root/autodl-tmp/patentClassDPOTrain/DPO_dataset3")
    learning_rate: Optional[float] = field(default=1e-4, metadata={"help": "optimizer learning rate"})
    lr_scheduler_type: Optional[str] = field(default="cosine", metadata={"help": "the lr scheduler type"})
    warmup_steps: Optional[int] = field(default=100, metadata={"help": "the number of warmup steps"})
    weight_decay: Optional[float] = field(default=0.05, metadata={"help": "the weight decay"})
    optimizer_type: Optional[str] = field(default="paged_adamw_32bit", metadata={"help": "the optimizer type"})

    per_device_train_batch_size: Optional[int] = field(default=2, metadata={"help": "train batch size per device"})
    per_device_eval_batch_size: Optional[int] = field(default=1, metadata={"help": "eval batch size per device"})
    gradient_accumulation_steps: Optional[int] = field(
        default=4, metadata={"help": "the number of gradient accumulation steps"}
    )
    gradient_checkpointing: Optional[bool] = field(
        default=True, metadata={"help": "whether to use gradient checkpointing"}
    )
    gradient_checkpointing_use_reentrant: Optional[bool] = field(
        default=False, metadata={"help": "whether to use reentrant for gradient checkpointing"}
    )

    lora_alpha: Optional[float] = field(default=16, metadata={"help": "the lora alpha parameter"})
    lora_dropout: Optional[float] = field(default=0.05, metadata={"help": "the lora dropout parameter"})
    lora_r: Optional[int] = field(default=8, metadata={"help": "the lora r parameter"})

    max_prompt_length: Optional[int] = field(default=768, metadata={"help": "the maximum prompt length"})
    max_length: Optional[int] = field(default=896, metadata={"help": "the maximum sequence length"})
    max_steps: Optional[int] = field(default=2500, metadata={"help": "max number of training steps"})
    num_epochs: Optional[int] = field(default=3)
    logging_steps: Optional[int] = field(default=10, metadata={"help": "the logging frequency"})
    save_steps: Optional[int] = field(default=500, metadata={"help": "the saving frequency"})
    eval_steps: Optional[int] = field(default=100, metadata={"help": "the evaluation frequency"})

    output_dir: Optional[str] = field(
        default="/root/autodl-tmp/patentClassDPOTrain/DPO_output2", metadata={"help": "the output directory"}
    )
    log_freq: Optional[int] = field(default=1, metadata={"help": "the logging frequency"})
    load_in_4bit: Optional[bool] = field(default=True, metadata={"help": "whether to load the model in 4bit"})
    model_dtype: Optional[str] = field(
        default="bfloat16", metadata={"help": "model_dtype[float16, bfloat16, float] for loading."}
    )

    # instrumentation
    report_to: Optional[str] = field(
        default="wandb",
        metadata={
            "help": 'The list of integrations to report the results and logs to. Supported platforms are `"azure_ml"`,'
            '`"comet_ml"`, `"mlflow"`, `"neptune"`, `"tensorboard"`,`"clearml"` and `"wandb"`. '
            'Use `"all"` to report to all integrations installed, `"none"` for no integrations.'
        },
    )
    # debug argument for distributed training
    ignore_bias_buffers: Optional[bool] = field(
        default=False,
        metadata={
            "help": "fix for DDP issues with LM bias/mask buffers - invalid scalar type,`inplace operation. See"
            "https://github.com/huggingface/transformers/issues/22482#issuecomment-1595790992"
        },
    )
    seed: Optional[int] = field(
        default=0, metadata={"help": "Random seed that will be set at the beginning of training."}
    )


def get_dataset(
    data_dir: str = "data/rl",
    tokenizer=None,
    max_length=1024,
    max_prompt_length=768,
    cache_dir: Optional[str] = None,
    num_proc=24,
) -> Dataset:
    def process_func(example):
        # 构造DPO训练格式数据
        return {
            "prompt": f"你是一个专利分类领域的专家,你会接收到如下的专利内容:专利标题:{example['title']}。\n摘要:{example['abstract']}。\n权利要求:{example['claim']},请输出该专利的专利分类号主IPC以及其附加IPC,按重要性排序"
            + "\n\nResponse:",
            "chosen": example["chosen"],
            "rejected": example["reject"],
        }

    eval_df = pd.read_json(os.path.join(data_dir, "eval_dataset.json"))
    train_df = pd.read_json(os.path.join(data_dir, "training_dataset.json"))
    valid_data = Dataset.from_pandas(eval_df)
    train_data = Dataset.from_pandas(train_df)

    original_columns = train_data.column_names
    train_data = train_data.map(
        process_func,
        # fn_kwargs={"tokenizer": tokenizer, "max_length": args.seq_length, "q_length": args.q_length, "a_length": args.a_length},
        remove_columns=original_columns,
    )
    valid_data = valid_data.map(
        process_func,
        # fn_kwargs={"tokenizer": tokenizer, "max_length": args.seq_length, "q_length": args.q_length, "a_length": args.a_length},
        remove_columns=original_columns,
    )

    return (
        dpo_dataset(data=train_data, tokenizer=tokenizer, max_seq_length=max_length, max_prompt_length=max_prompt_length),
        dpo_dataset(data=valid_data, tokenizer=tokenizer, max_seq_length=max_length, max_prompt_length=max_prompt_length),
    )


if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    script_args = parser.parse_args_into_dataclasses()[0]
    set_seed(script_args.seed)

    print("loading_data", script_args.dataset_path)
    print("output_path ", script_args.output_dir)

    # 1. load a pretrained model
    torch_dtype = torch.float
    if script_args.model_dtype == "float16":
        torch_dtype = torch.float16
    elif script_args.model_dtype == "bfloat16":
        torch_dtype = torch.bfloat16

    # Set up model configuration
    bnb_config = None
    if script_args.load_in_4bit:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )

    model = AutoModelForCausalLM.from_pretrained(
        script_args.model_name_or_path,
        low_cpu_mem_usage=True,
        torch_dtype=torch_dtype,
        quantization_config=bnb_config,
        device_map="auto",
    )
    # model_ref = AutoModelForCausalLM.from_pretrained(
    #     script_args.ref_model_name_or_path,
    #     low_cpu_mem_usage=True,
    #     torch_dtype=torch_dtype,
    #     quantization_config=bnb_config,
    #     device_map={"": Accelerator().local_process_index},
    # )
    model.config.use_cache = False

    if script_args.ignore_bias_buffers:
        # torch distributed hack
        model._ddp_params_and_buffers_to_ignore = [
            name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool
        ]

    tokenizer = AutoTokenizer.from_pretrained(script_args.model_name_or_path)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    train_dataset, eval_dataset = get_dataset(
        script_args.dataset_path, tokenizer, script_args.max_length, script_args.max_prompt_length
    )

    # 4. initialize training arguments:
    training_args = DPOConfig(
        per_device_train_batch_size=script_args.per_device_train_batch_size,
        per_device_eval_batch_size=script_args.per_device_eval_batch_size,
        # max_steps=script_args.max_steps,
        logging_steps=script_args.logging_steps,
        # save_steps=script_args.save_steps,
        gradient_accumulation_steps=script_args.gradient_accumulation_steps,
        gradient_checkpointing=script_args.gradient_checkpointing,
        learning_rate=script_args.learning_rate,
        max_length=script_args.max_length,
        max_prompt_length=script_args.max_prompt_length,
        num_train_epochs=script_args.num_epochs,
        eval_strategy="steps",
        eval_steps=script_args.eval_steps,
        output_dir=script_args.output_dir,
        report_to=script_args.report_to,
        lr_scheduler_type=script_args.lr_scheduler_type,
        warmup_steps=script_args.warmup_steps,
        optim=script_args.optimizer_type,
        bf16=True,
        remove_unused_columns=False,
        run_name="dpo_qwen2",
        gradient_checkpointing_kwargs=dict(use_reentrant=script_args.gradient_checkpointing_use_reentrant),
        seed=script_args.seed,
    )

    peft_config = LoraConfig(
        r=script_args.lora_r,
        lora_alpha=script_args.lora_alpha,
        lora_dropout=script_args.lora_dropout,
        target_modules=[
            "q_proj",
            "v_proj",
            "k_proj",
            "out_proj",
            "fc_in",
            "fc_out",
            "wte",
        ],
        bias="none",
        task_type="CAUSAL_LM",
    )

    # 5. initialize the DPO trainer
    dpo_trainer = DPOTrainer(
        model,
        ref_model=None,
        args=training_args,
        beta=script_args.beta,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        processing_class=tokenizer,
        peft_config=peft_config,
    )

    # 6. train
    dpo_trainer.train()
    dpo_trainer.save_model(script_args.output_dir)

    # 7. save
    output_dir = os.path.join(script_args.output_dir, "final_checkpoint")
    dpo_trainer.model.save_pretrained(output_dir)
```
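One version-dependent detail worth double-checking (not something the logs alone confirm): in recent TRL releases the DPO `beta` is set on `DPOConfig` rather than passed to `DPOTrainer`, and when `ref_model=None` is combined with a `peft_config`, the trainer is supposed to use the base weights with the adapters disabled as the implicit reference model. A minimal sketch of that variant, reusing the objects already built in the script above and assuming a recent TRL version:

```python
# Sketch only: assumes a recent TRL release where `beta` lives on DPOConfig.
# `script_args`, `model`, `tokenizer`, `train_dataset`, `eval_dataset`, and
# `peft_config` are the objects constructed in the script above.
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    output_dir=script_args.output_dir,
    beta=script_args.beta,                      # DPO temperature configured here, not on the trainer
    learning_rate=script_args.learning_rate,
    per_device_train_batch_size=script_args.per_device_train_batch_size,
    gradient_accumulation_steps=script_args.gradient_accumulation_steps,
    max_length=script_args.max_length,
    max_prompt_length=script_args.max_prompt_length,
    bf16=True,
    remove_unused_columns=False,
)

dpo_trainer = DPOTrainer(
    model,                      # LoRA adapters are attached via peft_config
    ref_model=None,             # base weights with adapters disabled serve as the reference
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
```

Depending on the installed TRL version, passing `beta=` directly to `DPOTrainer` may only trigger a deprecation warning or may fail outright, so it is worth checking which behaviour applies here.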
Dataset construction code (`utils.py`):

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
########################################################################
#
# Copyright (c) 2024 Sugar. All Rights Reserved
#
########################################################################
"""
File: utils.py
Desc: 数据处理代码
Author: sugar(@google.com)
Date: 2024-03-27 15:19
"""
import torch
from loguru import logger
from datasets import load_dataset
from torch.utils.data import Dataset, DataLoader
from transformers import TrainingArguments, TrainerCallback


class dpo_dataset(Dataset):
    def __init__(self, data, tokenizer, max_seq_length, max_prompt_length):
        self.tokenizer = tokenizer
        self.max_seq_length = max_seq_length
        self.data_list = data
        self.max_prompt_length = max_prompt_length

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, index):
        data = self.data_list[index]
        prompt_encoded = self.tokenizer.encode(
            '<|im_start|>' + data['prompt'] + '<|im_end|>',
            add_special_tokens=False,
            truncation=True,
            max_length=self.max_prompt_length
        )
        prompt_input_ids = prompt_encoded[:self.max_prompt_length]
        prompt_attention_mask = [1] * len(prompt_input_ids)
        padding_length = self.max_prompt_length - len(prompt_input_ids)
        if padding_length > 0:
            prompt_input_ids += [self.tokenizer.pad_token_id] * padding_length
            prompt_attention_mask += [0] * padding_length

        chosen_input_ids = self.tokenizer.encode(data['chosen'], add_special_tokens=False)
        rejected_input_ids = self.tokenizer.encode(data['rejected'], add_special_tokens=False)

        chosen_input_ids = prompt_input_ids + chosen_input_ids + [self.tokenizer.pad_token_id]
        rejected_input_ids = prompt_input_ids + rejected_input_ids + [self.tokenizer.pad_token_id]
        chosen_labels = [-100] * self.max_prompt_length + chosen_input_ids[self.max_prompt_length:]
        rejected_labels = [-100] * self.max_prompt_length + rejected_input_ids[self.max_prompt_length:]

        inputs = dict(
            prompt_input_ids=prompt_input_ids,
            prompt_attention_mask=prompt_attention_mask,
            chosen_input_ids=chosen_input_ids,
            chosen_attention_mask=[1] * len(chosen_input_ids),
            chosen_labels=chosen_labels,
            rejected_input_ids=rejected_input_ids,
            rejected_attention_mask=[1] * len(rejected_input_ids),
            rejected_labels=rejected_labels,
        )
        return inputs

    def map(self, func, **kwargs):
        return self


#
class dpo_dataset(Dataset):
    def __init__(self, data, tokenizer, max_seq_length, max_prompt_length):
        self.tokenizer = tokenizer
        self.max_seq_length = max_seq_length
        # 打开json文件 用transformers
        self.data_list = data
        self.max_prompt_length = max_prompt_length

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, index):
        # 取出data_list的一条数据 --> {"chosen":xxx,"rejected":xxx,"prompt":xxx} 一条数据是这样的格式
        data = self.data_list[index]
        # 对prompt reject和chosen进行tokenize 判断是否需要截断 保证所有的input_ids都一样 不够长度的直接padding
        # 适配qwen 的 template 添加eos token
        prompt_input_ids = self.tokenizer.encode('<|im_start|>' + data['prompt'] + '<|im_end|>', add_special_tokens=False)
        chosen_input_ids = self.tokenizer.encode(data['chosen'], add_special_tokens=False)
        rejected_input_ids = self.tokenizer.encode(data['rejected'], add_special_tokens=False)
        prompt_input_ids = prompt_input_ids + [self.tokenizer.pad_token_id]
        # 设置labels
        chosen_labels = [-100] * len(prompt_input_ids) + chosen_input_ids + [self.tokenizer.pad_token_id]
        rejected_labels = [-100] * len(prompt_input_ids) + rejected_input_ids + [self.tokenizer.pad_token_id]
        chosen_input_ids = prompt_input_ids + chosen_input_ids + [self.tokenizer.pad_token_id]
        rejected_input_ids = prompt_input_ids + rejected_input_ids + [self.tokenizer.pad_token_id]
        assert len(chosen_labels) == len(chosen_input_ids)
        assert len(rejected_labels) == len(rejected_input_ids)

        inputs = dict(
            prompt_input_ids=prompt_input_ids,
            prompt_attention_mask=[1] * len(prompt_input_ids),
            chosen_input_ids=chosen_input_ids,
            chosen_attention_mask=[1] * len(chosen_input_ids),
            chosen_labels=chosen_labels,
            rejected_input_ids=rejected_input_ids,
            rejected_attention_mask=[1] * len(rejected_input_ids),
            rejected_labels=rejected_labels,
        )
        return inputs

    def map(self, func, **kwargs):
        return self
```
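Two observations on this file. First, as pasted it defines `dpo_dataset` twice, and in Python the second class definition replaces the first, so the variant without prompt truncation and padding is the one that actually gets imported. Second, a quick sanity check while debugging is to pull a single item out of the class and verify that the lengths line up and that only the prompt positions are masked in the labels. A sketch only, using toy data and a placeholder model name (point it at the actual SFT checkpoint):

```python
# Illustrative check only; "Qwen/Qwen2-7B-Instruct" is a placeholder tokenizer.
from transformers import AutoTokenizer
from utils import dpo_dataset

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# One toy record in the same {"prompt", "chosen", "rejected"} format the class expects.
toy = [{"prompt": "示例提示", "chosen": "好的回答", "rejected": "差的回答"}]
ds = dpo_dataset(data=toy, tokenizer=tokenizer, max_seq_length=896, max_prompt_length=768)
sample = ds[0]

# Lengths of ids and labels must match, and the masked (-100) positions should
# cover exactly the prompt part of the sequence.
assert len(sample["chosen_input_ids"]) == len(sample["chosen_labels"])
assert len(sample["rejected_input_ids"]) == len(sample["rejected_labels"])
print("prompt tokens:", len(sample["prompt_input_ids"]))
print("masked label positions:", sum(1 for x in sample["chosen_labels"] if x == -100))
print("decoded chosen sequence:", tokenizer.decode(sample["chosen_input_ids"]))
```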
I am facing the same issue with ORPO; I came here looking for a reason why it is happening.
Is this an error, and should we just restart the trainer?
Prerequisites
Backend
Local
Interface Used
CLI
CLI Command
No response
UI Screenshots & Parameters
No response
Error Logs
These are the logs from my training run. I've found that the model after DPO training actually performs worse than the plain fine-tuned model. Could someone please take a look and tell me whether something is wrong with my training?
{'loss': 0.6931, 'grad_norm': 2.299546957015991, 'learning_rate': 1e-05, 'rewards/chosen': -0.05413971096277237, 'rewards/rejected': -0.054256755858659744, 'rewards/accuracies': 0.25, 'rewards/margins': 0.00011704633652698249, 'logps/chosen': -252.8257598876953, 'logps/rejected': -254.96127319335938, 'logits/chosen': nan, 'logits/rejected': nan, 'epoch': 0.03}
{'loss': 0.6928, 'grad_norm': 0.8526943922042847, 'learning_rate': 2e-05, 'rewards/chosen': -0.08855850994586945, 'rewards/rejected': -0.08932790160179138, 'rewards/accuracies': 0.20000000298023224, 'rewards/margins': 0.0007693897932767868, 'logps/chosen': -262.14874267578125, 'logps/rejected': -264.9452819824219, 'logits/chosen': nan, 'logits/rejected': nan, 'epoch': 0.05}
{'loss': 0.691, 'grad_norm': 0.769097626209259, 'learning_rate': 3e-05, 'rewards/chosen': -0.12483439594507217, 'rewards/rejected': -0.12917408347129822, 'rewards/accuracies': 0.32499998807907104, 'rewards/margins': 0.004339695908129215, 'logps/chosen': -281.1364440917969, 'logps/rejected': -284.90081787109375, 'logits/chosen': nan, 'logits/rejected': nan, 'epoch': 0.08}
{'loss': 0.6846, 'grad_norm': 1.830174446105957, 'learning_rate': 4e-05, 'rewards/chosen': -0.324862539768219, 'rewards/rejected': -0.3433191180229187, 'rewards/accuracies': 0.32499998807907104, 'rewards/margins': 0.01845661923289299, 'logps/chosen': -304.3634338378906, 'logps/rejected': -308.95806884765625, 'logits/chosen': nan, 'logits/rejected': nan, 'epoch': 0.11}
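The `logits/chosen: nan` and `logits/rejected: nan` entries stand out in these logs. As a first isolation step, it may be worth running a single forward pass outside the trainer to see whether the policy checkpoint already produces NaN logits on its own. A sketch, assuming the same checkpoint path and 4-bit settings as in the script above:

```python
# Standalone NaN probe; model_path is the SFT checkpoint from the script above
# and is only an example of where to point it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "/root/autodl-tmp/patentClassDPOTrain/DPO_output3/finetuend-Qwen"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("专利标题:测试。\n\nResponse:", return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits

print("any NaN in logits:", torch.isnan(logits).any().item())
print("any Inf in logits:", torch.isinf(logits).any().item())
```

If this already reports NaNs, the problem most likely lies in the SFT checkpoint or the quantization/dtype settings rather than in the DPO stage itself.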
Additional Information
I'm new to this, so I'd really appreciate any guidance. Thanks!
What I'd like to understand is whether, for supervised fine-tuning or DPO training, I always need to construct input_ids and the related tensors myself and then pass that in as the training dataset, or what the usual workflow is. I'm not clear on this part.
My DPO training code is the script shown at the top of this issue.
The dataset-construction code is the `utils.py` block above; does the data need to be processed like this before it is passed in?
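For what it's worth (this is general TRL usage, not something taken from your code): with recent TRL versions you normally do not have to build input_ids or labels yourself for DPO. `DPOTrainer` accepts a `datasets.Dataset` whose rows are plain-text `prompt`, `chosen`, and `rejected` columns and performs tokenization, truncation, and padding internally, driven by `max_length` and `max_prompt_length` on `DPOConfig`. A minimal sketch of that flow, reusing the data files from the script above (the prompt template is abbreviated here):

```python
# Sketch: let DPOTrainer handle tokenization. training_dataset.json is the same
# file read in get_dataset() above; the full prompt template from that function
# should be substituted for the abbreviated one below.
import pandas as pd
from datasets import Dataset

def to_text_columns(example):
    return {
        "prompt": f"专利标题:{example['title']}。\n摘要:{example['abstract']}。\n权利要求:{example['claim']}\n\nResponse:",
        "chosen": example["chosen"],
        "rejected": example["reject"],
    }

train_df = pd.read_json("/root/autodl-tmp/patentClassDPOTrain/DPO_dataset3/training_dataset.json")
train_data = Dataset.from_pandas(train_df)
train_data = train_data.map(to_text_columns, remove_columns=train_data.column_names)

# Pass `train_data` directly as DPOTrainer(train_dataset=train_data, ...);
# no custom dpo_dataset wrapper or manual input_ids/labels construction is needed.
```

Dropping the custom wrapper also removes one place where the padding and label-masking logic could go wrong.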