'logits/chosen': nan, 'logits/rejected': nan during DPO training #818

Open
2 tasks done
ZengQQQ opened this issue Dec 3, 2024 · 1 comment
Labels: bug (Something isn't working)


ZengQQQ commented Dec 3, 2024

Prerequisites

  • I have read the documentation.
  • I have checked other issues for similar problems.

Backend

Local

Interface Used

CLI

CLI Command

No response

UI Screenshots & Parameters

No response

Error Logs

These values show up during training, and I found that the DPO-trained model actually performs worse than the plainly fine-tuned one. Could someone please help me check whether there is a problem with my training?
{'loss': 0.6931, 'grad_norm': 2.299546957015991, 'learning_rate': 1e-05, 'rewards/chosen': -0.05413971096277237, 'rewards/rejected': -0.054256755858659744, 'rewards/accuracies': 0.25, 'rewards/margins': 0.00011704633652698249, 'logps/chosen': -252.8257598876953, 'logps/rejected': -254.96127319335938, 'logits/chosen': nan, 'logits/rejected': nan, 'epoch': 0.03}
{'loss': 0.6928, 'grad_norm': 0.8526943922042847, 'learning_rate': 2e-05, 'rewards/chosen': -0.08855850994586945, 'rewards/rejected': -0.08932790160179138, 'rewards/accuracies': 0.20000000298023224, 'rewards/margins': 0.0007693897932767868, 'logps/chosen': -262.14874267578125, 'logps/rejected': -264.9452819824219, 'logits/chosen': nan, 'logits/rejected': nan, 'epoch': 0.05}
{'loss': 0.691, 'grad_norm': 0.769097626209259, 'learning_rate': 3e-05, 'rewards/chosen': -0.12483439594507217, 'rewards/rejected': -0.12917408347129822, 'rewards/accuracies': 0.32499998807907104, 'rewards/margins': 0.004339695908129215, 'logps/chosen': -281.1364440917969, 'logps/rejected': -284.90081787109375, 'logits/chosen': nan, 'logits/rejected': nan, 'epoch': 0.08}
{'loss': 0.6846, 'grad_norm': 1.830174446105957, 'learning_rate': 4e-05, 'rewards/chosen': -0.324862539768219, 'rewards/rejected': -0.3433191180229187, 'rewards/accuracies': 0.32499998807907104, 'rewards/margins': 0.01845661923289299, 'logps/chosen': -304.3634338378906, 'logps/rejected': -308.95806884765625, 'logits/chosen': nan, 'logits/rejected': nan, 'epoch': 0.11}
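
To make it easier to pin down the first step at which any logged metric turns NaN, a small callback along these lines could be attached to the trainer (a minimal sketch; NanLoggingCallback is a name I made up, and it only assumes the standard transformers TrainerCallback / TrainerControl API):

import math
from transformers import TrainerCallback

class NanLoggingCallback(TrainerCallback):
    """Stop training as soon as any logged metric becomes NaN."""
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs:
            nan_keys = [k for k, v in logs.items()
                        if isinstance(v, float) and math.isnan(v)]
            if nan_keys:
                print(f"NaN detected at step {state.global_step}: {nan_keys}")
                control.should_training_stop = True
        return control

# usage: dpo_trainer.add_callback(NanLoggingCallback())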

Additional Information

I am a beginner, so I would really appreciate any guidance. Thanks a lot!

What I would also like to understand is: when fine-tuning or doing DPO training, do I need to construct input_ids (and the other tensors) myself and then feed that in as the dataset, or what is the usual workflow? I am not clear on this process.
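
From what I understand of recent TRL versions, the tokenization can usually be left to DPOTrainer itself: the dataset only needs plain "prompt" / "chosen" / "rejected" string columns, and the trainer tokenizes them with the tokenizer passed as processing_class, using max_length / max_prompt_length from DPOConfig. A rough sketch of that flow for comparison with my code (the file name and helper name below are just placeholders):

import pandas as pd
from datasets import Dataset

def to_dpo_format(example):
    # keep only the three string columns DPOTrainer expects
    return {
        "prompt": f"专利标题:{example['title']}\n摘要:{example['abstract']}\n权利要求:{example['claim']}\n\nResponse:",
        "chosen": example["chosen"],
        "rejected": example["reject"],
    }

train_df = pd.read_json("training_dataset.json")  # placeholder path
raw_train = Dataset.from_pandas(train_df)
train_dataset = raw_train.map(to_dpo_format, remove_columns=raw_train.column_names)
# train_dataset can then be passed directly to DPOTrainer; no manual
# input_ids / attention_mask / labels construction should be needed.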

Below is my DPO training code:

import os
from dataclasses import dataclass, field
from typing import Dict, Optional
import torch
from accelerate import Accelerator
from datasets import Dataset, load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, set_seed, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer
import pandas as pd
from utils import dpo_dataset

# Define and parse arguments.
@dataclass
class ScriptArguments:
    """
    The arguments for the DPO training script.
    """

    # data parameters
    beta: Optional[float] = field(default=0.1, metadata={"help": "the beta parameter for DPO loss"})

    # training parameters
    model_name_or_path: Optional[str] = field(
        default="/root/autodl-tmp/patentClassDPOTrain/DPO_output3/finetuend-Qwen",
        metadata={"help": "the location of the SFT model name or path"},
    )
    ref_model_name_or_path: Optional[str] = field(
        default="/root/autodl-tmp/patentClassDPOTrain/output3/finetuend-Qwen",
        metadata={"help": "the location of the reference (SFT) model name or path"},
    )
    
    dataset_path:Optional[str] = field(default="/root/autodl-tmp/patentClassDPOTrain/DPO_dataset3")
    learning_rate: Optional[float] = field(default=1e-4, metadata={"help": "optimizer learning rate"})
    lr_scheduler_type: Optional[str] = field(default="cosine", metadata={"help": "the lr scheduler type"})
    warmup_steps: Optional[int] = field(default=100, metadata={"help": "the number of warmup steps"})
    weight_decay: Optional[float] = field(default=0.05, metadata={"help": "the weight decay"})
    optimizer_type: Optional[str] = field(default="paged_adamw_32bit", metadata={"help": "the optimizer type"})

    per_device_train_batch_size: Optional[int] = field(default=2, metadata={"help": "train batch size per device"})
    per_device_eval_batch_size: Optional[int] = field(default=1, metadata={"help": "eval batch size per device"})
    gradient_accumulation_steps: Optional[int] = field(
        default=4, metadata={"help": "the number of gradient accumulation steps"}
    )
    gradient_checkpointing: Optional[bool] = field(
        default=True, metadata={"help": "whether to use gradient checkpointing"}
    )

    gradient_checkpointing_use_reentrant: Optional[bool] = field(
        default=False, metadata={"help": "whether to use reentrant for gradient checkpointing"}
    )

    lora_alpha: Optional[float] = field(default=16, metadata={"help": "the lora alpha parameter"})
    lora_dropout: Optional[float] = field(default=0.05, metadata={"help": "the lora dropout parameter"})
    lora_r: Optional[int] = field(default=8, metadata={"help": "the lora r parameter"})

    max_prompt_length: Optional[int] = field(default=768, metadata={"help": "the maximum prompt length"})
    max_length: Optional[int] = field(default=896, metadata={"help": "the maximum sequence length"})
    max_steps: Optional[int] = field(default=2500, metadata={"help": "max number of training steps"})
    num_epochs:Optional[int] = field(default=3)
    logging_steps: Optional[int] = field(default=10, metadata={"help": "the logging frequency"})
    save_steps: Optional[int] = field(default=500, metadata={"help": "the saving frequency"})
    eval_steps: Optional[int] = field(default=100, metadata={"help": "the evaluation frequency"})

    output_dir: Optional[str] = field(default="/root/autodl-tmp/patentClassDPOTrain/DPO_output2", metadata={"help": "the output directory"})
    log_freq: Optional[int] = field(default=1, metadata={"help": "the logging frequency"})
    load_in_4bit: Optional[bool] = field(default=True, metadata={"help": "whether to load the model in 4bit"})
    model_dtype: Optional[str] = field(
        default="bfloat16", metadata={"help": "model_dtype[float16, bfloat16, float] for loading."}
    )
    # instrumentation
    report_to: Optional[str] = field(
        default="wandb",
        metadata={
            "help": 'The list of integrations to report the results and logs to. Supported platforms are `"azure_ml"`,'
            '`"comet_ml"`, `"mlflow"`, `"neptune"`, `"tensorboard"`,`"clearml"` and `"wandb"`. '
            'Use `"all"` to report to all integrations installed, `"none"` for no integrations.'
        },
    )
    # debug argument for distributed training
    ignore_bias_buffers: Optional[bool] = field(
        default=False,
        metadata={
            "help": "fix for DDP issues with LM bias/mask buffers - invalid scalar type,`inplace operation. See"
            "https://github.com/huggingface/transformers/issues/22482#issuecomment-1595790992"
        },
    )
    seed: Optional[int] = field(
        default=0, metadata={"help": "Random seed that will be set at the beginning of training."}
    )


def get_dataset(
    data_dir: str = "data/rl",    
    tokenizer = None,
    max_length = 1024,
    max_prompt_length = 768,
    cache_dir: Optional[str] = None,
    num_proc=24,
) -> Dataset:
    def process_func(example):
        # Build one record in the DPO training format (prompt / chosen / rejected)
        return {
                "prompt": f"你是一个专利分类领域的专家,你会接收到如下的专利内容:专利标题:{example['title']}\n摘要:{example['abstract']}\n权利要求:{example['claim']},请输出该专利的专利分类号主IPC以及其附加IPC,按重要性排序" + "\n\nResponse:",
                "chosen": example["chosen"],
                "rejected": example["reject"],
            }
        
    eval_df = pd.read_json(os.path.join(data_dir,"eval_dataset.json"))
    train_df = pd.read_json(os.path.join(data_dir,"training_dataset.json"))
    
    valid_data = Dataset.from_pandas(eval_df)
    train_data = Dataset.from_pandas(train_df)
    
    original_columns = train_data.column_names

    train_data = train_data.map(
        process_func, 
        # fn_kwargs={"tokenizer": tokenizer, "max_length": args.seq_length,"q_length":args.q_length,"a_length":args.a_length},
        remove_columns=original_columns
    )
    
    
    valid_data = valid_data.map(
        process_func, 
        # fn_kwargs={"tokenizer": tokenizer, "max_length": args.seq_length,"q_length":args.q_length,"a_length":args.a_length},
        remove_columns=original_columns
    )
    
    return dpo_dataset(data = train_data, tokenizer = tokenizer, max_seq_length = max_length,max_prompt_length=max_prompt_length),dpo_dataset(data = valid_data, tokenizer = tokenizer, max_seq_length = max_length,max_prompt_length=max_prompt_length)


if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    script_args = parser.parse_args_into_dataclasses()[0]

    set_seed(script_args.seed)
    print("loading_data",script_args.dataset_path)
    print("output_path  ",script_args.output_dir)
    # 1. load a pretrained model
    torch_dtype = torch.float
    if script_args.model_dtype == "float16":
        torch_dtype = torch.float16
    elif script_args.model_dtype == "bfloat16":
        torch_dtype = torch.bfloat16
        
    # Set up model configuration
    bnb_config = None
    if script_args.load_in_4bit:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )

    model = AutoModelForCausalLM.from_pretrained(
        script_args.model_name_or_path,
        low_cpu_mem_usage=True,
        torch_dtype=torch_dtype,
        quantization_config=bnb_config,
        device_map="auto",
    )
    
    # model_ref = AutoModelForCausalLM.from_pretrained(
    #     script_args.ref_model_name_or_path,
    #     low_cpu_mem_usage=True,
    #     torch_dtype=torch_dtype,
    #     quantization_config=bnb_config,
    #     device_map={"": Accelerator().local_process_index},
    # )
    
    model.config.use_cache = False

    if script_args.ignore_bias_buffers:
        # torch distributed hack
        model._ddp_params_and_buffers_to_ignore = [
            name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool
        ]

    tokenizer = AutoTokenizer.from_pretrained(script_args.model_name_or_path)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token


    train_dataset,eval_dataset = get_dataset(script_args.dataset_path,tokenizer,script_args.max_length,script_args.max_prompt_length)


    # 4. initialize training arguments:
    training_args = DPOConfig(
        per_device_train_batch_size=script_args.per_device_train_batch_size,
        per_device_eval_batch_size=script_args.per_device_eval_batch_size,
        # max_steps=script_args.max_steps,
        logging_steps=script_args.logging_steps,
        # save_steps=script_args.save_steps,
        gradient_accumulation_steps=script_args.gradient_accumulation_steps,
        gradient_checkpointing=script_args.gradient_checkpointing,
        learning_rate=script_args.learning_rate,
        max_length = script_args.max_length,
        max_prompt_length = script_args.max_prompt_length,
        num_train_epochs = script_args.num_epochs,
        eval_strategy="steps",
        eval_steps=script_args.eval_steps,
        output_dir=script_args.output_dir,
        report_to=script_args.report_to,
        lr_scheduler_type=script_args.lr_scheduler_type,
        warmup_steps=script_args.warmup_steps,
        optim=script_args.optimizer_type,
        bf16=True,
        remove_unused_columns=False,
        run_name="dpo_qwen2",
        gradient_checkpointing_kwargs=dict(use_reentrant=script_args.gradient_checkpointing_use_reentrant),
        seed=script_args.seed,
    )

    peft_config = LoraConfig(
        r=script_args.lora_r,
        lora_alpha=script_args.lora_alpha,
        lora_dropout=script_args.lora_dropout,
        target_modules=[
            "q_proj",
            "v_proj",
            "k_proj",
            "out_proj",
            "fc_in",
            "fc_out",
            "wte",
        ],
        bias="none",
        task_type="CAUSAL_LM",
    )

    # 5. initialize the DPO trainer
    dpo_trainer = DPOTrainer(
        model,
        ref_model=None,
        args=training_args,
        beta=script_args.beta,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        processing_class=tokenizer,
        peft_config=peft_config,
    )

    # 6. train
    dpo_trainer.train()
    dpo_trainer.save_model(script_args.output_dir)

    # 7. save
    output_dir = os.path.join(script_args.output_dir, "final_checkpoint")
    dpo_trainer.model.save_pretrained(output_dir)

Here is the code that builds the dataset. Does the data need to be processed like this before it is passed in?

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
########################################################################
#
# Copyright (c) 2024 Sugar. All Rights Reserved
#
########################################################################

"""
    File: utils.py
    Desc: data processing utilities
    Author: sugar(@google.com)
    Date: 2024-03-27 15:19
"""
import torch
from loguru import logger
from datasets import load_dataset
from torch.utils.data import Dataset,DataLoader
from transformers import TrainingArguments, TrainerCallback

class dpo_dataset(Dataset):
    def __init__(self, data, tokenizer, max_seq_length, max_prompt_length):
        self.tokenizer = tokenizer
        self.max_seq_length = max_seq_length
        self.data_list = data
        self.max_prompt_length = max_prompt_length
        
    def __len__(self):
        return len(self.data_list)
    
    def __getitem__(self, index):
        data = self.data_list[index]

        prompt_encoded = self.tokenizer.encode(
            '<|im_start|>' + data['prompt'] + '<|im_end|>',
            add_special_tokens=False,
            truncation=True,
            max_length=self.max_prompt_length
        )
        prompt_input_ids = prompt_encoded[:self.max_prompt_length]
        prompt_attention_mask = [1] * len(prompt_input_ids)
        padding_length = self.max_prompt_length - len(prompt_input_ids)
        if padding_length > 0:
            prompt_input_ids += [self.tokenizer.pad_token_id] * padding_length
            prompt_attention_mask += [0] * padding_length

        chosen_input_ids = self.tokenizer.encode(data['chosen'], add_special_tokens=False)
        rejected_input_ids = self.tokenizer.encode(data['rejected'], add_special_tokens=False)

        chosen_input_ids = prompt_input_ids + chosen_input_ids + [self.tokenizer.pad_token_id]
        rejected_input_ids = prompt_input_ids + rejected_input_ids + [self.tokenizer.pad_token_id]

        chosen_labels = [-100] * self.max_prompt_length + chosen_input_ids[self.max_prompt_length:]
        rejected_labels = [-100] * self.max_prompt_length + rejected_input_ids[self.max_prompt_length:]

        inputs = dict(
            prompt_input_ids=prompt_input_ids,
            prompt_attention_mask=prompt_attention_mask,
            chosen_input_ids=chosen_input_ids,
            chosen_attention_mask=[1] * len(chosen_input_ids),
            chosen_labels=chosen_labels,
            rejected_input_ids=rejected_input_ids,
            rejected_attention_mask=[1] * len(rejected_input_ids),
            rejected_labels=rejected_labels,
        )
        return inputs
    
    def map(self, func, **kwargs):
        return self

# Earlier version of dpo_dataset, kept for reference. Only the class line had
# been commented out, but the indented methods below would still override the
# methods of the class above, so the whole block is now commented out.
# class dpo_dataset(Dataset):
#     def __init__(self, data, tokenizer, max_seq_length, max_prompt_length):
#         self.tokenizer = tokenizer
#         self.max_seq_length = max_seq_length
#         # load the json data (via transformers/datasets)
#         self.data_list = data
#         self.max_prompt_length = max_prompt_length
#
#     def __len__(self):
#         return len(self.data_list)
#
#     def __getitem__(self, index):
#         # take one record from data_list --> {"chosen": ..., "rejected": ..., "prompt": ...}
#         data = self.data_list[index]
#
#         # tokenize prompt, chosen and rejected; check whether truncation is needed,
#         # keep all input_ids consistent, and pad anything that is too short
#         # follow the Qwen template and append an eos token
#         prompt_input_ids = self.tokenizer.encode('<|im_start|>' + data['prompt'] + '<|im_end|>', add_special_tokens=False)
#         chosen_input_ids = self.tokenizer.encode(data['chosen'], add_special_tokens=False)
#         rejected_input_ids = self.tokenizer.encode(data['rejected'], add_special_tokens=False)
#
#         prompt_input_ids = prompt_input_ids + [self.tokenizer.pad_token_id]
#         # build the labels
#         chosen_labels = [-100] * len(prompt_input_ids) + chosen_input_ids + [self.tokenizer.pad_token_id]
#         rejected_labels = [-100] * len(prompt_input_ids) + rejected_input_ids + [self.tokenizer.pad_token_id]
#         chosen_input_ids = prompt_input_ids + chosen_input_ids + [self.tokenizer.pad_token_id]
#         rejected_input_ids = prompt_input_ids + rejected_input_ids + [self.tokenizer.pad_token_id]
#
#         assert len(chosen_labels) == len(chosen_input_ids)
#         assert len(rejected_labels) == len(rejected_input_ids)
#
#         inputs = dict(
#             prompt_input_ids=prompt_input_ids,
#             prompt_attention_mask=[1] * len(prompt_input_ids),
#             chosen_input_ids=chosen_input_ids,
#             chosen_attention_mask=[1] * len(chosen_input_ids),
#             chosen_labels=chosen_labels,
#             rejected_input_ids=rejected_input_ids,
#             rejected_attention_mask=[1] * len(rejected_input_ids),
#             rejected_labels=rejected_labels,
#         )
#         return inputs
#
#     def map(self, func, **kwargs):
#         return self
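
For completeness, a quick check (hypothetical snippet, using the same model path as in my training script) to confirm that the special tokens and pad token this class relies on actually exist in the tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/root/autodl-tmp/patentClassDPOTrain/DPO_output3/finetuend-Qwen")
for tok in ("<|im_start|>", "<|im_end|>"):
    # an unknown marker would silently break the hand-built prompt ids
    print(tok, "->", tokenizer.convert_tokens_to_ids(tok))
print("pad_token_id:", tokenizer.pad_token_id, "eos_token_id:", tokenizer.eos_token_id)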

ZengQQQ added the bug label on Dec 3, 2024

jmparejaz commented Dec 5, 2024

I am facing the same issue with ORPO and came here looking for a reason why it is happening.

Is this an error, and should we just restart the trainer?

[image attached]
