This is the official repository for Direct Judgement Preference Optimization. This repo currently contains code to run evaluation on 12 benchmarks featured in the paper. To run evaluation on RewardBench, please see the RewardBench repo.
The contents of this repository are intended for research purposes only.
Users need to make their own assessment regarding any obligations or responsibilities under the corresponding licenses or terms and conditions pertaining to the original datasets and data.
This code was tested with Python 3.10.8 and PyTorch 2.3.0.
conda create --name judge_eval python=3.10.8
conda activate judge_eval
pip install torch==2.3.0
pip install -r requirements.txt
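To confirm the environment is set up correctly, a quick sanity check like the following may help (this snippet is illustrative and not part of the repo):

```python
# Quick environment sanity check (illustrative; not part of this repo).
import torch

print(f"PyTorch version: {torch.__version__}")         # expected: 2.3.0
print(f"CUDA available:  {torch.cuda.is_available()}")  # should be True on a GPU machine
print(f"Visible GPUs:    {torch.cuda.device_count()}")
```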
See `run_eval.sh` for examples. Alternatively, you can run:
export HF_TOKEN=<your_hf_token>
python -u main_eval.py \
--model [model_name (Huggingface or local)] \
--num_gpus [num_gpus] \
--eval_dataset [dataset name] \
--output_path [output_path] \
--temperature [sampling parameter temperature] \
--top_p [sampling parameter top_p]
Here, you can specify which dataset to run evaluation on. You can also pass `all` to run all datasets, or `all_pair`, `all_point`, or `all_class` to run all pairwise, pointwise (single rating), and classification datasets, respectively. Some evaluation datasets are gated on HuggingFace and/or hosted on Google Drive; please request access from the original dataset authors accordingly.
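For example, a run over all pairwise benchmarks might look like the following; the model id, output path, and sampling values below are placeholders, not repo defaults:

```bash
# Example: evaluate a judge model on all pairwise benchmarks.
# Model id, output path, and sampling values are placeholders.
export HF_TOKEN=<your_hf_token>
python -u main_eval.py \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --num_gpus 8 \
    --eval_dataset all_pair \
    --output_path ./eval_outputs \
    --temperature 0.0 \
    --top_p 1.0
```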
After evaluation is finished, you can aggregate your results by running `aggregate_eval.py`:
python aggregate_eval.py \
--eval_path [output_path_from_above] \
--type [all, pair, point, point_no_class] \
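As a hypothetical example, aggregating every result written under `./eval_outputs` (a placeholder path) would look like:

```bash
python aggregate_eval.py \
    --eval_path ./eval_outputs \
    --type all
```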
To include RewardBench results in the aggregate, place your RewardBench scores at `[output_path]/rewardbench/scores.json`. The json file should contain a key titled `leaderboard` whose value is a dict with the score of each section, plus a key `overall_score` holding the overall RewardBench score as a value between 0 and 1:

"leaderboard": {
    "Chat": chat_score [0,1],
    "Chat Hard": chat_hard_score [0,1],
    "Safety": safety_score [0,1],
    "Reasoning": reasoning_score [0,1],
    "overall_score": overall_score [0,1],
}

Alternatively, you can recompute the overall average yourself from the average this script reports without RewardBench: (6 * script_avg_no_rb + rb_score) / 7.
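If you go the manual route, a minimal sketch of that computation is shown below; the file path and the example value of `script_avg_no_rb` are assumptions, and only the (6*x + y)/7 weighting comes from the note above:

```python
# Manual recomputation of the overall average including RewardBench (sketch).
# Assumes aggregate_eval.py was run WITHOUT RewardBench and that a RewardBench
# scores.json exists as described above; path and example value are placeholders.
import json

script_avg_no_rb = 0.72  # average reported by aggregate_eval.py without RewardBench (example)

with open("eval_outputs/rewardbench/scores.json") as f:
    rb_score = json.load(f)["leaderboard"]["overall_score"]  # value in [0, 1]

overall = (6 * script_avg_no_rb + rb_score) / 7
print(f"Overall average including RewardBench: {overall:.4f}")
```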
The scores reported in our paper were obtained by running evaluation with 8xA100 40GB GPUs.
@misc{wang2024directjudgementpreferenceoptimization,
title={Direct Judgement Preference Optimization},
author={Peifeng Wang and Austin Xu and Yilun Zhou and Caiming Xiong and Shafiq Joty},
year={2024},
eprint={2409.14664},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.14664},
}