This is an example of using Eurus-RM-7B (Yuan et al., 2024) to perform best-of-N sampling with Llama-3 8B as the base model.
Eurus-RM-7B is trained on a mixture of UltraInteract, UltraFeedback, and UltraSafety, with a reward modeling objective specifically designed for reasoning that directly increases the reward of chosen actions and decreases the reward of rejected ones.
Eurus-RM-7B stands out as the best 7B reward model overall and achieves performance similar to or better than much larger baselines, even outperforming GPT-4 on certain tasks.
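For context, the reward model follows the standard Hugging Face remote-code interface: calling it on a tokenized prompt–response string returns a scalar reward, so candidate responses can be ranked directly. Below is a minimal sketch of scoring a single pair; the hub id `openbmb/Eurus-RM-7b` and the Mistral-style `[INST] ... [/INST]` template follow the public model card, but treat both as assumptions to verify against it.

```python
import torch
from transformers import AutoModel, AutoTokenizer

rm_path = "openbmb/Eurus-RM-7b"  # assumed Hugging Face hub id; check the model card
tokenizer = AutoTokenizer.from_pretrained(rm_path)
model = AutoModel.from_pretrained(
    rm_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).to("cuda")

# Assumed input format per the model card: "[INST] prompt [/INST] response".
# The forward pass returns a single scalar reward for the pair.
text = "[INST] What is 7 * 8? [/INST] 7 * 8 = 56. The answer is 56."
inputs = tokenizer(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    reward = model(**inputs).item()
print(f"reward = {reward:.3f}")  # higher means the RM prefers this response
```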
Prerequisites:
- Download the Llama-3 8B model.
- Two GPUs with at least 24 GB of memory each.
Script:

```bash
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node 1 \
    examples/Eurus/inference.py --model_dir $LLAMA3_CKPTS --best_of_n 10
```
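Under the hood, best-of-N amounts to sampling N reasoning chains from the base model and keeping the one the reward model scores highest. The sketch below shows that loop end to end; it is not a copy of `examples/Eurus/inference.py`, and the hub ids, sampling parameters, and `[INST]` template are assumptions based on the public model cards.

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed hub id for Llama-3 8B
RM = "openbmb/Eurus-RM-7b"                    # assumed hub id for Eurus-RM-7B

# One model per GPU, matching the 2 x 24 GB prerequisite above.
base_tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16).to("cuda:0")
rm_tok = AutoTokenizer.from_pretrained(RM)
rm = AutoModel.from_pretrained(RM, trust_remote_code=True, torch_dtype=torch.bfloat16).to("cuda:1")

def best_of_n(question: str, n: int = 10, max_new_tokens: int = 512) -> str:
    # 1) Sample n reasoning chains from the base model.
    prompt_ids = base_tok.apply_chat_template(
        [{"role": "user", "content": question}],
        add_generation_prompt=True, return_tensors="pt",
    ).to("cuda:0")
    outs = base.generate(
        prompt_ids, do_sample=True, temperature=0.8, top_p=0.95,
        num_return_sequences=n, max_new_tokens=max_new_tokens,
        pad_token_id=base_tok.eos_token_id,
    )
    chains = [base_tok.decode(o[prompt_ids.shape[1]:], skip_special_tokens=True) for o in outs]

    # 2) Score each chain with the reward model (one scalar per candidate).
    rewards = []
    for chain in chains:
        enc = rm_tok(f"[INST] {question} [/INST] {chain}", return_tensors="pt").to("cuda:1")
        with torch.no_grad():
            rewards.append(rm(**enc).item())

    # 3) Return the chain the reward model scores highest.
    return chains[max(range(n), key=lambda i: rewards[i])]
```

With `--best_of_n 10`, this corresponds to calling `best_of_n(question, n=10)` for each GSM8K question.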
We measured the accuracy of using Eurus-RM-7B to select the best of 10 reasoning chains generated by Llama-3 8B on GSM8K:
| Method | Accuracy |
|---|---|
| CoT (Llama-3 8B) | 0.487 |
| CoT (Llama-3 8B) + Best-of-10 (Eurus-RM-7B) | 0.726 |
Citation:

```bibtex
@article{yuan2024advancing,
  title={Advancing LLM Reasoning Generalists with Preference Trees},
  author={Yuan, Lifan and Cui, Ganqu and Wang, Hanbin and Ding, Ning and Wang, Xingyao and Deng, Jia and Shan, Boji and Chen, Huimin and Xie, Ruobing and Lin, Yankai and others},
  journal={arXiv preprint arXiv:2404.02078},
  year={2024}
}
```