
neurips2023_distshift

The goal of this work is to study reward model performance under distribution shift. This code accompanies the paper "A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift".

Word Perturbations

We artificially induce distribution shift by perturbing each word with some probability, where a perturbation is an insertion, a deletion, or a replacement with a random word. A higher probability means a larger distribution shift: the perturbations make the prompts and responses more nonsensical, and therefore less similar to the prompts and responses in the training set.
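
A minimal sketch of such a perturbation function (hypothetical code; the repository's actual implementation may differ, e.g. in how the random vocabulary is chosen):

```python
import random

def perturb_words(text, p, vocab):
    """Insert, delete, or replace each word with probability p.

    Hypothetical re-implementation of the scheme described above;
    `vocab` is an assumed source of random replacement words.
    """
    out = []
    for word in text.split():
        if random.random() < p:
            op = random.choice(["insert", "delete", "replace"])
            if op == "insert":
                out.extend([word, random.choice(vocab)])
            elif op == "replace":
                out.append(random.choice(vocab))
            # "delete": drop the word entirely
        else:
            out.append(word)
    return " ".join(out)

# Higher p -> more nonsensical text -> larger distribution shift.
vocab = ["apple", "galaxy", "verbose", "seven"]
print(perturb_words("the quick brown fox jumps over the lazy dog", p=0.3, vocab=vocab))
```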

Reward Scores

The raw reward model scores for each model and dataset studied are in model_scores/. Folders with scores from the word-perturbation runs are titled word_perturb (example: model_scores/deberta_v3_large/open_ai_summarize_from_feedback/word_perturb), where each subfolder indicates the size of the subset used. Each folder may contain multiple trials.
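
For example, the trials for one configuration can be enumerated like this (a sketch only; the subset-size and trial names below follow the layout described above, so adjust them to the actual contents of model_scores/):

```python
from pathlib import Path

root = Path("model_scores/deberta_v3_large/open_ai_summarize_from_feedback/word_perturb")
for subset_dir in sorted(root.iterdir()):       # one folder per subset size
    for trial in sorted(subset_dir.iterdir()):  # one entry per trial
        print(subset_dir.name, trial.name)
```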

Translations

Here we study the distribution shift induced by different languages: we translate datasets from English to another language (and then back to English).

Data

For each dataset, we translated the prompt and each response, and then translated them back to English.
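
The README does not name the translation system, but the round trip can be sketched with off-the-shelf MarianMT checkpoints (Helsinki-NLP/opus-mt-en-fr and opus-mt-fr-en are illustrative choices, not necessarily the system used here):

```python
from transformers import MarianMTModel, MarianTokenizer

def round_trip(texts,
               src_to_tgt="Helsinki-NLP/opus-mt-en-fr",
               tgt_to_src="Helsinki-NLP/opus-mt-fr-en"):
    """Translate English -> target language -> English (back-translation)."""
    for name in (src_to_tgt, tgt_to_src):
        tok = MarianTokenizer.from_pretrained(name)
        model = MarianMTModel.from_pretrained(name)
        batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
        texts = tok.batch_decode(model.generate(**batch), skip_special_tokens=True)
    return texts

print(round_trip(["The reward model assigns a score to each response."]))
```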

We translated the validation set (about 86k examples). The original dataset can be found here. All translations can be found here.

We translated the test set, restricted to samples where the score ratio is >= 2 (a sketch of this filter follows). The original dataset can be found here. All translations can be found here.
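
As an illustration of the score-ratio criterion, assuming each sample carries its two responses' scores in fields score_A and score_B (hypothetical names; the actual dataset schema may differ):

```python
def keep(example):
    """Keep samples whose higher score is at least 2x the lower score."""
    hi = max(example["score_A"], example["score_B"])
    lo = min(example["score_A"], example["score_B"])
    return lo > 0 and hi / lo >= 2

test_set = [{"score_A": 10, "score_B": 3}, {"score_A": 5, "score_B": 4}]
print([ex for ex in test_set if keep(ex)])  # keeps only the first sample
```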

Reward Scores

For each of the above datasets, with translated prompts and responses as described above, we used OpenAssistant's DeBERTa reward model to score each prompt/response pair. These scores can all be found in model_scores/. Folders with scores from the translation runs are titled translation (example: model_scores/deberta_v3_large/open_ai_summarize_from_feedback/translation). We have the original scores for each dataset. Additionally, for each language, we have four scores: both the prompt and response in the translated language, just the prompt translated, just the response translated, and both the prompt and response translated back to English.
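
A minimal scoring sketch, assuming the model is the public OpenAssistant/reward-model-deberta-v3-large-v2 checkpoint on Hugging Face (the README says only "OpenAssistant's Deberta Reward model", so the exact checkpoint is an assumption):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

prompt = "Explain nuclear fusion like I am five."
response = "Fusion is when tiny particles squeeze together and release energy."
inputs = tokenizer(prompt, response, return_tensors="pt")
score = model(**inputs).logits[0].item()  # higher score = better response
print(score)
```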
