# Open Freeform QA Benchmark


## Dependency

## How to run?

The prompt for each case is in `cases/`. You can read all `*.yaml` case files and parse the `prompt_path` field to get the path to each prompt.
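For example, here is a minimal sketch of collecting the prompts with PyYAML. It assumes `prompt_path` is given relative to the working directory; adjust if the case files store it differently:

```python
import glob
import yaml  # PyYAML

# Read every case file and load the prompt it points to.
# Assumes prompt_path is relative to the current working directory.
prompts = {}
for case_file in sorted(glob.glob("cases/*.yaml")):
    with open(case_file) as f:
        case = yaml.safe_load(f)
    with open(case["prompt_path"]) as f:
        prompts[case_file] = f.read()

print(f"Loaded {len(prompts)} prompts")
```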

Then store your generated responses in a folder under `responses/`. In that folder, `params.yaml` is the index that maps each case file name to its response text files.
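The exact schema of `params.yaml` is not spelled out here, so the following is only a rough sketch. It assumes the index simply maps each case file name to a list of response text files; verify the real layout against an existing folder such as `responses/gpt-4_0.2_0.9_10` before relying on it:

```python
import os
import yaml

# Stand-in for actual model outputs: {case file name: [sampled responses]}.
generated = {
    "case_001.yaml": ["response sample 0", "response sample 1"],
}

out_dir = "responses/my-model_0.2_0.9_10"  # hypothetical folder name
os.makedirs(out_dir, exist_ok=True)

index = {}
for case_name, samples in generated.items():
    files = []
    for i, text in enumerate(samples):
        fname = f"{os.path.splitext(case_name)[0]}_{i}.txt"
        with open(os.path.join(out_dir, fname), "w") as f:
            f.write(text)
        files.append(fname)
    index[case_name] = files

# Hypothetical index layout; the real params.yaml format may differ.
with open(os.path.join(out_dir, "params.yaml"), "w") as f:
    yaml.safe_dump(index, f)
```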

Then run `python grader_main.py suite_v1.yaml responses/[your dir]`.

Results will be printed and stored in `result/`.

For example:

- For GPT-4:

  `python grader_main.py suite_v1.yaml responses/gpt-4_0.2_0.9_10`

  Best @ 10: 28.792 / 40

- For GPT-3.5-Turbo:

  `python grader_main.py suite_v1.yaml responses/gpt-3.5-turbo_0.2_0.9_10`

  Best @ 10: 22.906 / 40

Optionally, you can then analyze the results and compute statistics with `result_stat.ipynb`.
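If you prefer a quick script over the notebook, a short pandas sketch like the one below can summarize the stored results. It assumes the grader writes per-case scores into files under `result/` that pandas can read (JSON is assumed here); adjust the glob pattern and reader to the actual output format:

```python
import glob
import pandas as pd

# Load whatever result files the grader produced; the JSON format is an
# assumption -- check result/ for what grader_main.py actually writes.
frames = [pd.read_json(path) for path in glob.glob("result/*.json")]
if frames:
    df = pd.concat(frames, ignore_index=True)
    print(df.describe())  # basic statistics over the numeric columns
else:
    print("No result files matched; check result/ for the actual format.")
```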


This repo stores a benchmark for code models' freeform QA. The evaluation metric focuses on answer correctness; it is instance-specific to encode domain-knowledge requirements, and it combines both unit-test-based and natural-language-based evaluation. The benchmark is generated from randomly sampled popular questions from StackOverflow, covering a broad range of domains and languages.

A detailed schema of suites, instances, and evaluation metrics will come later.

At the current stage, the questions come solely from uniform sampling of StackOverflow to ensure minimal bias and maximal coverage. The initial suite includes 40 questions. In later stages, we plan to pursue two directions: (1) recruit human annotators to further enlarge the suite; (2) for some specific domains, automatically evolve existing questions to obtain more.