News • Quick Start • Evaluation • Citation
CodeMMLU is a comprehensive benchmark designed to evaluate the capabilities of large language models (LLMs) in coding and software knowledge. It uses a multiple-choice question answering (MCQA) format to cover a wide range of programming tasks and domains, including code generation, defect detection, and software engineering principles.
- CodeMMLU comprises over 10,000 questions curated from diverse, high-quality sources. It covers a wide spectrum of software knowledge, including general QA, code generation, defect detection, and code repair, across various domains and more than 10 programming languages.
- Precise and comprehensive: Check out our LEADERBOARD for the latest LLM rankings.
[2024-10-13] We are releasing the CodeMMLU benchmark v0.0.1 and the preprint report HERE!
Install CodeMMLU and set up its dependencies via pip:
pip install codemmlu
Generate responses for the CodeMMLU MCQ benchmark:
codemmlu --model_name <your_model_name_or_path> \
--subset <subset> \
--backend <backend> \
--output_dir <your_output_dir>
Build codemmlu from source:
git clone https://github.com/Fsoft-AI4Code/CodeMMLU.git
cd CodeMMLU
pip install -e .
Note
If you prefer the vllm backend, we highly recommend installing vllm from the official project before installing codemmlu.
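To check which backend is usable in the current environment, a small script can ask Python's packaging metadata whether `vllm` is installed. This is a minimal sketch: the helpers `installed_version` and `pick_backend` are hypothetical and not part of codemmlu, and it assumes the packages are installed under the distribution names `codemmlu` and `vllm`.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def pick_backend():
    """Prefer vllm when it is installed; otherwise fall back to hf."""
    # "vllm" and "hf" are the two backend names listed in this README.
    return "vllm" if installed_version("vllm") is not None else "hf"


print("codemmlu version:", installed_version("codemmlu"))
print("suggested backend:", pick_backend())
```

The returned string can be passed directly as the value of `--backend`.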
Generate responses to CodeMMLU questions:
codemmlu --model_name <your_model_name_or_path> \
--peft_model <your_peft_model_name_or_path> \
--subset all \
--batch_size 16 \
--backend [vllm|hf] \
--max_new_tokens 1024 \
--temperature 0.0 \
--output_dir <your_output_dir> \
--instruction_prefix <special_prefix> \
--assistant_prefix <special_prefix> \
--cache_dir <your_cache_dir>
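When the same run needs to be repeated across models or subsets, the command above can be assembled programmatically. The sketch below only builds the argument list from flags that appear in this README. The helper `build_codemmlu_command`, its defaults, and the example model name are illustrative assumptions, not part of the codemmlu package.

```python
import shlex

# Backend names as listed in this README.
ALLOWED_BACKENDS = ("hf", "vllm")


def build_codemmlu_command(model_name, output_dir, subset="all", backend="hf",
                           batch_size=16, max_new_tokens=1024, temperature=0.0,
                           peft_model=None, cache_dir=None):
    """Return a codemmlu invocation as an argument list (hypothetical helper)."""
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend!r}")
    cmd = [
        "codemmlu",
        "--model_name", model_name,
        "--subset", subset,
        "--batch_size", str(batch_size),
        "--backend", backend,
        "--max_new_tokens", str(max_new_tokens),
        "--temperature", str(temperature),
        "--output_dir", output_dir,
    ]
    # Optional flags are appended only when a value is supplied.
    if peft_model is not None:
        cmd += ["--peft_model", peft_model]
    if cache_dir is not None:
        cmd += ["--cache_dir", cache_dir]
    return cmd


# "my-org/my-model" is a placeholder, not a real checkpoint.
cmd = build_codemmlu_command("my-org/my-model", "./output", backend="vllm")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once codemmlu is installed. Passing a list rather than a single string avoids shell-quoting problems with paths that contain spaces.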
API usage:
codemmlu [-h] [-V] [--subset SUBSET] [--batch_size BATCH_SIZE] [--instruction_prefix INSTRUCTION_PREFIX]
[--assistant_prefix ASSISTANT_PREFIX] [--output_dir OUTPUT_DIR] [--model_name MODEL_NAME]
[--peft_model PEFT_MODEL] [--backend BACKEND] [--max_new_tokens MAX_NEW_TOKENS]
[--temperature TEMPERATURE] [--prompt_mode PROMPT_MODE] [--cache_dir CACHE_DIR] [--trust_remote_code]
==================== CodeMMLU ====================
optional arguments:
-h, --help show this help message and exit
-V, --version Get version
--subset SUBSET Select evaluate subset
--batch_size BATCH_SIZE
--instruction_prefix INSTRUCTION_PREFIX
--assistant_prefix ASSISTANT_PREFIX
--output_dir OUTPUT_DIR
Save generation and result path
--model_name MODEL_NAME
Local path or Huggingface Hub link to load model
--peft_model PEFT_MODEL
Lora config
--backend BACKEND LLM generation backend (default: hf)
--max_new_tokens MAX_NEW_TOKENS
Number of max new tokens
--temperature TEMPERATURE
--prompt_mode PROMPT_MODE
Prompt available: zeroshot, fewshot, cot_zs, cot_fs
--cache_dir CACHE_DIR
Cache for save model download checkpoint and dataset
--trust_remote_code
List of supported backends:
| Backend | DecoderModel | LoRA |
|---|---|---|
| Transformers (hf) | ✅ | ✅ |
| Vllm (vllm) | ✅ | ✅ |
To evaluate your model and submit your results to the leaderboard, please follow the instructions in data/README.md.
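For intuition about how a multiple-choice benchmark is typically scored, the sketch below extracts an answer letter from each model response and computes accuracy against gold labels. It is an illustration under stated assumptions, not CodeMMLU's actual scoring code: the functions `extract_choice` and `accuracy`, the four-option A-D label set, and the answer patterns are all hypothetical. Official results should follow data/README.md.

```python
import re

# Assumption: options are labelled A-D and responses look like
# "The answer is B" or "Answer: (C)". Real model outputs vary more widely.
_ANSWER_PATTERN = re.compile(
    r"answer\s*(?:is|:)?\s*\(?([A-D])(?![A-Za-z])", re.IGNORECASE
)


def extract_choice(response):
    """Return the chosen option letter in upper case, or None if none is found."""
    text = response.strip()
    # A bare single letter such as "d" is accepted as the answer.
    if len(text) == 1 and text.upper() in "ABCD":
        return text.upper()
    match = _ANSWER_PATTERN.search(text)
    return match.group(1).upper() if match else None


def accuracy(responses, gold):
    """Fraction of responses whose extracted letter equals the gold label."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


responses = ["The answer is B.", "Answer: (C)", "I am not sure", "d"]
gold = ["B", "C", "A", "D"]
print(accuracy(responses, gold))  # 3 of 4 responses match -> 0.75
```

Responses with no recognizable letter count as incorrect, which is a common convention but still a design choice.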
If you find this repository useful, please consider citing our paper:
@article{nguyen2024codemmlu,
title={CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities},
author={Nguyen, Dung Manh and Phan, Thang Chau and Le, Nam Hai and Doan, Thong T. and Nguyen, Nam V. and Pham, Quang and Bui, Nghi D. Q.},
journal={arXiv preprint},
year={2024}
}