diff --git a/docs/proposals/scenarios/llm-benchmarks/images/changed_part.png b/docs/proposals/scenarios/llm-benchmarks/images/changed_part.png
new file mode 100644
index 00000000..91cdd194
Binary files /dev/null and b/docs/proposals/scenarios/llm-benchmarks/images/changed_part.png differ
diff --git a/docs/proposals/scenarios/llm-benchmarks/images/changed_sturcture.png b/docs/proposals/scenarios/llm-benchmarks/images/changed_sturcture.png
new file mode 100644
index 00000000..a999769f
Binary files /dev/null and b/docs/proposals/scenarios/llm-benchmarks/images/changed_sturcture.png differ
diff --git a/docs/proposals/scenarios/llm-benchmarks/images/opencompass.png b/docs/proposals/scenarios/llm-benchmarks/images/opencompass.png
new file mode 100644
index 00000000..393bac4c
Binary files /dev/null and b/docs/proposals/scenarios/llm-benchmarks/images/opencompass.png differ
diff --git a/docs/proposals/scenarios/llm-benchmarks/llm-benchmarks-zh.md b/docs/proposals/scenarios/llm-benchmarks/llm-benchmarks-zh.md
index 57b31349..eb951b5a 100644
--- a/docs/proposals/scenarios/llm-benchmarks/llm-benchmarks-zh.md
+++ b/docs/proposals/scenarios/llm-benchmarks/llm-benchmarks-zh.md
@@ -131,6 +131,14 @@ simple_qa is a simple QA task I designed; part of the test set
 
 ![](images/data_process_change.png)
 
+![](images/changed_sturcture.png)
+
+The part modified in core is `core/testenvmanager/dataset`:
+
+![](images/changed_part.png)
+
+Note that this design also remains compatible with the old index-style data: you only need to rename the old `train_url` and `test_url` fields to `train_index` and `test_index`.
+
 In previous projects, we needed to configure the paths of the `train_url` and `test_url` index files in the `testenv.yaml` file. The index files would contain file paths for (input x, expected output y) pairs, and this design has some limitations.
 
 Previous Ianvs projects seem to focus mainly on CV, with apparently no NLP examples, so for dataset reading the usual approach is to write an `index.txt` containing (data, annotation) pairs, alongside one folder for data and one for annotations, where each file is one data item. One image per data item makes sense in CV, but in NLP you cannot have each txt file hold only one data item; the usual practice needs no index file at all and writes both data and labels directly into a single data.json/jsonl file, for example
@@ -409,7 +417,7 @@ BenchMark-related information and data all need to be stored separately to stay stable
 }
 ```
 
-If other prompt information is needed, it can also be added.
+The data here is extensible, including keys and prompts; any other information you need can also be added.
 
 As for whether to use ZeroShot/OneShot/FewShot, each is done by adding to the chat message history, and this part can be implemented by each model itself.
diff --git a/docs/proposals/scenarios/llm-benchmarks/llm-benchmarks.md b/docs/proposals/scenarios/llm-benchmarks/llm-benchmarks.md
index b8c6173b..1d98f438 100644
--- a/docs/proposals/scenarios/llm-benchmarks/llm-benchmarks.md
+++ b/docs/proposals/scenarios/llm-benchmarks/llm-benchmarks.md
@@ -155,6 +155,14 @@ Additionally, this is different from the usual method of using an `index.txt` in
 
 ![](images/data_process_change.png)
 
+![](images/changed_sturcture.png)
+
+I modified `core/testenvmanager/dataset` in core:
+
+![](images/changed_part.png)
+
+Note that this design is also compatible with the older index-based data reading: you only need to rename `train_url` and `test_url` to `train_index` and `test_index`.
+
 In previous projects, we needed to configure the paths of the `train_url` and `test_url` index files in the `testenv.yaml` file. The index files would contain file paths for (input x, expected output y) pairs, and this design has some limitations.
 
 The previous Ianvs projects seem to be focused on computer vision (CV), and there do not appear to be examples for natural language processing (NLP), so in terms of dataset reading, the common approach is to write an `index.txt` file that contains (data, annotation) pairs. There would also be a folder for data, a folder for annotations, and each file represents a single piece of data. An image corresponds to a single data point, which is understandable in the CV field. However, if you switch to the NLP field, you can't have a single text file contain just one piece of data. The common practice is to not even need an index file, but to write both data and labels directly into a data.json or jsonl file. For example:
@@ -439,6 +447,8 @@ Provide a prompt template for model inference. For example:
 }
 ```
 
+The data here is extensible: you can add more prompts or keys as needed.
+
 If there is a need for additional prompt information, it can also be incorporated.
 
 As for whether to use ZeroShot/OneShot/FewShot, it essentially involves adding to the chat message history, and this part can be implemented by different models themselves.
diff --git a/docs/proposals/scenarios/llm-benchmarks/opencompass-tutorial.md b/docs/proposals/scenarios/llm-benchmarks/opencompass-tutorial.md
new file mode 100644
index 00000000..f6a096be
--- /dev/null
+++ b/docs/proposals/scenarios/llm-benchmarks/opencompass-tutorial.md
@@ -0,0 +1,75 @@
+# OpenCompass Tutorial
+
+GitHub repo:
+
+https://github.com/open-compass/opencompass/
+
+## Introduction
+
+![](./images/opencompass.png)
+
+OpenCompass is a one-stop platform for large model evaluation, aiming to provide a fair, open, and reproducible benchmark for large model evaluation. Its main features include:
+
+- Comprehensive support for models and datasets: Pre-support for 20+ HuggingFace and API models, a model evaluation scheme of 70+ datasets with about 400,000 questions, comprehensively evaluating the capabilities of the models in five dimensions.
+
+- Efficient distributed evaluation: A one-line command implements task division and distributed evaluation, completing the full evaluation of billion-scale models in just a few hours.
+
+- Diversified evaluation paradigms: Support for zero-shot, few-shot, and chain-of-thought evaluations, combined with standard or dialogue-type prompt templates, to easily elicit the maximum performance of various models.
+
+- Modular design with high extensibility: Want to add new models or datasets, customize an advanced task division strategy, or even support a new cluster management system? Everything about OpenCompass can be easily extended!
+
+- Experiment management and reporting mechanism: Use config files to fully record each experiment, and support real-time reporting of results.
+
+In a nutshell, OpenCompass supports evaluating most mainstream large models on mainstream benchmarks, and makes it convenient to configure and run evaluations of multiple models on multiple datasets with a single command.
+
+## QuickStart
+
+An OpenCompass evaluation is driven by a configuration file, which contains a model section and a dataset (i.e., benchmark) section. The example below illustrates this.
+
+[`configs/eval_chat_demo.py`](https://github.com/open-compass/opencompass/blob/main/configs/eval_chat_demo.py) contains:
+
+```python
+from mmengine.config import read_base
+
+with read_base():
+    from .datasets.demo.demo_gsm8k_chat_gen import gsm8k_datasets
+    from .datasets.demo.demo_math_chat_gen import math_datasets
+    from .models.qwen.hf_qwen2_1_5b_instruct import models as hf_qwen2_1_5b_instruct_models
+    from .models.hf_internlm.hf_internlm2_chat_1_8b import models as hf_internlm2_chat_1_8b_models
+
+datasets = gsm8k_datasets + math_datasets
+models = hf_qwen2_1_5b_instruct_models + hf_internlm2_chat_1_8b_models
+```
+
+This means the benchmarks are `gsm8k_datasets` and `math_datasets`, and the models are `hf_qwen2_1_5b_instruct_models` and `hf_internlm2_chat_1_8b_models`.
+
+For the detailed configurations, see `configs/datasets` and `configs/models`.
+
+For example, in `configs/models/qwen/hf_qwen2_1_5b_instruct.py`:
+
+```python
+from opencompass.models import HuggingFacewithChatTemplate
+
+models = [
+    dict(
+        type=HuggingFacewithChatTemplate,
+        abbr='qwen2-1.5b-instruct-hf',
+        path='Qwen/Qwen2-1.5B-Instruct',
+        max_out_len=1024,
+        batch_size=8,
+        run_cfg=dict(num_gpus=1),
+    )
+]
+```
+
+It specifies the model abbreviation (`abbr`), the model path, `max_out_len`, the inference batch size, and the number of GPUs (`num_gpus`).
+
+You can modify the config as needed.
+
+You can then run OpenCompass with a single command: `python run.py configs/eval_demo.py -w outputs/demo --debug`
+
+This runs the `configs/eval_demo.py` config file and writes the outputs to `outputs/demo`.
+
+To change the benchmarks or the models, edit the config file.
+
+For more detailed documentation, see the [official docs](https://opencompass.readthedocs.io/).
\ No newline at end of file
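
Review note on the QuickStart in `opencompass-tutorial.md`: the config it shows is ordinary Python in which `models` and `datasets` are lists of dicts joined with `+`. The sketch below mimics that pattern in plain Python so it runs without OpenCompass installed. It is only an illustration under stated assumptions: the `type` string stands in for the real class, and `make_model`, `planned_runs`, the `my-org/my-model` entry, and the dataset `abbr` values are hypothetical names that do not come from the tutorial or from OpenCompass.

```python
# Plain-Python sketch of the config pattern shown in the tutorial.
# Assumption: strings replace the real OpenCompass classes so this runs standalone.


def make_model(abbr, path, max_out_len=1024, batch_size=8, num_gpus=1):
    """Hypothetical helper: build one entry shaped like the tutorial's model dict."""
    return dict(
        type="HuggingFacewithChatTemplate",  # placeholder; the real config passes the class itself
        abbr=abbr,
        path=path,
        max_out_len=max_out_len,
        batch_size=batch_size,
        run_cfg=dict(num_gpus=num_gpus),
    )


# The first entry copies the values from the tutorial; the second is made up.
qwen_models = [make_model("qwen2-1.5b-instruct-hf", "Qwen/Qwen2-1.5B-Instruct")]
other_models = [make_model("my-model-hf", "my-org/my-model", batch_size=4)]

# Hypothetical stand-ins for gsm8k_datasets and math_datasets.
gsm8k_datasets = [dict(abbr="demo_gsm8k")]
math_datasets = [dict(abbr="demo_math")]

# Same composition as in the tutorial's config: list concatenation.
models = qwen_models + other_models
datasets = gsm8k_datasets + math_datasets


def planned_runs(models, datasets):
    """List (model, dataset) pairs, assuming every model is run on every dataset."""
    return [(m["abbr"], d["abbr"]) for m in models for d in datasets]


for model_abbr, dataset_abbr in planned_runs(models, datasets):
    print(f"{model_abbr} -> {dataset_abbr}")
```

Under these assumptions, swapping benchmarks or models amounts to changing which lists are concatenated, which is what the tutorial means by editing the config.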