Merge pull request #159 from safe-b/dev
OSPP: Smart Coding benchmark suite: built on KubeEdge-Ianvs
Showing 14 changed files with 1,266 additions and 594 deletions.
114 changes: 114 additions & 0 deletions
114
examples/smart_coding/smart_coding_learning_bench/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,114 @@ | ||
# Smart_coding BenchMark | ||
|
||
## Introduction | ||
|
||
This is the work for Domain-specific Large Model Benchmark: | ||
|
||
Build a test suite for code large models, including test datasets, evaluation metrics, test environments, and usage guidelines. | ||
|
||
The benchmark consists of two parts: comment data and issue data. | ||
|
||
## Design | ||
|
||
### Metadata Format | ||
|
||
| Name | Field Name | Option | Description | | ||
| --- | --- | --- | --- | | ||
| Data Name | dataset | Required | Name of the dataset | | ||
| Data Description | description | Optional | Dataset description, such as usage scope, sample size, etc. | | ||
| First-level Dimension | level_1_dim | Required | Should fill in "Single Modal" or "Multi-Modal" | | ||
| Second-level Dimension | level_2_dim | Required | For "Single Modal", fill in "Text", "Image", or "Audio". For "Multi-Modal", fill in "Text-Image", "Text-Audio", "Image-Audio", or "Text-Image-Audio" | | ||
| Third-level Dimension | level_3_dim | Optional | Should be filled if all samples in the dataset have the same third-level dimension. If filled, content should be based on the standards shown in the normative reference document | | ||
| Fourth-level Dimension | level_4_dim | Optional | Should be filled if all samples in the dataset have the same fourth-level dimension. If filled, content should be based on the standards shown in the normative reference document | | ||
|
||
metadata example: | ||
|
||
```json | ||
{ | ||
"dataset": "Code_comment BenchMark", | ||
"description": "xxx", | ||
"level_1_dim": "single-modal", | ||
"level_2_dim": "text", | ||
"level_3_dim": "Q&A", | ||
"level_4_dim": "code_comment" | ||
} | ||
``` | ||
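A minimal sketch of how a `metadata.json` record could be checked against the table above. The helper `check_metadata` and the two constants are hypothetical names used for illustration, not part of Ianvs; only the field names and their Required/Optional status come from the table.

```python
import json

# Field names and Required/Optional status are taken from the Metadata Format table.
REQUIRED_FIELDS = ("dataset", "level_1_dim", "level_2_dim")
OPTIONAL_FIELDS = ("description", "level_3_dim", "level_4_dim")


def check_metadata(metadata: dict) -> list:
    """Return a list of problems; an empty list means the record passes this check."""
    problems = []
    for field in REQUIRED_FIELDS:
        # A required field must be present and non-empty.
        if not metadata.get(field):
            problems.append(f"missing required field: {field}")
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    for field in metadata:
        if field not in known:
            problems.append(f"unknown field: {field}")
    return problems


# The metadata example from this README.
example = json.loads("""
{
  "dataset": "Code_comment BenchMark",
  "description": "xxx",
  "level_1_dim": "single-modal",
  "level_2_dim": "text",
  "level_3_dim": "Q&A",
  "level_4_dim": "code_comment"
}
""")
print(check_metadata(example))
```

The sketch only checks which keys are present. It does not enforce the allowed values for `level_1_dim` and `level_2_dim`, since the table and the example spell them differently ("Single Modal" versus "single-modal").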
|
||
### Data format: | ||
|
||
| name |Option|information| | ||
|--------------|---|---| | ||
| prompt |Optional|the background of the LLM testing| | ||
| query |Required|the testing question| | ||
| response |Required|the answer of the question| | ||
| explanation |Optional|the explanation of the answer| | ||
| judge_prompt |Optional|the prompt of the judge model| | ||
| level_1_dim |Optional|single-modal or multi-modal| | ||
| level_2_dim |Optional|single-modal: text, image, video; multi-modal: text-image, text-video, text-image-video| | ||
| level_3_dim |Required|details| | ||
| level_4_dim |Required|details| | ||
|
||
data example: | ||
|
||
```json | ||
{ | ||
"prompt": "Please think step by step and answer the question.", | ||
"query": "Question:Here is a code function \"result = self.__custom_confs_rx.search(variable)\". Please comment this code or function.", | ||
"response": "Use regular expressions to match variable names to determine whether they match a specific configuration item format.", | ||
"judge_prompt": "xxx", | ||
"level_1_dim": "single-modal", | ||
"level_2_dim": "text", | ||
"level_3_dim": "knowledge Q&A", | ||
"level_4_dim": "code_comment" | ||
} | ||
``` | ||
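As an illustration of the data format above, the following sketch parses JSONL text (one JSON object per line) and verifies that each record carries the fields marked Required in the table. The function name `load_records` is hypothetical and this is not how Ianvs itself loads the dataset; it is only a pre-flight check one might run on `data.jsonl`.

```python
import json

# Fields marked "Required" in the Data format table.
REQUIRED_KEYS = ("query", "response", "level_3_dim", "level_4_dim")


def load_records(jsonl_text: str) -> list:
    """Parse JSONL text and raise ValueError if a record lacks a required key."""
    records = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing {missing}")
        records.append(record)
    return records


# A shortened version of the data example, serialized onto a single line.
sample = json.dumps({
    "query": "Please comment this code or function.",
    "response": "Use regular expressions to match variable names.",
    "level_3_dim": "knowledge Q&A",
    "level_4_dim": "code_comment",
})
records = load_records(sample + "\n")
print(len(records), records[0]["level_4_dim"])
```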
|
||
|
||
## Change to Core Code | ||
|
||
![](./imgs/img.png) | ||
|
||
## Prepare Datasets | ||
|
||
You can download dataset in | ||
|
||
``` | ||
dataset/smart_code | ||
├── comments | ||
│ ├── test_data | ||
│ │ ├── data.jsonl | ||
│ │ └── metadata.json | ||
│ └── train_data | ||
└── issue | ||
├── test_data | ||
│ ├── data_full.jsonl | ||
│ ├── data.jsonl | ||
│ └── metadata.json | ||
└── train_data | ||
``` | ||
Because Ianvs itself does not require training in this example, `data.json` in `train_data` is an empty placeholder file. | ||
|
||
Therefore, do not add data to `data.json` in the `train_data` directory. | ||
|
||
**If you want to train ianvs or add data to the data.json file of the training dataset, please make the following changes** | ||
|
||
Open corresponding file `examples/government/singletask_learning_bench/subjective/testenv/testenv.yaml` | ||
|
||
Change `train_data` to `train_data_info` and its url to the corresponding `metadata.json` path | ||
|
||
|
||
## Prepare Environment | ||
|
||
You should change your sedna package like this: [sedna repo commit](https://github.com/IcyFeather233/sedna/commit/e13b82363c03dc771fca4922a24798554ca32a9f) | ||
|
||
Or you can replace the file in `yourpath/anaconda3/envs/ianvs/lib/python3.x/site-packages/sedna` with `examples/resources/sedna-llm.zip` | ||
|
||
## Run Ianvs | ||
|
||
### Comment | ||
|
||
`ianvs -f examples/smart_coding/smart_coding_learning_bench/comment/benchmarkingjob.yaml` | ||
|
||
### Issue | ||
|
||
`ianvs -f examples/smart_coding/smart_coding_learning_bench/issue/benchmarkingjob.yaml` |
68 changes: 68 additions & 0 deletions
68
examples/smart_coding/smart_coding_learning_bench/comment/benchmarkingjob.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,68 @@ | ||
benchmarkingjob: | ||
  # job name of benchmarking; string type; | ||
name: "benchmarkingjob" | ||
# the url address of job workspace that will reserve the output of tests; string type; | ||
workspace: "/home/xiebo/ianvs/workspace" | ||
|
||
# the url address of test environment configuration file; string type; | ||
# the file format supports yaml/yml; | ||
testenv: "/home/xiebo/ianvs/examples/smart_coding/smart_coding_learning_bench/comment/testenv/testenv.yaml" | ||
|
||
# the configuration of test object | ||
test_object: | ||
# test type; string type; | ||
# currently the option of value is "algorithms",the others will be added in succession. | ||
type: "algorithms" | ||
# test algorithm configuration files; list type; | ||
algorithms: | ||
# algorithm name; string type; | ||
- name: "code_comment_bench_singletask_learning" | ||
# the url address of test algorithm configuration file; string type; | ||
# the file format supports yaml/yml; | ||
url: "/home/xiebo/ianvs/examples/smart_coding/smart_coding_learning_bench/comment/testalgorithms/gen/gen_algorithm.yaml" | ||
|
||
# the configuration of ranking leaderboard | ||
rank: | ||
# rank leaderboard with metric of test case's evaluation and order ; list type; | ||
# the sorting priority is based on the sequence of metrics in the list from front to back; | ||
sort_by: [ { "llm_judgement": "descend" } ] | ||
|
||
# visualization configuration | ||
visualization: | ||
# mode of visualization in the leaderboard; string type; | ||
# There are quite a few possible dataitems in the leaderboard. Not all of them can be shown simultaneously on the screen. | ||
# In the leaderboard, we provide the "selected_only" mode for the user to configure what is shown or is not shown. | ||
mode: "selected_only" | ||
# method of visualization for selected dataitems; string type; | ||
# currently the options of value are as follows: | ||
# 1> "print_table": print selected dataitems; | ||
method: "print_table" | ||
|
||
# selected dataitem configuration | ||
# The user can add his/her interested dataitems in terms of "paradigms", "modules", "hyperparameters" and "metrics", | ||
# so that the selected columns will be shown. | ||
selected_dataitem: | ||
# currently the options of value are as follows: | ||
# 1> "all": select all paradigms in the leaderboard; | ||
# 2> paradigms in the leaderboard, e.g., "singletasklearning" | ||
paradigms: [ "all" ] | ||
# currently the options of value are as follows: | ||
# 1> "all": select all modules in the leaderboard; | ||
# 2> modules in the leaderboard, e.g., "basemodel" | ||
modules: [ "all" ] | ||
# currently the options of value are as follows: | ||
# 1> "all": select all hyperparameters in the leaderboard; | ||
# 2> hyperparameters in the leaderboard, e.g., "momentum" | ||
hyperparameters: [ "all" ] | ||
# currently the options of value are as follows: | ||
# 1> "all": select all metrics in the leaderboard; | ||
# 2> metrics in the leaderboard, e.g., "f1_score" | ||
metrics: [ "llm_judgement" ] | ||
|
||
    # mode of saving selected and all dataitems in workspace; string type; | ||
# currently the options of value are as follows: | ||
# 1> "selected_and_all": save selected and all dataitems; | ||
# 2> "selected_only": save selected dataitems; | ||
save_mode: "selected_and_all" | ||
|
||
|
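The `sort_by` comment in this file says that metrics earlier in the list take priority when ranking. A minimal sketch of that ordering rule in plain Python follows. `rank_results` and the result rows are hypothetical, and this is not the Ianvs leaderboard implementation, only an illustration of multi-key sorting with per-key direction.

```python
def rank_results(results, sort_by):
    """Sort result rows by several metrics; earlier entries in sort_by take priority."""
    ranked = list(results)
    # Python's sort is stable, so sorting by the lowest-priority key first
    # and the highest-priority key last yields the desired overall order.
    for rule in reversed(sort_by):
        for metric, order in rule.items():
            ranked.sort(key=lambda row: row[metric], reverse=(order == "descend"))
    return ranked


# Made-up scores for two hypothetical algorithms.
results = [
    {"algorithm": "a", "llm_judgement": 6.5},
    {"algorithm": "b", "llm_judgement": 8.0},
]
ranked = rank_results(results, [{"llm_judgement": "descend"}])
print([row["algorithm"] for row in ranked])
```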
129 changes: 129 additions & 0 deletions
129
examples/smart_coding/smart_coding_learning_bench/comment/testalgorithms/gen/basemodel.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,129 @@ | ||
# Copyright 2022 The KubeEdge Authors. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
from __future__ import absolute_import, division | ||
|
||
import os | ||
import tempfile | ||
import time | ||
import zipfile | ||
import logging | ||
|
||
import numpy as np | ||
import random | ||
from tqdm import tqdm | ||
from sedna.common.config import Context | ||
from sedna.common.class_factory import ClassType, ClassFactory | ||
from core.common.log import LOGGER | ||
from openai import OpenAI | ||
|
||
from transformers import AutoModelForCausalLM, AutoTokenizer | ||
|
||
device = "cuda" # the device to load the model onto | ||
|
||
logging.disable(logging.WARNING) | ||
|
||
__all__ = ["BaseModel"] | ||
|
||
os.environ['BACKEND_TYPE'] = 'TORCH' | ||
|
||
|
||
@ClassFactory.register(ClassType.GENERAL, alias="gen") | ||
class BaseModel: | ||
|
||
def __init__(self, **kwargs): | ||
self.model = AutoModelForCausalLM.from_pretrained( | ||
"/home/xiebo/model/Qwen2.5-Coder-1.5B-Instruct", | ||
torch_dtype="auto", | ||
device_map="auto" | ||
) | ||
self.tokenizer = AutoTokenizer.from_pretrained("/home/xiebo/model/Qwen2.5-Coder-1.5B-Instruct") | ||
|
||
def train(self, train_data, valid_data=None, **kwargs): | ||
LOGGER.info("BaseModel train") | ||
|
||
def save(self, model_path): | ||
LOGGER.info("BaseModel save") | ||
|
||
def predict(self, data, input_shape=None, **kwargs): | ||
LOGGER.info("BaseModel predict") | ||
LOGGER.info(f"Dataset: {data.dataset_name}") | ||
LOGGER.info(f"Description: {data.description}") | ||
LOGGER.info(f"Data Level 1 Dim: {data.level_1_dim}") | ||
LOGGER.info(f"Data Level 2 Dim: {data.level_2_dim}") | ||
|
||
answer_list = [] | ||
for line in tqdm(data.x, desc="Processing", unit="question"): | ||
history = [] | ||
history.append({"role": "user", "content": line}) | ||
response = self._infer(history) | ||
answer_list.append(response) | ||
|
||
judgement_list = [] | ||
|
||
# evaluate by llm | ||
for index in tqdm(range(len(answer_list)), desc="Evaluating", ascii=False, ncols=75): | ||
prompt = data.judge_prompts[index] + answer_list[index] | ||
judgement = self._openai_generate(prompt) | ||
judgement_list.append(judgement) | ||
|
||
return judgement_list | ||
|
||
def load(self, model_url=None): | ||
LOGGER.info("BaseModel load") | ||
|
||
def evaluate(self, data, model_path, **kwargs): | ||
LOGGER.info("BaseModel evaluate") | ||
|
||
def _infer(self, messages): | ||
text = self.tokenizer.apply_chat_template( | ||
messages, | ||
tokenize=False, | ||
add_generation_prompt=True | ||
) | ||
        # Move inputs to the device the model was placed on by device_map="auto" | ||
        model_inputs = self.tokenizer([text], return_tensors="pt").to(self.model.device) | ||
|
||
generated_ids = self.model.generate( | ||
model_inputs.input_ids, | ||
max_new_tokens=512, | ||
temperature=0.1, | ||
top_p=0.9 | ||
) | ||
generated_ids = [ | ||
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) | ||
] | ||
|
||
response = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] | ||
return response | ||
|
||
def _openai_generate(self, user_question, system=None): | ||
key = os.getenv("DEEPSEEK_API_KEY") | ||
if not key: | ||
raise ValueError("You should set DEEPSEEK_API_KEY in your env.") | ||
client = OpenAI(api_key=key, base_url="https://api.deepseek.com") | ||
|
||
messages = [] | ||
if system: | ||
messages.append({"role": "system", "content": system}) | ||
messages.append({"role": "user", "content": user_question}) | ||
|
||
response = client.chat.completions.create( | ||
model="deepseek-chat", | ||
messages=messages, | ||
stream=False | ||
) | ||
|
||
res = response.choices[0].message.content | ||
|
||
return res |
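In `_infer`, the list comprehension after `generate` drops the prompt tokens from each output sequence, because a causal language model's `generate` returns the prompt followed by the newly generated tokens. The sketch below shows the same slicing on plain Python lists instead of tensors. `strip_prompt_tokens` is a hypothetical helper name and the token IDs are made up.

```python
def strip_prompt_tokens(input_ids_batch, generated_ids_batch):
    """Keep only the tokens that follow each prompt in the generated sequences."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, generated_ids_batch)
    ]


# One prompt of four tokens; the model output repeats it and appends three new tokens.
prompt_batch = [[101, 7, 8, 9]]
generated_batch = [[101, 7, 8, 9, 42, 43, 2]]
new_tokens = strip_prompt_tokens(prompt_batch, generated_batch)
print(new_tokens)
```

Only the trailing tokens are then passed to `batch_decode`, so the decoded response does not echo the chat template or the user question.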
18 changes: 18 additions & 0 deletions
18
...es/smart_coding/smart_coding_learning_bench/comment/testalgorithms/gen/gen_algorithm.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
algorithm: | ||
# paradigm name; string type; | ||
# currently the options of value are as follows: | ||
# 1> "singletasklearning" | ||
# 2> "incrementallearning" | ||
paradigm_type: "singletasklearning" | ||
|
||
# algorithm module configuration in the paradigm; list type; | ||
modules: | ||
# kind of algorithm module; string type; | ||
# currently the options of value are as follows: | ||
# 1> "basemodel" | ||
- type: "basemodel" | ||
# name of python module; string type; | ||
      # example: basemodel.py registers the BaseModel class under the alias "gen" for this benchmark; | ||
name: "gen" | ||
# the url address of python module; string type; | ||
url: "./examples/smart_coding/smart_coding_learning_bench/comment/testalgorithms/gen/basemodel.py" |
48 changes: 48 additions & 0 deletions
48
examples/smart_coding/smart_coding_learning_bench/comment/testenv/llm_judgement.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
# Copyright 2022 The KubeEdge Authors. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
import re | ||
from sedna.common.class_factory import ClassType, ClassFactory | ||
from core.common.log import LOGGER | ||
|
||
__all__ = ["llm_judgement"] | ||
|
||
|
||
def extract_comprehensive_score(input_str): | ||
# Use regular expressions to match composite scores and their scores | ||
match = re.search(r"'Composite score': (\d+)", input_str) | ||
if match: | ||
# Extract the score and return it | ||
return int(match.group(1)) | ||
else: | ||
# If no match is found, return None or some other appropriate value | ||
return None | ||
|
||
|
||
@ClassFactory.register(ClassType.GENERAL, alias="llm_judgement") | ||
def llm_judgement(y_true, y_pred): | ||
y_pred = [extract_comprehensive_score(pred) for pred in y_pred] | ||
|
||
# Filter out the None value (if any) | ||
valid_scores = [score for score in y_pred if score is not None] | ||
|
||
    LOGGER.info(f"Extracted {len(valid_scores)} valid scores from {len(y_pred)} predictions") | ||
|
||
# Calculate the average | ||
if valid_scores: | ||
average_score = sum(valid_scores) / len(valid_scores) | ||
return average_score | ||
else: | ||
        # If there is no valid score, return -1 as a sentinel value | ||
return -1 |
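A standalone sketch of the metric above, without the `sedna` and `core` imports, can make the parsing behavior easier to see. The judge outputs below are invented for illustration (including the `'Accuracy'` key). The regular expression and the `-1` fallback mirror the file's code, while the helper names `extract_score` and `average_score` are hypothetical.

```python
import re


def extract_score(text):
    """Return the integer after 'Composite score': or None if it is absent."""
    match = re.search(r"'Composite score': (\d+)", text)
    return int(match.group(1)) if match else None


def average_score(judgements):
    """Average the extracted scores, ignoring judgements that could not be parsed."""
    scores = [s for s in (extract_score(j) for j in judgements) if s is not None]
    return sum(scores) / len(scores) if scores else -1


# Invented judge outputs: two parseable, one without a score.
judgements = [
    "{'Accuracy': 7, 'Composite score': 8}",
    "no score here",
    "{'Composite score': 6}",
]
print(average_score(judgements))
```

Note that the pattern expects single quotes around the key, so a judge reply formatted as strict JSON with double quotes would be skipped. Unparseable replies are dropped from the average rather than counted as zero.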
14 changes: 14 additions & 0 deletions
14
examples/smart_coding/smart_coding_learning_bench/comment/testenv/testenv.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
testenv: | ||
# dataset configuration | ||
dataset: | ||
# the url address of train dataset index; string type; | ||
train_data: "/home/xiebo/ianvs/dataset/smart_coding/comment/train_data/data.jsonl" | ||
# the url address of test dataset index; string type; | ||
test_data_info: "/home/xiebo/ianvs/dataset/smart_coding/comment/test_data/metadata.json" | ||
|
||
# metrics configuration for test case's evaluation; list type; | ||
metrics: | ||
# metric name; string type; | ||
- name: "llm_judgement" | ||
# the url address of python file | ||
url: "./examples/smart_coding/smart_coding_learning_bench/comment/testenv/llm_judgement.py" |