Merge pull request #159 from safe-b/dev
OSPP: Smart Coding benchmark suite: built on KubeEdge-Ianvs
kubeedge-bot authored Oct 31, 2024
2 parents f424d35 + 2c6af20 commit 3fa3879
Showing 14 changed files with 1,266 additions and 594 deletions.
1,188 changes: 594 additions & 594 deletions core/testenvmanager/dataset/dataset.py

Large diffs are not rendered by default.

Binary file modified docs/guides/images/ianvs_arch.png
114 changes: 114 additions & 0 deletions examples/smart_coding/smart_coding_learning_bench/README.md
@@ -0,0 +1,114 @@
# Smart Coding Benchmark

## Introduction

This is the work for the Domain-specific Large Model Benchmark:

Build a test suite for large code models, including test datasets, evaluation metrics, test environments, and usage guidelines.

The benchmark consists of two parts: comment data and issue data.

## Design

### Metadata Format

| Name | Field Name | Option | Description |
| --- | --- | --- | --- |
| Data Name | dataset | Required | Name of the dataset |
| Data Description | description | Optional | Dataset description, such as usage scope, sample size, etc. |
| First-level Dimension | level_1_dim | Required | Should fill in "Single Modal" or "Multi-Modal" |
| Second-level Dimension | level_2_dim | Required | For "Single Modal", fill in "Text", "Image", or "Audio". For "Multi-Modal", fill in "Text-Image", "Text-Audio", "Image-Audio", or "Text-Image-Audio" |
| Third-level Dimension | level_3_dim | Optional | Fill in only if all samples in the dataset share the same third-level dimension. If filled, the content should follow the standards in the normative reference document |
| Fourth-level Dimension | level_4_dim | Optional | Fill in only if all samples in the dataset share the same fourth-level dimension. If filled, the content should follow the standards in the normative reference document |

metadata example:

```json
{
"dataset": "Code_comment BenchMark",
"description": "xxx",
"level_1_dim": "single-modal",
"level_2_dim": "text",
"level_3_dim": "Q&A",
"level_4_dim": "code_comment"
}
```

### Data Format

| Field Name | Option | Description |
| --- | --- | --- |
| prompt | Optional | the background/context given to the LLM under test |
| query | Required | the test question |
| response | Required | the reference answer to the question |
| explanation | Optional | the explanation of the answer |
| judge_prompt | Optional | the prompt given to the judge model |
| level_1_dim | Optional | single-modal or multi-modal |
| level_2_dim | Optional | single-modal: text, image, video; multi-modal: text-image, text-video, text-image-video |
| level_3_dim | Required | third-level dimension details |
| level_4_dim | Required | fourth-level dimension details |

data example:

```json
{
"prompt": "Please think step by step and answer the question.",
"query": "Question:Here is a code function \"result = self.__custom_confs_rx.search(variable)\". Please comment this code or function.",
"response": "Use regular expressions to match variable names to determine whether they match a specific configuration item format.",
"judge_prompt": "xxx",
"level_1_dim": "single-modal",
"level_2_dim": "text",
"level_3_dim": "knowledge Q&A",
"level_4_dim": "code_comment"
}
```


## Changes to the Core Code

![](./imgs/img.png)

## Prepare Datasets

You can download the dataset and organize it as follows:

```
dataset/smart_code
├── comments
│ ├── test_data
│ │ ├── data.jsonl
│ │ └── metadata.json
│ └── train_data
└── issue
├── test_data
│ ├── data_full.jsonl
│ ├── data.jsonl
│ └── metadata.json
└── train_data
```
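
Each line of `data.jsonl` is one JSON object in the data format described above. As a quick sanity check, you can count the samples and print the first query (a minimal sketch, assuming the dataset is unpacked at the path shown above):

```python
import json

# assumed local path; adjust to wherever you placed the dataset
path = "dataset/smart_code/comments/test_data/data.jsonl"

with open(path, encoding="utf-8") as f:
    samples = [json.loads(line) for line in f if line.strip()]

print(f"{len(samples)} samples")
print(samples[0]["query"])  # "query" is a required field in every sample
```
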
Because Ianvs itself does not require training for this benchmark, `data.json` under `train_data` is an empty placeholder file.

Therefore, do not add data to `data.json` in the `train_data` directory.

**If you want to train with Ianvs, or add data to the `data.json` file of the training dataset, make the following changes:**

Open the corresponding `testenv.yaml` file for the benchmark you are running, e.g. `examples/smart_coding/smart_coding_learning_bench/comment/testenv/testenv.yaml`.

Change `train_data` to `train_data_info` and point its URL to the corresponding `metadata.json` path, as sketched below.
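
A minimal sketch of that edit (assuming you add a `metadata.json` for the training split next to `data.json`; the absolute paths below are illustrative and depend on where you placed the dataset):

```yaml
testenv:
  dataset:
    # before: the shipped training index is an empty placeholder
    # train_data: "/home/xiebo/ianvs/dataset/smart_coding/comment/train_data/data.jsonl"
    # after: point train_data_info at the training metadata (hypothetical path)
    train_data_info: "/home/xiebo/ianvs/dataset/smart_coding/comment/train_data/metadata.json"
    # the test split already uses a metadata-based index
    test_data_info: "/home/xiebo/ianvs/dataset/smart_coding/comment/test_data/metadata.json"
```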


## Prepare Environment

You need to patch your installed sedna package as shown in this commit: [sedna repo commit](https://github.com/IcyFeather233/sedna/commit/e13b82363c03dc771fca4922a24798554ca32a9f)

Alternatively, replace the files under `yourpath/anaconda3/envs/ianvs/lib/python3.x/site-packages/sedna` with the contents of `examples/resources/sedna-llm.zip`.

## Run Ianvs

### Comment

`ianvs -f examples/smart_coding/smart_coding_learning_bench/comment/benchmarkingjob.yaml`

### Issue

`ianvs -f examples/smart_coding/smart_coding_learning_bench/issue/benchmarkingjob.yaml`
68 changes: 68 additions & 0 deletions examples/smart_coding/smart_coding_learning_bench/comment/benchmarkingjob.yaml
@@ -0,0 +1,68 @@
benchmarkingjob:
  # job name of benchmarking; string type;
name: "benchmarkingjob"
# the url address of job workspace that will reserve the output of tests; string type;
workspace: "/home/xiebo/ianvs/workspace"

# the url address of test environment configuration file; string type;
# the file format supports yaml/yml;
testenv: "/home/xiebo/ianvs/examples/smart_coding/smart_coding_learning_bench/comment/testenv/testenv.yaml"

# the configuration of test object
test_object:
# test type; string type;
    # currently the option of value is "algorithms", the others will be added in succession.
type: "algorithms"
# test algorithm configuration files; list type;
algorithms:
# algorithm name; string type;
- name: "code_comment_bench_singletask_learning"
# the url address of test algorithm configuration file; string type;
# the file format supports yaml/yml;
url: "/home/xiebo/ianvs/examples/smart_coding/smart_coding_learning_bench/comment/testalgorithms/gen/gen_algorithm.yaml"

# the configuration of ranking leaderboard
rank:
    # rank the leaderboard by the metrics of the test case evaluation and their order; list type;
# the sorting priority is based on the sequence of metrics in the list from front to back;
sort_by: [ { "llm_judgement": "descend" } ]

# visualization configuration
visualization:
# mode of visualization in the leaderboard; string type;
# There are quite a few possible dataitems in the leaderboard. Not all of them can be shown simultaneously on the screen.
# In the leaderboard, we provide the "selected_only" mode for the user to configure what is shown or is not shown.
mode: "selected_only"
# method of visualization for selected dataitems; string type;
# currently the options of value are as follows:
# 1> "print_table": print selected dataitems;
method: "print_table"

# selected dataitem configuration
# The user can add his/her interested dataitems in terms of "paradigms", "modules", "hyperparameters" and "metrics",
# so that the selected columns will be shown.
selected_dataitem:
# currently the options of value are as follows:
# 1> "all": select all paradigms in the leaderboard;
# 2> paradigms in the leaderboard, e.g., "singletasklearning"
paradigms: [ "all" ]
# currently the options of value are as follows:
# 1> "all": select all modules in the leaderboard;
# 2> modules in the leaderboard, e.g., "basemodel"
modules: [ "all" ]
# currently the options of value are as follows:
# 1> "all": select all hyperparameters in the leaderboard;
# 2> hyperparameters in the leaderboard, e.g., "momentum"
hyperparameters: [ "all" ]
# currently the options of value are as follows:
# 1> "all": select all metrics in the leaderboard;
# 2> metrics in the leaderboard, e.g., "f1_score"
metrics: [ "llm_judgement" ]

    # mode of saving selected and all dataitems in the workspace; string type;
# currently the options of value are as follows:
# 1> "selected_and_all": save selected and all dataitems;
# 2> "selected_only": save selected dataitems;
save_mode: "selected_and_all"


129 changes: 129 additions & 0 deletions examples/smart_coding/smart_coding_learning_bench/comment/testalgorithms/gen/basemodel.py
@@ -0,0 +1,129 @@
# Copyright 2022 The KubeEdge Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import absolute_import, division

import os
import tempfile
import time
import zipfile
import logging

import numpy as np
import random
from tqdm import tqdm
from sedna.common.config import Context
from sedna.common.class_factory import ClassType, ClassFactory
from core.common.log import LOGGER
from openai import OpenAI

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # the device to load the model onto

logging.disable(logging.WARNING)

__all__ = ["BaseModel"]

os.environ['BACKEND_TYPE'] = 'TORCH'


@ClassFactory.register(ClassType.GENERAL, alias="gen")
class BaseModel:

def __init__(self, **kwargs):
self.model = AutoModelForCausalLM.from_pretrained(
"/home/xiebo/model/Qwen2.5-Coder-1.5B-Instruct",
torch_dtype="auto",
device_map="auto"
)
self.tokenizer = AutoTokenizer.from_pretrained("/home/xiebo/model/Qwen2.5-Coder-1.5B-Instruct")

def train(self, train_data, valid_data=None, **kwargs):
LOGGER.info("BaseModel train")

def save(self, model_path):
LOGGER.info("BaseModel save")

def predict(self, data, input_shape=None, **kwargs):
LOGGER.info("BaseModel predict")
LOGGER.info(f"Dataset: {data.dataset_name}")
LOGGER.info(f"Description: {data.description}")
LOGGER.info(f"Data Level 1 Dim: {data.level_1_dim}")
LOGGER.info(f"Data Level 2 Dim: {data.level_2_dim}")

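        # Stage 1: generate an answer for each test query with the local code model.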
answer_list = []
for line in tqdm(data.x, desc="Processing", unit="question"):
history = []
history.append({"role": "user", "content": line})
response = self._infer(history)
answer_list.append(response)

judgement_list = []

        # Stage 2: score each generated answer with the LLM judge
for index in tqdm(range(len(answer_list)), desc="Evaluating", ascii=False, ncols=75):
prompt = data.judge_prompts[index] + answer_list[index]
judgement = self._openai_generate(prompt)
judgement_list.append(judgement)

return judgement_list

def load(self, model_url=None):
LOGGER.info("BaseModel load")

def evaluate(self, data, model_path, **kwargs):
LOGGER.info("BaseModel evaluate")

def _infer(self, messages):
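        # Build the chat prompt with the tokenizer's chat template, generate up to
        # 512 new tokens locally, then strip the prompt tokens from the generated ids.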
text = self.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = self.tokenizer([text], return_tensors="pt").to(device)

generated_ids = self.model.generate(
model_inputs.input_ids,
max_new_tokens=512,
temperature=0.1,
top_p=0.9
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
return response

def _openai_generate(self, user_question, system=None):
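        # Query the DeepSeek judge model through its OpenAI-compatible API and
        # return the raw judgement text.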
key = os.getenv("DEEPSEEK_API_KEY")
if not key:
raise ValueError("You should set DEEPSEEK_API_KEY in your env.")
client = OpenAI(api_key=key, base_url="https://api.deepseek.com")

messages = []
if system:
messages.append({"role": "system", "content": system})
messages.append({"role": "user", "content": user_question})

response = client.chat.completions.create(
model="deepseek-chat",
messages=messages,
stream=False
)

res = response.choices[0].message.content

return res
18 changes: 18 additions & 0 deletions examples/smart_coding/smart_coding_learning_bench/comment/testalgorithms/gen/gen_algorithm.yaml
@@ -0,0 +1,18 @@
algorithm:
# paradigm name; string type;
# currently the options of value are as follows:
# 1> "singletasklearning"
# 2> "incrementallearning"
paradigm_type: "singletasklearning"

# algorithm module configuration in the paradigm; list type;
modules:
# kind of algorithm module; string type;
# currently the options of value are as follows:
# 1> "basemodel"
- type: "basemodel"
# name of python module; string type;
      # example: basemodel.py has the BaseModel class registered with the alias "gen" for this benchmarking;
name: "gen"
# the url address of python module; string type;
url: "./examples/smart_coding/smart_coding_learning_bench/comment/testalgorithms/gen/basemodel.py"
48 changes: 48 additions & 0 deletions examples/smart_coding/smart_coding_learning_bench/comment/testenv/llm_judgement.py
@@ -0,0 +1,48 @@
# Copyright 2022 The KubeEdge Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import re
from sedna.common.class_factory import ClassType, ClassFactory
from core.common.log import LOGGER

__all__ = ["llm_judgement"]


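# The judge model is expected to return text containing a field like "'Composite score': <int>";
# the helper below extracts that integer from the judgement string.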
def extract_comprehensive_score(input_str):
# Use regular expressions to match composite scores and their scores
match = re.search(r"'Composite score': (\d+)", input_str)
if match:
# Extract the score and return it
return int(match.group(1))
else:
# If no match is found, return None or some other appropriate value
return None


@ClassFactory.register(ClassType.GENERAL, alias="llm_judgement")
def llm_judgement(y_true, y_pred):
y_pred = [extract_comprehensive_score(pred) for pred in y_pred]

# Filter out the None value (if any)
valid_scores = [score for score in y_pred if score is not None]

    LOGGER.info(f"Extracted {len(valid_scores)} valid scores from {len(y_pred)} judgements")

# Calculate the average
if valid_scores:
average_score = sum(valid_scores) / len(valid_scores)
return average_score
else:
        # If there is no valid score, return -1 as a sentinel value
return -1
14 changes: 14 additions & 0 deletions examples/smart_coding/smart_coding_learning_bench/comment/testenv/testenv.yaml
@@ -0,0 +1,14 @@
testenv:
# dataset configuration
dataset:
# the url address of train dataset index; string type;
train_data: "/home/xiebo/ianvs/dataset/smart_coding/comment/train_data/data.jsonl"
# the url address of test dataset index; string type;
test_data_info: "/home/xiebo/ianvs/dataset/smart_coding/comment/test_data/metadata.json"

# metrics configuration for test case's evaluation; list type;
metrics:
# metric name; string type;
- name: "llm_judgement"
# the url address of python file
url: "./examples/smart_coding/smart_coding_learning_bench/comment/testenv/llm_judgement.py"