Merge pull request #159 from safe-b/dev
OSPP: Smart Coding benchmark suite: built on KubeEdge-Ianvs
kubeedge-bot authored Oct 31, 2024
2 parents f424d35 + 2c6af20 commit 3fa3879
Showing 14 changed files with 1,266 additions and 594 deletions.
1,188 changes: 594 additions & 594 deletions core/testenvmanager/dataset/dataset.py

Large diffs are not rendered by default.

Binary file modified docs/guides/images/ianvs_arch.png
114 changes: 114 additions & 0 deletions examples/smart_coding/smart_coding_learning_bench/README.md
@@ -0,0 +1,114 @@
# Smart Coding Benchmark

## Introduction

This is the work for the Domain-specific Large Model Benchmark:

Build a test suite for large code models, including test datasets, evaluation metrics, test environments, and usage guidelines.

The benchmark consists of two parts: comment data and issue data.

## Design

### Metadata Format

| Name | Field Name | Option | Description |
| --- | --- | --- | --- |
| Data Name | dataset | Required | Name of the dataset |
| Data Description | description | Optional | Dataset description, such as usage scope, sample size, etc. |
| First-level Dimension | level_1_dim | Required | Should fill in "Single Modal" or "Multi-Modal" |
| Second-level Dimension | level_2_dim | Required | For "Single Modal", fill in "Text", "Image", or "Audio". For "Multi-Modal", fill in "Text-Image", "Text-Audio", "Image-Audio", or "Text-Image-Audio" |
| Third-level Dimension | level_3_dim | Optional | Fill in only if all samples in the dataset share the same third-level dimension. If filled, the content should follow the standards in the normative reference document |
| Fourth-level Dimension | level_4_dim | Optional | Fill in only if all samples in the dataset share the same fourth-level dimension. If filled, the content should follow the standards in the normative reference document |

metadata example:

```json
{
"dataset": "Code_comment BenchMark",
"description": "xxx",
"level_1_dim": "single-modal",
"level_2_dim": "text",
"level_3_dim": "Q&A",
"level_4_dim": "code_comment"
}
```

### Data Format

| Field Name | Option | Description |
| --- | --- | --- |
| prompt | Optional | the background/context given to the LLM under test |
| query | Required | the test question |
| response | Required | the reference answer to the question |
| explanation | Optional | the explanation of the answer |
| judge_prompt | Optional | the prompt given to the judge model |
| level_1_dim | Optional | single-modal or multi-modal |
| level_2_dim | Optional | single-modal: text, image, video; multi-modal: text-image, text-video, text-image-video |
| level_3_dim | Required | third-level dimension details |
| level_4_dim | Required | fourth-level dimension details |

data example:

```json
{
"prompt": "Please think step by step and answer the question.",
"query": "Question:Here is a code function \"result = self.__custom_confs_rx.search(variable)\". Please comment this code or function.",
"response": "Use regular expressions to match variable names to determine whether they match a specific configuration item format.",
"judge_prompt": "xxx",
"level_1_dim": "single-modal",
"level_2_dim": "text",
"level_3_dim": "knowledge Q&A",
"level_4_dim": "code_comment"
}
```


## Changes to the Core Code

![](./imgs/img.png)

## Prepare Datasets

You can download the dataset and organize it as follows:

```
dataset/smart_code
├── comments
│ ├── test_data
│ │ ├── data.jsonl
│ │ └── metadata.json
│ └── train_data
└── issue
├── test_data
│ ├── data_full.jsonl
│ ├── data.jsonl
│ └── metadata.json
└── train_data
```
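
Each line of `data.jsonl` is one JSON object in the data format described above. As a quick sanity check, you can count the samples and print the first query (a minimal sketch, assuming the dataset is unpacked at the path shown above):

```python
import json

# assumed local path; adjust to wherever you placed the dataset
path = "dataset/smart_code/comments/test_data/data.jsonl"

with open(path, encoding="utf-8") as f:
    samples = [json.loads(line) for line in f if line.strip()]

print(f"{len(samples)} samples")
print(samples[0]["query"])  # "query" is a required field in every sample
```
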
Because Ianvs itself does not require training for this benchmark, `data.json` under `train_data` is an empty placeholder file.

Therefore, do not add data to `data.json` in the `train_data` directory.

**If you want to train with Ianvs, or add data to the `data.json` file of the training dataset, make the following changes:**

Open the corresponding `testenv.yaml` file for the benchmark you are running, e.g. `examples/smart_coding/smart_coding_learning_bench/comment/testenv/testenv.yaml`.

Change `train_data` to `train_data_info` and point its URL to the corresponding `metadata.json` path, as sketched below.
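
A minimal sketch of that edit (assuming you add a `metadata.json` for the training split next to `data.json`; the absolute paths below are illustrative and depend on where you placed the dataset):

```yaml
testenv:
  dataset:
    # before: the shipped training index is an empty placeholder
    # train_data: "/home/xiebo/ianvs/dataset/smart_coding/comment/train_data/data.jsonl"
    # after: point train_data_info at the training metadata (hypothetical path)
    train_data_info: "/home/xiebo/ianvs/dataset/smart_coding/comment/train_data/metadata.json"
    # the test split already uses a metadata-based index
    test_data_info: "/home/xiebo/ianvs/dataset/smart_coding/comment/test_data/metadata.json"
```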


## Prepare Environment

You need to patch your installed sedna package as shown in this commit: [sedna repo commit](https://github.com/IcyFeather233/sedna/commit/e13b82363c03dc771fca4922a24798554ca32a9f)

Alternatively, replace the files under `yourpath/anaconda3/envs/ianvs/lib/python3.x/site-packages/sedna` with the contents of `examples/resources/sedna-llm.zip`.

## Run Ianvs

### Comment

`ianvs -f examples/smart_coding/smart_coding_learning_bench/comment/benchmarkingjob.yaml`

### Issue

`ianvs -f examples/smart_coding/smart_coding_learning_bench/issue/benchmarkingjob.yaml`
68 changes: 68 additions & 0 deletions examples/smart_coding/smart_coding_learning_bench/comment/benchmarkingjob.yaml
@@ -0,0 +1,68 @@
benchmarkingjob:
  # job name of benchmarking; string type;
name: "benchmarkingjob"
# the url address of job workspace that will reserve the output of tests; string type;
workspace: "/home/xiebo/ianvs/workspace"

# the url address of test environment configuration file; string type;
# the file format supports yaml/yml;
testenv: "/home/xiebo/ianvs/examples/smart_coding/smart_coding_learning_bench/comment/testenv/testenv.yaml"

# the configuration of test object
test_object:
# test type; string type;
    # currently the option of value is "algorithms", the others will be added in succession.
type: "algorithms"
# test algorithm configuration files; list type;
algorithms:
# algorithm name; string type;
- name: "code_comment_bench_singletask_learning"
# the url address of test algorithm configuration file; string type;
# the file format supports yaml/yml;
url: "/home/xiebo/ianvs/examples/smart_coding/smart_coding_learning_bench/comment/testalgorithms/gen/gen_algorithm.yaml"

# the configuration of ranking leaderboard
rank:
    # rank the leaderboard by the metrics of the test case evaluation and their order; list type;
# the sorting priority is based on the sequence of metrics in the list from front to back;
sort_by: [ { "llm_judgement": "descend" } ]

# visualization configuration
visualization:
# mode of visualization in the leaderboard; string type;
# There are quite a few possible dataitems in the leaderboard. Not all of them can be shown simultaneously on the screen.
# In the leaderboard, we provide the "selected_only" mode for the user to configure what is shown or is not shown.
mode: "selected_only"
# method of visualization for selected dataitems; string type;
# currently the options of value are as follows:
# 1> "print_table": print selected dataitems;
method: "print_table"

# selected dataitem configuration
# The user can add his/her interested dataitems in terms of "paradigms", "modules", "hyperparameters" and "metrics",
# so that the selected columns will be shown.
selected_dataitem:
# currently the options of value are as follows:
# 1> "all": select all paradigms in the leaderboard;
# 2> paradigms in the leaderboard, e.g., "singletasklearning"
paradigms: [ "all" ]
# currently the options of value are as follows:
# 1> "all": select all modules in the leaderboard;
# 2> modules in the leaderboard, e.g., "basemodel"
modules: [ "all" ]
# currently the options of value are as follows:
# 1> "all": select all hyperparameters in the leaderboard;
# 2> hyperparameters in the leaderboard, e.g., "momentum"
hyperparameters: [ "all" ]
# currently the options of value are as follows:
# 1> "all": select all metrics in the leaderboard;
# 2> metrics in the leaderboard, e.g., "f1_score"
metrics: [ "llm_judgement" ]

    # mode of saving selected and all dataitems in the workspace; string type;
# currently the options of value are as follows:
# 1> "selected_and_all": save selected and all dataitems;
# 2> "selected_only": save selected dataitems;
save_mode: "selected_and_all"


129 changes: 129 additions & 0 deletions examples/smart_coding/smart_coding_learning_bench/comment/testalgorithms/gen/basemodel.py
@@ -0,0 +1,129 @@
# Copyright 2022 The KubeEdge Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import absolute_import, division

import os
import tempfile
import time
import zipfile
import logging

import numpy as np
import random
from tqdm import tqdm
from sedna.common.config import Context
from sedna.common.class_factory import ClassType, ClassFactory
from core.common.log import LOGGER
from openai import OpenAI

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # the device to load the model onto

logging.disable(logging.WARNING)

__all__ = ["BaseModel"]

os.environ['BACKEND_TYPE'] = 'TORCH'


@ClassFactory.register(ClassType.GENERAL, alias="gen")
class BaseModel:

def __init__(self, **kwargs):
self.model = AutoModelForCausalLM.from_pretrained(
"/home/xiebo/model/Qwen2.5-Coder-1.5B-Instruct",
torch_dtype="auto",
device_map="auto"
)
self.tokenizer = AutoTokenizer.from_pretrained("/home/xiebo/model/Qwen2.5-Coder-1.5B-Instruct")

def train(self, train_data, valid_data=None, **kwargs):
LOGGER.info("BaseModel train")

def save(self, model_path):
LOGGER.info("BaseModel save")

def predict(self, data, input_shape=None, **kwargs):
LOGGER.info("BaseModel predict")
LOGGER.info(f"Dataset: {data.dataset_name}")
LOGGER.info(f"Description: {data.description}")
LOGGER.info(f"Data Level 1 Dim: {data.level_1_dim}")
LOGGER.info(f"Data Level 2 Dim: {data.level_2_dim}")

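        # Stage 1: generate an answer for each test query with the local code model.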
answer_list = []
for line in tqdm(data.x, desc="Processing", unit="question"):
history = []
history.append({"role": "user", "content": line})
response = self._infer(history)
answer_list.append(response)

judgement_list = []

        # Stage 2: score each generated answer with the LLM judge
for index in tqdm(range(len(answer_list)), desc="Evaluating", ascii=False, ncols=75):
prompt = data.judge_prompts[index] + answer_list[index]
judgement = self._openai_generate(prompt)
judgement_list.append(judgement)

return judgement_list

def load(self, model_url=None):
LOGGER.info("BaseModel load")

def evaluate(self, data, model_path, **kwargs):
LOGGER.info("BaseModel evaluate")

def _infer(self, messages):
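        # Build the chat prompt with the tokenizer's chat template, generate up to
        # 512 new tokens locally, then strip the prompt tokens from the generated ids.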
text = self.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = self.tokenizer([text], return_tensors="pt").to(device)

generated_ids = self.model.generate(
model_inputs.input_ids,
max_new_tokens=512,
temperature=0.1,
top_p=0.9
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
return response

def _openai_generate(self, user_question, system=None):
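        # Query the DeepSeek judge model through its OpenAI-compatible API and
        # return the raw judgement text.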
key = os.getenv("DEEPSEEK_API_KEY")
if not key:
raise ValueError("You should set DEEPSEEK_API_KEY in your env.")
client = OpenAI(api_key=key, base_url="https://api.deepseek.com")

messages = []
if system:
messages.append({"role": "system", "content": system})
messages.append({"role": "user", "content": user_question})

response = client.chat.completions.create(
model="deepseek-chat",
messages=messages,
stream=False
)

res = response.choices[0].message.content

return res
18 changes: 18 additions & 0 deletions examples/smart_coding/smart_coding_learning_bench/comment/testalgorithms/gen/gen_algorithm.yaml
@@ -0,0 +1,18 @@
algorithm:
# paradigm name; string type;
# currently the options of value are as follows:
# 1> "singletasklearning"
# 2> "incrementallearning"
paradigm_type: "singletasklearning"

# algorithm module configuration in the paradigm; list type;
modules:
# kind of algorithm module; string type;
# currently the options of value are as follows:
# 1> "basemodel"
- type: "basemodel"
# name of python module; string type;
      # example: basemodel.py has the BaseModel class registered with the alias "gen" for this benchmarking;
name: "gen"
# the url address of python module; string type;
url: "./examples/smart_coding/smart_coding_learning_bench/comment/testalgorithms/gen/basemodel.py"
48 changes: 48 additions & 0 deletions examples/smart_coding/smart_coding_learning_bench/comment/testenv/llm_judgement.py
@@ -0,0 +1,48 @@
# Copyright 2022 The KubeEdge Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import re
from sedna.common.class_factory import ClassType, ClassFactory
from core.common.log import LOGGER

__all__ = ["llm_judgement"]


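# The judge model is expected to return text containing a field like "'Composite score': <int>";
# the helper below extracts that integer from the judgement string.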
def extract_comprehensive_score(input_str):
# Use regular expressions to match composite scores and their scores
match = re.search(r"'Composite score': (\d+)", input_str)
if match:
# Extract the score and return it
return int(match.group(1))
else:
# If no match is found, return None or some other appropriate value
return None


@ClassFactory.register(ClassType.GENERAL, alias="llm_judgement")
def llm_judgement(y_true, y_pred):
y_pred = [extract_comprehensive_score(pred) for pred in y_pred]

# Filter out the None value (if any)
valid_scores = [score for score in y_pred if score is not None]

    LOGGER.info(f"Extracted {len(valid_scores)} valid scores from {len(y_pred)} judgements")

# Calculate the average
if valid_scores:
average_score = sum(valid_scores) / len(valid_scores)
return average_score
else:
        # If there is no valid score, return -1 as a sentinel value
return -1
14 changes: 14 additions & 0 deletions examples/smart_coding/smart_coding_learning_bench/comment/testenv/testenv.yaml
@@ -0,0 +1,14 @@
testenv:
# dataset configuration
dataset:
# the url address of train dataset index; string type;
train_data: "/home/xiebo/ianvs/dataset/smart_coding/comment/train_data/data.jsonl"
# the url address of test dataset index; string type;
test_data_info: "/home/xiebo/ianvs/dataset/smart_coding/comment/test_data/metadata.json"

# metrics configuration for test case's evaluation; list type;
metrics:
# metric name; string type;
- name: "llm_judgement"
# the url address of python file
url: "./examples/smart_coding/smart_coding_learning_bench/comment/testenv/llm_judgement.py"