added auto-HPO feature with WandB (#65)

* added auto-HPO feature with WandB
  - The core modifications are in tools/hpo and data_juicer/config.py.
  - The others are from the pre-commit run.
* minor fix for relative import
* fix according to yilun's comments
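
Note on the pre-commit changes: most hunks in this diff only remove trailing whitespace or normalize the end of a file (for example in README.md, app.py, LICENSE, and the YAML recipes). The commit message does not name the hooks that were run, so the following is only a minimal sketch of that kind of normalization, assuming behaviour similar to common whitespace hooks. The function name `strip_trailing_whitespace` is hypothetical and is not part of Data-Juicer or pre-commit.

```python
def strip_trailing_whitespace(text: str) -> str:
    """Sketch of a whitespace normalizer (hypothetical helper, not a real hook).

    Removes trailing spaces and tabs from every line, drops blank lines at
    the end of the text, and terminates non-empty text with one newline.
    """
    # Strip trailing spaces/tabs line by line; leading indentation is kept.
    lines = [line.rstrip(" \t") for line in text.splitlines()]

    # Drop blank lines at the end of the file.
    while lines and lines[-1] == "":
        lines.pop()

    # End with exactly one newline, or return an empty string for empty input.
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    before = "      min_ratio: 0.1  \n      rep_len: 10\n\n\n"
    print(repr(strip_trailing_whitespace(before)))
```

Applied to a line such as `            return ` from app.py, this would yield the same line without the trailing space, which matches the shape of the `-`/`+` pairs below.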
modelscope · Nov 8, 2023 · e18292a
1 parent c938c15
commit e18292a

Showing 105 changed files with 1,999 additions and 1,252 deletions.
diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml
@@ -113,4 +113,4 @@ body:
   - type: textarea
     attributes:
       label: Additional 额外信息
-      description: Anything else you would like to share? 其他您想分享的信息。
+      description: Anything else you would like to share? 其他您想分享的信息。
diff --git a/.github/ISSUE_TEMPLATE/custom.md b/.github/ISSUE_TEMPLATE/custom.md
@@ -6,5 +6,3 @@ labels: ''
 assignees: ''
 
 ---
-
-
diff --git a/.github/ISSUE_TEMPLATE/feature_request.yml b/.github/ISSUE_TEMPLATE/feature_request.yml
@@ -49,4 +49,4 @@ body:
         (Optional) We encourage you to submit a [Pull Request](https://github.com/alibaba/data-juicer/pulls) (PR) to help improve Data-Juicer for everyone, especially if you have a good understanding of how to implement a fix or feature.
         （可选项）我们鼓励您提交一个 [Pull Request (PR)]([Pull Request](https://github.com/alibaba/data-juicer/pulls)) 来为开源社区提升 Data-Juicer 的能力，尤其是如果您对如何实现或者修复一个功能有比较不错的理解的时候~
       options:
-        - label: Yes I'd like to help by submitting a PR! 是的！我愿意提供帮助并提交一个PR！
+        - label: Yes I'd like to help by submitting a PR! 是的！我愿意提供帮助并提交一个PR！
diff --git a/.github/ISSUE_TEMPLATE/question.yml b/.github/ISSUE_TEMPLATE/question.yml
@@ -48,4 +48,4 @@ body:
   - type: textarea
     attributes:
       label: Additional 额外信息
-      description: Anything else you would like to share? 其他您想分享的信息。
+      description: Anything else you would like to share? 其他您想分享的信息。
diff --git a/.gitignore b/.gitignore
@@ -13,4 +13,5 @@ dist
 # others
 .DS_Store
 .idea/
+wandb/
 __pycache__
diff --git a/LICENSE b/LICENSE
@@ -417,5 +417,3 @@ distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-
-
diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-English | [**中文**](README_ZH.md) 
+English | [**中文**](README_ZH.md)
 
 # Data-Juicer:  A One-Stop Data Processing System for Large Language Models
 
@@ -26,7 +26,7 @@ English | [**中文**](README_ZH.md)
 [![QualityClassifier](https://img.shields.io/badge/Tools-Quality_Classifier-saddlebrown?logo=Markdown)](tools/quality_classifier/README.md)
 [![AutoEvaluation](https://img.shields.io/badge/Tools-Auto_Evaluation-saddlebrown?logo=Markdown)](tools/evaluator/README.md)
 
-Data-Juicer is a one-stop data processing system to make data higher-quality, 
+Data-Juicer is a one-stop data processing system to make data higher-quality,
 juicier, and more digestible for LLMs.
 This project is being actively updated and maintained, and we will periodically enhance and add more features and data recipes. We welcome you to join us in promoting LLM data development and research!
 
@@ -68,22 +68,22 @@ Table of Contents
 
 ![Overview](https://img.alicdn.com/imgextra/i2/O1CN01IMPeD11xYRUYLmXKO_!!6000000006455-2-tps-3620-1604.png)
 
-- **Systematic & Reusable**: 
-  Empowering users with a systematic library of 20+ reusable [config recipes](configs), 50+ core [OPs](docs/Operators.md), and feature-rich 
-  dedicated [toolkits](#documentation), designed to 
+- **Systematic & Reusable**:
+  Empowering users with a systematic library of 20+ reusable [config recipes](configs), 50+ core [OPs](docs/Operators.md), and feature-rich
+  dedicated [toolkits](#documentation), designed to
   function independently of specific LLM datasets and processing pipelines.
 
-- **Data-in-the-loop**: Allowing detailed data analyses with an automated 
+- **Data-in-the-loop**: Allowing detailed data analyses with an automated
   report generation feature for a deeper understanding of your dataset. Coupled with multi-dimension automatic evaluation capabilities, it supports a timely feedback loop at multiple stages in the LLM development process.
   ![Data-in-the-loop](https://img.alicdn.com/imgextra/i1/O1CN011E99C01ndLZ55iCUS_!!6000000005112-0-tps-2701-1050.jpg)
 
-- **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data 
-  processing recipes](configs/data_juicer_recipes/README.md) for pre-training, fine-tuning, en, zh, and more scenarios. Validated on 
-  reference LLaMA models.  
+- **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data
+  processing recipes](configs/data_juicer_recipes/README.md) for pre-training, fine-tuning, en, zh, and more scenarios. Validated on
+  reference LLaMA models.
   ![exp_llama](https://img.alicdn.com/imgextra/i2/O1CN019WtUPP1uhebnDlPR8_!!6000000006069-2-tps-2530-1005.png)
 
-- **Enhanced Efficiency**: Providing a speedy data processing pipeline 
-  requiring less memory and CPU usage, optimized for maximum productivity. 
+- **Enhanced Efficiency**: Providing a speedy data processing pipeline
+  requiring less memory and CPU usage, optimized for maximum productivity.
   ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)
 
 
@@ -137,13 +137,13 @@ pip install py-data-juicer
 
 ### Using Docker
 
-- You can 
+- You can
   - either pull our pre-built image from DockerHub:
     ```shell
     docker pull datajuicer/data-juicer:<version_tag>
     ```
 
-  - or run the following command to build the docker image including the 
+  - or run the following command to build the docker image including the
     latest `data-juicer` with provided [Dockerfile](Dockerfile):
 
     ```shell

diff --git a/README_ZH.md b/README_ZH.md
@@ -76,7 +76,7 @@ Data-Juicer 是一个一站式数据处理系统，旨在为大语言模型 (LLM
 * **效率增强**：提供高效的数据处理流水线，减少内存占用和CPU开销，提高生产力。  ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)
 
 * **用户友好**：设计简单易用，提供全面的[文档](#documentation)、简易[入门指南](#快速上手)和[演示配置](configs/README_ZH.md)，并且可以轻松地添加/删除[现有配置](configs/config_all.yaml)中的算子。
-  
+
 * **灵活 & 易扩展**：支持大多数数据格式（如jsonl、parquet、csv等），并允许灵活组合算子。支持[自定义算子](docs/DeveloperGuide_ZH.md#构建自己的算子)，以执行定制化的数据处理。
 
 
@@ -301,7 +301,7 @@ Data-Juicer 在 Apache License 2.0 协议下发布。
 
 我们非常欢迎贡献新功能、修复漏洞以及讨论。请参考[开发者指南](docs/DeveloperGuide_ZH.md)。
 
-欢迎加入我们的[Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp), 或[DingDing群](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) 。 
+欢迎加入我们的[Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp), 或[DingDing群](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) 。
 
 ## 参考文献
 如果您发现我们的工作对您的研发有帮助，请引用以下[论文](https://arxiv.org/abs/2309.02033) 。
@@ -315,4 +315,4 @@ eprint={2309.02033},
 archivePrefix={arXiv},
 primaryClass={cs.LG}
 }
-```
+```
diff --git a/app.py b/app.py
@@ -230,7 +230,7 @@ class Visualize:
     @staticmethod
     def filter_dataset(dataset):
         if Fields.stats not in dataset.features:
-            return 
+            return
         text_key = st.session_state.get('text_key', 'text')
         text = dataset[text_key]
         stats = pd.DataFrame(dataset[Fields.stats])

diff --git a/configs/README.md b/configs/README.md
@@ -29,4 +29,4 @@ We have reproduced the processing flow of some RedPajama datasets. Please refer
 We have reproduced the processing flow of some BLOOM datasets. please refer to the [reproduced_bloom](reproduced_bloom) folder for details.
 
 ### Data-Juicer Recipes
-We have refined some open source datasets (including CFT datasets) by using Data-Juicer and have provided configuration files for the refined flow. please refer to the [data_juicer_recipes](data_juicer_recipes) folder for details.
+We have refined some open source datasets (including CFT datasets) by using Data-Juicer and have provided configuration files for the refined flow. please refer to the [data_juicer_recipes](data_juicer_recipes) folder for details.
diff --git a/configs/README_ZH.md b/configs/README_ZH.md
@@ -30,4 +30,4 @@ Demo 配置文件用于帮助用户快速熟悉 Data-Juicer 的基本功能，
 我们已经重现了部分 BLOOM 数据集的处理流程，请参阅 [reproduced_bloom](reproduced_bloom) 文件夹以获取详细说明。
 
 ### Data-Juicer 菜谱
-我们使用 Data-Juicer 更细致地处理了一些开源数据集（包含 CFT 数据集），并提供了处理流程的配置文件。请参阅 [data_juicer_recipes](data_juicer_recipes) 文件夹以获取详细说明。
+我们使用 Data-Juicer 更细致地处理了一些开源数据集（包含 CFT 数据集），并提供了处理流程的配置文件。请参阅 [data_juicer_recipes](data_juicer_recipes) 文件夹以获取详细说明。
diff --git a/configs/data_juicer_recipes/alpaca_cot/README.md b/configs/data_juicer_recipes/alpaca_cot/README.md
@@ -6,7 +6,7 @@ This folder contains some configuration files to allow users to easily and quick
 
 The raw data files can be downloaded from [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT) on HuggingFace.
 
-### Convert raw Alpaca-CoT data to jsonl 
+### Convert raw Alpaca-CoT data to jsonl
 Use [raw_alpaca_cot_merge_add_meta.py](../../../tools/preprocess/raw_alpaca_cot_merge_add_meta.py) to select `instruction`, `input` and `output` columns and merge them to `text` field with a space, and add extra [ META ]( #meta_info) info to dataset:
 
 ```shell
@@ -66,7 +66,7 @@ Each sample in refined data of Alpaca-CoT contains meta info listed as below:
   * `CFT-SR`: tagged as Single-round Dialog datasets
 
   * `CFT-MR`: tagged as Multi-round Dialog datasets
-  
+
   * `CFT-P`: tagged as Preference datasets
 
 
@@ -111,4 +111,4 @@ Each sample in refined data of Alpaca-CoT contains meta info listed as below:
 | StackExchange        	| MT    	| COL 	| EN    	| StackExchange        	|     	| ✅       	|         	| ✅              	|
 | ConvAI2              	| TS    	| HG  	| EN    	| ConvAI2              	|     	| ✅       	|         	|                	|
 | FastChat             	| MT    	| SI  	| EN    	| FastChat             	|     	| ✅       	|         	|                	|
-| Tabular-LLM-Data     	| MT    	| COL 	| EN/CN 	| Tabular-LLM-Data     	| ✅   	|         	|         	|                	|
+| Tabular-LLM-Data     	| MT    	| COL 	| EN/CN 	| Tabular-LLM-Data     	| ✅   	|         	|         	|                	|
diff --git a/configs/data_juicer_recipes/alpaca_cot/README_ZH.md b/configs/data_juicer_recipes/alpaca_cot/README_ZH.md
@@ -111,4 +111,4 @@ python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alp
 | StackExchange        	| MT    	| COL      	| EN    	| StackExchange        	|     	| ✅      	|        	| ✅      	|
 | ConvAI2              	| TS    	| HG       	| EN    	| ConvAI2              	|     	| ✅      	|        	|        	|
 | FastChat             	| MT    	| SI       	| EN    	| FastChat             	|     	| ✅      	|        	|        	|
-| Tabular-LLM-Data     	| MT    	| COL      	| EN/CN 	| Tabular-LLM-Data     	| ✅   	|        	|        	|        	|
+| Tabular-LLM-Data     	| MT    	| COL      	| EN/CN 	| Tabular-LLM-Data     	| ✅   	|        	|        	|        	|
diff --git a/configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml b/configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml
@@ -10,23 +10,23 @@ open_tracer: true
 # a list of several process operators with their arguments
 process:
   - document_deduplicator: # 104636705
-      lowercase: true 
+      lowercase: true
       ignore_non_character: true
-      
+
   - alphanumeric_filter: # 104636381
       tokenization: false
-      min_ratio: 0.1  
+      min_ratio: 0.1
   - character_repetition_filter: # 104630030
       rep_len: 10
-      max_ratio: 0.6  
+      max_ratio: 0.6
   - flagged_words_filter: # 104576967
       lang: en
       tokenization: true
-      max_ratio: 0.017  
+      max_ratio: 0.017
   - maximum_line_length_filter: # 104575811
       min_len: 20
   - text_length_filter: # 104573711
-      min_len: 30 
+      min_len: 30
 
   - document_simhash_deduplicator:  # 72855345
       tokenization: space

diff --git a/configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml b/configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml
@@ -15,10 +15,10 @@ process:
 
   - alphanumeric_filter: # 16957388
       tokenization: false
-      min_ratio: 0.10  
+      min_ratio: 0.10
   - character_repetition_filter: # 16956845
       rep_len: 10
-      max_ratio: 0.6  
+      max_ratio: 0.6
   - flagged_words_filter: # 16954629
       lang: zh
       tokenization: true

diff --git a/configs/data_juicer_recipes/redpajama-c4-refine.yaml b/configs/data_juicer_recipes/redpajama-c4-refine.yaml
@@ -49,4 +49,4 @@ process:
       lowercase: true
       ignore_pattern: '\p{P}'
       num_blocks: 6
-      hamming_distance: 4
+      hamming_distance: 4
diff --git a/data_juicer/analysis/overall_analysis.py b/data_juicer/analysis/overall_analysis.py
@@ -3,6 +3,8 @@
 import pandas as pd
 
 from data_juicer.utils.constant import Fields
+
+
 class OverallAnalysis:
     """Apply analysis on the overall stats, including mean, std, quantiles,
     etc."""