added auto-HPO feature with WandB (#65)
* added auto-HPO feature with WandB

* added auto-HPO feature with WandB.
- The core modifications are in tools/hpo, and data_juicer/config.py.
- The others are from pre-commit run

* minor fix for relative import

* fix according to yilun's comments

* fix according to yilun's comments

* fix according to yilun's comments
yxdyc authored Nov 8, 2023
1 parent c938c15 commit e18292a
Showing 105 changed files with 1,999 additions and 1,252 deletions.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug_report.yml
@@ -113,4 +113,4 @@ body:
- type: textarea
attributes:
label: Additional 额外信息
description: Anything else you would like to share? 其他您想分享的信息。
description: Anything else you would like to share? 其他您想分享的信息。
2 changes: 0 additions & 2 deletions .github/ISSUE_TEMPLATE/custom.md
@@ -6,5 +6,3 @@ labels: ''
assignees: ''

---


2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/feature_request.yml
@@ -49,4 +49,4 @@ body:
(Optional) We encourage you to submit a [Pull Request](https://github.com/alibaba/data-juicer/pulls) (PR) to help improve Data-Juicer for everyone, especially if you have a good understanding of how to implement a fix or feature.
(可选项)我们鼓励您提交一个 [Pull Request (PR)]([Pull Request](https://github.com/alibaba/data-juicer/pulls)) 来为开源社区提升 Data-Juicer 的能力,尤其是如果您对如何实现或者修复一个功能有比较不错的理解的时候~
options:
- label: Yes I'd like to help by submitting a PR! 是的!我愿意提供帮助并提交一个PR!
- label: Yes I'd like to help by submitting a PR! 是的!我愿意提供帮助并提交一个PR!
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/question.yml
@@ -48,4 +48,4 @@ body:
- type: textarea
attributes:
label: Additional 额外信息
description: Anything else you would like to share? 其他您想分享的信息。
description: Anything else you would like to share? 其他您想分享的信息。
1 change: 1 addition & 0 deletions .gitignore
@@ -13,4 +13,5 @@ dist
# others
.DS_Store
.idea/
wandb/
__pycache__
2 changes: 0 additions & 2 deletions LICENSE
@@ -417,5 +417,3 @@ distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


26 changes: 13 additions & 13 deletions README.md
@@ -1,4 +1,4 @@
English | [**中文**](README_ZH.md)
English | [**中文**](README_ZH.md)

# Data-Juicer: A One-Stop Data Processing System for Large Language Models

@@ -26,7 +26,7 @@ English | [**中文**](README_ZH.md)
[![QualityClassifier](https://img.shields.io/badge/Tools-Quality_Classifier-saddlebrown?logo=Markdown)](tools/quality_classifier/README.md)
[![AutoEvaluation](https://img.shields.io/badge/Tools-Auto_Evaluation-saddlebrown?logo=Markdown)](tools/evaluator/README.md)

Data-Juicer is a one-stop data processing system to make data higher-quality,
Data-Juicer is a one-stop data processing system to make data higher-quality,
juicier, and more digestible for LLMs.
This project is being actively updated and maintained, and we will periodically enhance and add more features and data recipes. We welcome you to join us in promoting LLM data development and research!

@@ -68,22 +68,22 @@ Table of Contents

![Overview](https://img.alicdn.com/imgextra/i2/O1CN01IMPeD11xYRUYLmXKO_!!6000000006455-2-tps-3620-1604.png)

- **Systematic & Reusable**:
Empowering users with a systematic library of 20+ reusable [config recipes](configs), 50+ core [OPs](docs/Operators.md), and feature-rich
dedicated [toolkits](#documentation), designed to
- **Systematic & Reusable**:
Empowering users with a systematic library of 20+ reusable [config recipes](configs), 50+ core [OPs](docs/Operators.md), and feature-rich
dedicated [toolkits](#documentation), designed to
function independently of specific LLM datasets and processing pipelines.

- **Data-in-the-loop**: Allowing detailed data analyses with an automated
- **Data-in-the-loop**: Allowing detailed data analyses with an automated
report generation feature for a deeper understanding of your dataset. Coupled with multi-dimension automatic evaluation capabilities, it supports a timely feedback loop at multiple stages in the LLM development process.
![Data-in-the-loop](https://img.alicdn.com/imgextra/i1/O1CN011E99C01ndLZ55iCUS_!!6000000005112-0-tps-2701-1050.jpg)

- **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data
processing recipes](configs/data_juicer_recipes/README.md) for pre-training, fine-tuning, en, zh, and more scenarios. Validated on
reference LLaMA models.
- **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data
processing recipes](configs/data_juicer_recipes/README.md) for pre-training, fine-tuning, en, zh, and more scenarios. Validated on
reference LLaMA models.
![exp_llama](https://img.alicdn.com/imgextra/i2/O1CN019WtUPP1uhebnDlPR8_!!6000000006069-2-tps-2530-1005.png)

- **Enhanced Efficiency**: Providing a speedy data processing pipeline
requiring less memory and CPU usage, optimized for maximum productivity.
- **Enhanced Efficiency**: Providing a speedy data processing pipeline
requiring less memory and CPU usage, optimized for maximum productivity.
![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)


@@ -137,13 +137,13 @@ pip install py-data-juicer

### Using Docker

- You can
- You can
- either pull our pre-built image from DockerHub:
```shell
docker pull datajuicer/data-juicer:<version_tag>
```

- or run the following command to build the docker image including the
- or run the following command to build the docker image including the
latest `data-juicer` with provided [Dockerfile](Dockerfile):

```shell
6 changes: 3 additions & 3 deletions README_ZH.md
@@ -76,7 +76,7 @@ Data-Juicer 是一个一站式数据处理系统,旨在为大语言模型 (LLM
* **效率增强**:提供高效的数据处理流水线,减少内存占用和CPU开销,提高生产力。 ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)

* **用户友好**:设计简单易用,提供全面的[文档](#documentation)、简易[入门指南](#快速上手)[演示配置](configs/README_ZH.md),并且可以轻松地添加/删除[现有配置](configs/config_all.yaml)中的算子。

* **灵活 & 易扩展**:支持大多数数据格式(如jsonl、parquet、csv等),并允许灵活组合算子。支持[自定义算子](docs/DeveloperGuide_ZH.md#构建自己的算子),以执行定制化的数据处理。


@@ -301,7 +301,7 @@ Data-Juicer 在 Apache License 2.0 协议下发布。

我们非常欢迎贡献新功能、修复漏洞以及讨论。请参考[开发者指南](docs/DeveloperGuide_ZH.md)。

欢迎加入我们的[Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp), 或[DingDing群](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) 。
欢迎加入我们的[Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp), 或[DingDing群](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) 。

## 参考文献
如果您发现我们的工作对您的研发有帮助,请引用以下[论文](https://arxiv.org/abs/2309.02033) 。
@@ -315,4 +315,4 @@ eprint={2309.02033},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
```
2 changes: 1 addition & 1 deletion app.py
@@ -230,7 +230,7 @@ class Visualize:
@staticmethod
def filter_dataset(dataset):
if Fields.stats not in dataset.features:
return
return
text_key = st.session_state.get('text_key', 'text')
text = dataset[text_key]
stats = pd.DataFrame(dataset[Fields.stats])
2 changes: 1 addition & 1 deletion configs/README.md
@@ -29,4 +29,4 @@ We have reproduced the processing flow of some RedPajama datasets. Please refer
We have reproduced the processing flow of some BLOOM datasets. please refer to the [reproduced_bloom](reproduced_bloom) folder for details.

### Data-Juicer Recipes
We have refined some open source datasets (including CFT datasets) by using Data-Juicer and have provided configuration files for the refined flow. please refer to the [data_juicer_recipes](data_juicer_recipes) folder for details.
We have refined some open source datasets (including CFT datasets) by using Data-Juicer and have provided configuration files for the refined flow. please refer to the [data_juicer_recipes](data_juicer_recipes) folder for details.
2 changes: 1 addition & 1 deletion configs/README_ZH.md
@@ -30,4 +30,4 @@ Demo 配置文件用于帮助用户快速熟悉 Data-Juicer 的基本功能,
我们已经重现了部分 BLOOM 数据集的处理流程,请参阅 [reproduced_bloom](reproduced_bloom) 文件夹以获取详细说明。

### Data-Juicer 菜谱
我们使用 Data-Juicer 更细致地处理了一些开源数据集(包含 CFT 数据集),并提供了处理流程的配置文件。请参阅 [data_juicer_recipes](data_juicer_recipes) 文件夹以获取详细说明。
我们使用 Data-Juicer 更细致地处理了一些开源数据集(包含 CFT 数据集),并提供了处理流程的配置文件。请参阅 [data_juicer_recipes](data_juicer_recipes) 文件夹以获取详细说明。
6 changes: 3 additions & 3 deletions configs/data_juicer_recipes/alpaca_cot/README.md
@@ -6,7 +6,7 @@ This folder contains some configuration files to allow users to easily and quick

The raw data files can be downloaded from [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT) on HuggingFace.

### Convert raw Alpaca-CoT data to jsonl
### Convert raw Alpaca-CoT data to jsonl
Use [raw_alpaca_cot_merge_add_meta.py](../../../tools/preprocess/raw_alpaca_cot_merge_add_meta.py) to select `instruction`, `input` and `output` columns and merge them to `text` field with a space, and add extra [ META ]( #meta_info) info to dataset:

```shell
@@ -66,7 +66,7 @@ Each sample in refined data of Alpaca-CoT contains meta info listed as below:
* `CFT-SR`: tagged as Single-round Dialog datasets

* `CFT-MR`: tagged as Multi-round Dialog datasets

* `CFT-P`: tagged as Preference datasets


@@ -111,4 +111,4 @@ Each sample in refined data of Alpaca-CoT contains meta info listed as below:
| StackExchange | MT | COL | EN | StackExchange | || ||
| ConvAI2 | TS | HG | EN | ConvAI2 | || | |
| FastChat | MT | SI | EN | FastChat | || | |
| Tabular-LLM-Data | MT | COL | EN/CN | Tabular-LLM-Data || | | |
| Tabular-LLM-Data | MT | COL | EN/CN | Tabular-LLM-Data || | | |
2 changes: 1 addition & 1 deletion configs/data_juicer_recipes/alpaca_cot/README_ZH.md
@@ -111,4 +111,4 @@ python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alp
| StackExchange | MT | COL | EN | StackExchange | || ||
| ConvAI2 | TS | HG | EN | ConvAI2 | || | |
| FastChat | MT | SI | EN | FastChat | || | |
| Tabular-LLM-Data | MT | COL | EN/CN | Tabular-LLM-Data || | | |
| Tabular-LLM-Data | MT | COL | EN/CN | Tabular-LLM-Data || | | |
12 changes: 6 additions & 6 deletions configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml
@@ -10,23 +10,23 @@ open_tracer: true
# a list of several process operators with their arguments
process:
- document_deduplicator: # 104636705
lowercase: true
lowercase: true
ignore_non_character: true

- alphanumeric_filter: # 104636381
tokenization: false
min_ratio: 0.1
min_ratio: 0.1
- character_repetition_filter: # 104630030
rep_len: 10
max_ratio: 0.6
max_ratio: 0.6
- flagged_words_filter: # 104576967
lang: en
tokenization: true
max_ratio: 0.017
max_ratio: 0.017
- maximum_line_length_filter: # 104575811
min_len: 20
- text_length_filter: # 104573711
min_len: 30
min_len: 30

- document_simhash_deduplicator: # 72855345
tokenization: space
@@ -15,10 +15,10 @@ process:

- alphanumeric_filter: # 16957388
tokenization: false
min_ratio: 0.10
min_ratio: 0.10
- character_repetition_filter: # 16956845
rep_len: 10
max_ratio: 0.6
max_ratio: 0.6
- flagged_words_filter: # 16954629
lang: zh
tokenization: true
2 changes: 1 addition & 1 deletion configs/data_juicer_recipes/redpajama-c4-refine.yaml
@@ -49,4 +49,4 @@ process:
lowercase: true
ignore_pattern: '\p{P}'
num_blocks: 6
hamming_distance: 4
hamming_distance: 4
2 changes: 2 additions & 0 deletions data_juicer/analysis/overall_analysis.py
@@ -3,6 +3,8 @@
import pandas as pd

from data_juicer.utils.constant import Fields


class OverallAnalysis:
"""Apply analysis on the overall stats, including mean, std, quantiles,
etc."""