implement data-model sandbox, with refactoring existing DJ's features and tools #291

Merged
merged 17 commits into from
Apr 22, 2024
31 changes: 22 additions & 9 deletions README.md
@@ -74,6 +74,7 @@ Table of Contents
- [Data Analysis](#data-analysis)
- [Data Visualization](#data-visualization)
- [Build Up Config Files](#build-up-config-files)
- [Sandbox](#sandbox)
- [Preprocess Raw Data (Optional)](#preprocess-raw-data-optional)
- [For Docker Users](#for-docker-users)
- [Data Recipes](#data-recipes)
@@ -90,25 +91,25 @@ Table of Contents
- **Systematic & Reusable**:
Empowering users with a systematic library of 80+ core [OPs](docs/Operators.md), 20+ reusable [config recipes](configs), and 20+ feature-rich
dedicated [toolkits](#documentation), designed to
function independently of specific LLM datasets and processing pipelines.
function independently of specific multimodal LLM datasets and processing pipelines.

- **Data-in-the-loop**: Allowing detailed data analyses with an automated
report generation feature for a deeper understanding of your dataset. Coupled with multi-dimension automatic evaluation capabilities, it supports a timely feedback loop at multiple stages in the LLM development process.
- **Data-in-the-loop & Sandbox**: Supporting one-stop data-model collaborative development, enabling rapid iteration
through the [sandbox laboratory](docs/Sandbox.md), and providing features such as feedback loops based on data and model,
visualization, and multidimensional automatic evaluation, so that you can better understand and improve your data and models.
![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg)

- **Enhanced Efficiency**: Providing efficient and parallel data processing pipelines (Aliyun-PAI\Ray\Slurm\CUDA\OP Fusion)
requiring less memory and CPU usage, optimized for maximum productivity.
![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)

- **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data
processing recipes](configs/data_juicer_recipes/README.md) for pre-training, fine-tuning, en, zh, and more scenarios. Validated on
reference LLaMA and LLaVA models.
![exp_llama](https://img.alicdn.com/imgextra/i2/O1CN019WtUPP1uhebnDlPR8_!!6000000006069-2-tps-2530-1005.png)

- **Enhanced Efficiency**: Providing a speedy data processing pipeline
requiring less memory and CPU usage, optimized for maximum productivity.
![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)


- **Flexible & Extensible**: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to [implement your own OPs](docs/DeveloperGuide.md#build-your-own-ops) for customizable data processing.

- **User-Friendly Experience**: Designed for simplicity, with [comprehensive documentation](#documentation), [easy start guides](#quick-start) and [demo configs](configs/README.md), and intuitive configuration with simple adding/removing OPs from [existing configs](configs/config_all.yaml).
- **User-Friendly Experience**: Designed for simplicity, with [comprehensive documentation](#documents), [easy start guides](#quick-start) and [demo configs](configs/README.md), and intuitive configuration with simple adding/removing OPs from [existing configs](configs/config_all.yaml).



@@ -320,6 +321,18 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang

![Basic config example of format and definition](https://img.alicdn.com/imgextra/i1/O1CN01uXgjgj1khWKOigYww_!!6000000004715-0-tps-1745-871.jpg "Basic config file example")

### Sandbox

The data sandbox laboratory (DJ-Sandbox) offers users best practices for continuously producing data recipes. It is designed to be low-overhead, portable, and instructive.

- In the sandbox, users can quickly experiment, iterate, and refine data recipes based on small-scale datasets and models, before scaling up to produce high-quality data to serve large-scale models.
- In addition to the basic data optimization and recipe refinement features offered by Data-Juicer, users can seamlessly use configurable components such as data probe and analysis, model training and evaluation, and data and model feedback-based recipe refinement to form a complete one-stop data-model research and development pipeline.

By default, the sandbox is launched with the following command. For more details, please refer to the [sandbox documentation](docs/Sandbox.md).
```shell
python tools/sandbox_starter.py --config configs/demo/sandbox/sandbox.yaml
```

### Preprocess Raw Data (Optional)
- Our formatters support some common input dataset formats for now:
- Multi-sample in one file: jsonl/json, parquet, csv/tsv, etc.
24 changes: 19 additions & 5 deletions README_ZH.md
@@ -67,6 +67,7 @@ Data-Juicer(包含[DJ-SORA](docs/DJ_SORA_ZH.md))正在积极更新和维护
- [数据分析](#数据分析)
- [数据可视化](#数据可视化)
- [构建配置文件](#构建配置文件)
- [沙盒实验室](#沙盒实验室)
- [预处理原始数据(可选)](#预处理原始数据可选)
- [对于 Docker 用户](#对于-docker-用户)
- [数据处理菜谱](#数据处理菜谱)
@@ -80,15 +81,15 @@ Data-Juicer(包含[DJ-SORA](docs/DJ_SORA_ZH.md))正在积极更新和维护

![Overview](https://img.alicdn.com/imgextra/i4/O1CN01WYQP3Z1JHsaXaQDK6_!!6000000001004-0-tps-3640-1812.jpg)

* **系统化 & 可复用**:为用户提供系统化且可复用的80+核心[算子](docs/Operators_ZH.md),20+[配置菜谱](configs/README_ZH.md)和20+专用[工具池](#documentation),旨在让数据处理独立于特定的大语言模型数据集和处理流水线
* **系统化 & 可复用**:为用户提供系统化且可复用的80+核心[算子](docs/Operators_ZH.md),20+[配置菜谱](configs/README_ZH.md)和20+专用[工具池](#documentation),旨在让多模态数据处理独立于特定的大语言模型数据集和处理流水线

* **数据反馈回路**:支持详细的数据分析,并提供自动报告生成功能,使您深入了解您的数据集。结合多维度自动评估功能,支持在 LLM 开发过程的多个阶段进行及时反馈循环。 ![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg)
* **数据反馈回路 & 沙盒实验室**:支持一站式数据-模型协同开发,通过[沙盒实验室](docs/Sandbox-ZH.md)快速迭代,基于数据和模型反馈回路、可视化和多维度自动评估等功能,使您更了解和改进您的数据和模型。 ![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg)

* **全面的数据处理菜谱**:为pre-training、fine-tuning、中英文等场景提供数十种[预构建的数据处理菜谱](configs/data_juicer_recipes/README_ZH.md)。 在LLaMA、LLaVA等模型上有效验证。 ![exp_llama](https://img.alicdn.com/imgextra/i2/O1CN019WtUPP1uhebnDlPR8_!!6000000006069-2-tps-2530-1005.png)
* **效率增强**:提供高效并行化的数据处理流水线(Aliyun-PAI\Ray\Slurm\CUDA\算子融合),减少内存占用和CPU开销,提高生产力。 ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)

* **效率增强**:提供高效的数据处理流水线,减少内存占用和CPU开销,提高生产力。 ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)
* **全面的数据处理菜谱**:为pre-training、fine-tuning、中英文等场景提供数十种[预构建的数据处理菜谱](configs/data_juicer_recipes/README_ZH.md)。 在LLaMA、LLaVA等模型上有效验证。 ![exp_llama](https://img.alicdn.com/imgextra/i2/O1CN019WtUPP1uhebnDlPR8_!!6000000006069-2-tps-2530-1005.png)

* **用户友好**:设计简单易用,提供全面的[文档](#documentation)、简易[入门指南](#快速上手)和[演示配置](configs/README_ZH.md),并且可以轻松地添加/删除[现有配置](configs/config_all.yaml)中的算子。
* **用户友好**:设计简单易用,提供全面的[文档](#documents)、简易[入门指南](#快速上手)和[演示配置](configs/README_ZH.md),并且可以轻松地添加/删除[现有配置](configs/config_all.yaml)中的算子。

* **灵活 & 易扩展**:支持大多数数据格式(如jsonl、parquet、csv等),并允许灵活组合算子。支持[自定义算子](docs/DeveloperGuide_ZH.md#构建自己的算子),以执行定制化的数据处理。

@@ -295,6 +296,19 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang

![基础配置项格式及定义样例](https://img.alicdn.com/imgextra/i4/O1CN01xPtU0t1YOwsZyuqCx_!!6000000003050-0-tps-1692-879.jpg "基础配置文件样例")

### 沙盒实验室

数据沙盒实验室 (DJ-Sandbox) 为用户提供了持续生产数据菜谱的最佳实践,其具有低开销、可迁移、有指导性等特点。
- 用户在沙盒中可以基于一些小规模数据集、模型对数据菜谱进行快速实验、迭代、优化,再迁移到更大尺度上,大规模生产高质量数据以服务大模型。
- 用户在沙盒中,除了Data-Juicer基础的数据优化与数据菜谱微调功能外,还可以便捷地使用数据洞察与分析、沙盒模型训练与评测、基于数据和模型反馈优化数据菜谱等可配置组件,共同组成完整的一站式数据-模型研发流水线。

沙盒默认通过如下命令运行,更多介绍和细节请参阅[沙盒文档](docs/Sandbox-ZH.md).
```shell
python tools/sandbox_starter.py --config configs/demo/sandbox/sandbox.yaml
```



### 预处理原始数据(可选)

* 我们的 Formatter 目前支持一些常见的输入数据集格式:
16 changes: 15 additions & 1 deletion configs/config_all.yaml
@@ -10,7 +10,7 @@ export_path: '/path/to/result/dataset.jsonl' # path to processed
export_shard_size: 0 # shard size of the exported dataset in bytes. The default is 0, which exports the whole dataset into a single file. If it's set to a positive number, the exported dataset is split into several shards, and the size of each shard won't exceed export_shard_size
export_in_parallel: false # whether to export the result dataset in parallel to a single file, which usually takes less time. It only works when export_shard_size is 0, and its default number of processes is the same as the argument np. **Notice**: If it's True, exporting in parallel might sometimes take much longer due to IO blocking, especially for very large datasets. When this happens, False is the better choice, even though it is usually the slower option.
np: 4 # number of subprocesses used to process your dataset
text_keys: 'content' # the key name of the field that holds the sample texts to be processed, e.g., `text`, `instruction`, `output`, ...
text_keys: 'text' # the key name of the field that holds the sample texts to be processed, e.g., `text`, `instruction`, `output`, ...
# Note: currently, we support specifying only ONE key for each op. For cases requiring multiple keys, users can specify the op multiple times. Only the first key of `text_keys` is used when you set multiple keys.
suffixes: [] # the suffix of files that will be read. For example: '.txt', 'txt' or ['txt', '.pdf', 'docx']
use_cache: true # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
@@ -22,6 +22,8 @@ op_list_to_trace: [] # only ops in this l
trace_num: 10 # number of samples to show the differences between datasets before and after each op. Only available when tracer is opened.
op_fusion: false # whether to fuse operators that share the same intermediate variables automatically. Op fusion might reduce the memory requirements slightly but speed up the whole process.
cache_compress: null # the compression method of the cache file, which can be specified in ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend you turn on this argument when your input dataset is larger than tens of GB and your disk space is not enough.
keep_stats_in_res_ds: false # whether to keep the computed stats in the result dataset. The intermediate fields that store the stats computed by Filters are removed if it's False. It's False by default.
keep_hashes_in_res_ds: false # whether to keep the computed hashes in the result dataset. The intermediate fields that store the hashes computed by Deduplicators are removed if it's False. It's False by default.

# for multimodal data processing
image_key: 'images' # key name of field to store the list of sample image paths.
@@ -40,6 +42,18 @@ ray_address: auto # the address of the
# only for data analysis
save_stats_in_one_file: false # whether to store all stats result into one file

# for sandbox or hpo
model_infer_config: null # path to, or dict of, the model inference configuration used when calling the model executor in the sandbox. Related hooks are disabled if it's not specified.
model_train_config: null # path to, or dict of, the model training configuration used when calling the model executor in the sandbox. Related hooks are disabled if it's not specified.
model_eval_config: null # path to, or dict of, the model evaluation configuration used when calling the model executor in the sandbox. Related hooks are disabled if it's not specified.
data_eval_config: null # path to, or dict of, the data evaluation configuration used when calling the model executor in the sandbox. Related hooks are disabled if it's not specified.
data_probe_algo: 'uniform' # sampling algorithm for the dataset. Should be one of ["uniform", "frequency_specified_field_selector", "topk_specified_field_selector"]. It's "uniform" by default. Only used for dataset sampling.
data_probe_ratio: 1.0 # the sampling ratio relative to the original dataset size. It's 1.0 by default. Only used for dataset sampling.
path_k_sigma_recipe: null # path to save the configuration file produced when using the k-sigma tool to refine processing recipes
path_model_feedback_recipe: null # path to save the configuration file refined by model feedback
hpo_config: null # path to a configuration file used by the auto-HPO tool.
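
The four `*_config` options above accept "path or dict" and disable their hooks when left as `null`. A minimal Python sketch of that contract follows. The helpers `resolve_config` and `enabled_hooks` are hypothetical names made up for this illustration, not Data-Juicer APIs, and the real loader may behave differently.

```python
import json
from pathlib import Path
from typing import Optional, Union


def resolve_config(value: Union[None, str, dict]) -> Optional[dict]:
    """Hypothetical helper: turn a 'path or dict' option into a dict, or None."""
    if value is None:
        return None  # option left unset: the related hook would stay disabled
    if isinstance(value, dict):
        return dict(value)  # inline dict: use a shallow copy as-is
    path = Path(value)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".json":
        return json.loads(text)
    # YAML parsing needs the third-party PyYAML package, imported lazily here.
    import yaml
    return yaml.safe_load(text)


def enabled_hooks(cfg: dict) -> list:
    """Return the option names whose config resolves to something usable."""
    keys = ["model_infer_config", "model_train_config",
            "model_eval_config", "data_eval_config"]
    return [k for k in keys if resolve_config(cfg.get(k)) is not None]


demo_cfg = {
    "model_train_config": None,
    "data_eval_config": {"type": "dj_text_quality_classifier"},
}
active = enabled_hooks(demo_cfg)
```

Accepting an inline dict as well as a path lets a small evaluator config live directly in the main YAML file, while larger training configs stay in their own files.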


# process schedule: a list of several process operators with their arguments
process:
# Mapper ops. Most of these ops need no arguments.
1 change: 1 addition & 0 deletions configs/demo/sandbox/gpt3_data_quality_eval_config.yaml
@@ -0,0 +1 @@
type: dj_text_quality_classifier
26 changes: 26 additions & 0 deletions configs/demo/sandbox/gpt3_extra_train_config.json
@@ -0,0 +1,26 @@
{
  "type": "modelscope",
  "model_name": "iic/nlp_gpt3_text-generation_chinese-base",
  "trainer_name": "nlp-base-trainer",
  "key_remapping": {
    "text": "src_txt"
  },
  "train": {
    "max_epochs": 3,
    "lr_scheduler": {
      "type": "StepLR",
      "step_size": 2,
      "options": {
        "by_epoch": false
      }
    },
    "optimizer": {
      "type": "AdamW",
      "lr": 3e-4
    },
    "dataloader": {
      "batch_size_per_gpu": 2,
      "workers_per_gpu": 0
    }
  }
}
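
The `lr_scheduler` block above selects `StepLR` with `step_size: 2` and `by_epoch: false`, starting from `lr: 3e-4`. As a rough Python sketch of what a step schedule computes: the decay factor `gamma` is not given in this config, so the value 0.1 below is an assumption (it matches the PyTorch `StepLR` default). Reading `by_epoch: false` as "count iterations rather than epochs" is also an assumption about the trainer.

```python
def step_lr(base_lr: float, step: int, step_size: int, gamma: float = 0.1) -> float:
    """Learning rate after `step` scheduler ticks under a step-decay schedule.

    gamma=0.1 is an assumed decay factor; the config above does not set one.
    """
    return base_lr * gamma ** (step // step_size)


# Learning rates for the first six ticks with the values from the config.
schedule = [step_lr(3e-4, s, step_size=2) for s in range(6)]
```

If the assumptions hold, the rate would shrink after only a few iterations with this demo config, so a longer run would likely want a larger `step_size`.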
18 changes: 18 additions & 0 deletions configs/demo/sandbox/gpt3_extra_train_config.yaml
@@ -0,0 +1,18 @@
type: modelscope
model_name: "iic/nlp_gpt3_text-generation_chinese-base"
trainer_name: "nlp-base-trainer"
key_remapping:
  text: "src_txt"
train:
  max_epochs: 2
  lr_scheduler:
    type: "StepLR"
    step_size: 2
    options:
      by_epoch: false
  optimizer:
    type: "AdamW"
    lr: 0.0003
  dataloader:
    batch_size_per_gpu: 2
    workers_per_gpu: 0
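
The `key_remapping` entry maps the dataset field `text` to `src_txt`, presumably the field name the trainer expects. A minimal Python sketch of such a renaming step follows. `remap_keys` is a hypothetical helper for illustration only; how the sandbox applies the mapping internally is not shown in this diff.

```python
def remap_keys(sample: dict, key_remapping: dict) -> dict:
    """Rename fields of one sample; keys without a mapping are kept unchanged."""
    return {key_remapping.get(key, key): value for key, value in sample.items()}


sample = {"text": "今天天气很好", "meta": {"source": "demo"}}
remapped = remap_keys(sample, {"text": "src_txt"})
```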
27 changes: 27 additions & 0 deletions configs/demo/sandbox/sandbox.yaml
@@ -0,0 +1,27 @@
# Sandbox config example for dataset

# global parameters
project_name: 'demo-sandbox'
dataset_path: './demos/data/demo-dataset.jsonl' # path to your dataset directory or file
np: 4 # number of subprocess to process your dataset

export_path: './outputs/demo-sandbox/demo-sandbox.jsonl'

# sandbox configs
# for refining recipe using k-sigma rules
path_k_sigma_recipe: './outputs/demo-sandbox/k_sigma_new_recipe.yaml'

# for gpt3 quality classifier as data evaluator
data_eval_config: 'configs/demo/sandbox/gpt3_data_quality_eval_config.yaml'
#data_eval_config:
# type: dj_text_quality_classifier

# for gpt3 model training
model_train_config: 'configs/demo/sandbox/gpt3_extra_train_config.json'

# process schedule
# a list of several process operators with their arguments
process:
  - language_id_score_filter:
      lang: 'zh'
      min_score: 0.5
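
This demo config pairs a `language_id_score_filter` with `path_k_sigma_recipe`. The Python sketch below illustrates both ideas under stated assumptions: the per-sample fields `lang` and `lang_score` are hypothetical precomputed stats, and the k-sigma rule is taken in its usual statistical sense (bounds at mean ± k standard deviations). The actual Data-Juicer operator and k-sigma tool may differ in detail.

```python
import statistics


def keep_by_language(samples, lang="zh", min_score=0.5):
    """Keep samples identified as `lang` with confidence of at least `min_score`."""
    return [s for s in samples
            if s["lang"] == lang and s["lang_score"] >= min_score]


def k_sigma_bounds(values, k=3.0):
    """Return (mean - k*std, mean + k*std) using the population std deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - k * std, mean + k * std


# Hypothetical samples with language-identification stats already attached.
samples = [
    {"text": "今天天气很好", "lang": "zh", "lang_score": 0.98},
    {"text": "hello world", "lang": "en", "lang_score": 0.95},
    {"text": "嗯 ok", "lang": "zh", "lang_score": 0.41},
]
kept = keep_by_language(samples, lang="zh", min_score=0.5)

# Candidate thresholds that a refined recipe could use in place of fixed ones.
low, high = k_sigma_bounds([s["lang_score"] for s in samples], k=1.0)
```

Deriving thresholds from the observed stat distribution, instead of fixing `min_score` by hand, is the kind of refinement a k-sigma rule suggests. The refined values would then be written to the recipe file named by `path_k_sigma_recipe`.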