Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync terms according to arXiv paper v2 #31

Merged
merged 4 commits into from
Oct 11, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 14 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ Table of Contents

## Features

![Overview](https://img.alicdn.com/imgextra/i4/O1CN01uvLL0T1VIaX28dMLg_!!6000000002630-2-tps-2509-1192.png)
![Overview](https://img.alicdn.com/imgextra/i2/O1CN01IMPeD11xYRUYLmXKO_!!6000000006455-2-tps-3620-1604.png)

- **Systematic & Reusable**:
Empowering users with a systematic library of 20+ reusable [config recipes](configs), 50+ core [OPs](docs/Operators.md), and feature-rich
Expand All @@ -64,8 +64,7 @@ Table of Contents
![Data-in-the-loop](https://img.alicdn.com/imgextra/i1/O1CN011E99C01ndLZ55iCUS_!!6000000005112-0-tps-2701-1050.jpg)

- **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data
processing recipes](configs/data_juicer_recipes/README.md) for
pre-training, post-tuning, en, zh, and more scenarios. Validated on
processing recipes](configs/data_juicer_recipes/README.md) for pre-training, fine-tuning, en, zh, and more scenarios. Validated on
reference LLaMA models.
![exp_llama](https://img.alicdn.com/imgextra/i1/O1CN01D0Nuof1uOPxvBVEq9_!!6000000006027-0-tps-2324-754.jpg)

Expand Down Expand Up @@ -124,11 +123,18 @@ pip install py-data-juicer

### Using Docker

- Run the following command to build the docker image including the latest `data-juicer` with provided [Dockerfile](Dockerfile):
- You can
- either pull our pre-built image from DockerHub:
```shell
docker pull datajuicer/data-juicer:<version_tag>
```

```shell
docker build -t data-juicer:<version_tag> .
```
- or run the following command to build the docker image including the
latest `data-juicer` with provided [Dockerfile](Dockerfile):

```shell
docker build -t data-juicer:<version_tag> .
```

### Installation check

Expand Down Expand Up @@ -271,7 +277,7 @@ docker exec -it <container_id> bash
- [Recipes for data process in BLOOM](configs/reproduced_bloom/README.md)
- [Recipes for data process in RedPajama](configs/redpajama/README.md)
- [Refined recipes for pre-training data](configs/data_juicer_recipes/README.md)
- [Refined recipes for post-tuning data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-alpaca-cot-dataset)
- [Refined recipes for fine-tuning data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-alpaca-cot-dataset)

## Demos
- Introduction to Data-Juicer [[ModelScope](https://modelscope.cn/studios/Data-Juicer/overview_scan/summary)]
Expand Down
19 changes: 12 additions & 7 deletions README_ZH.md
Original file line number Diff line number Diff line change
Expand Up @@ -51,13 +51,13 @@ Data-Juicer 是一个一站式数据处理系统,旨在为大语言模型 (LLM

## 特点

![Overview](https://img.alicdn.com/imgextra/i4/O1CN01uvLL0T1VIaX28dMLg_!!6000000002630-2-tps-2509-1192.png)
![Overview](https://img.alicdn.com/imgextra/i2/O1CN01IMPeD11xYRUYLmXKO_!!6000000006455-2-tps-3620-1604.png)

* **系统化 & 可复用**:为用户提供系统化且可复用的20+[配置菜谱](configs/README_ZH.md),50+核心[算子](docs/Operators_ZH.md)和专用[工具池](#documentation),旨在让数据处理独立于特定的大语言模型数据集和处理流水线。

* **数据反馈回路**:支持详细的数据分析,并提供自动报告生成功能,使您深入了解您的数据集。结合多维度自动评估功能,支持在 LLM 开发过程的多个阶段进行及时反馈循环。 ![Data-in-the-loop](https://img.alicdn.com/imgextra/i1/O1CN011E99C01ndLZ55iCUS_!!6000000005112-0-tps-2701-1050.jpg)

* **全面的数据处理菜谱**:为pre-training、post-tuning、中英文等场景提供数十种[预构建的数据处理菜谱](configs/data_juicer_recipes/README_ZH.md)。 ![exp_llama](https://img.alicdn.com/imgextra/i1/O1CN01D0Nuof1uOPxvBVEq9_!!6000000006027-0-tps-2324-754.jpg)
* **全面的数据处理菜谱**:为pre-training、fine-tuning、中英文等场景提供数十种[预构建的数据处理菜谱](configs/data_juicer_recipes/README_ZH.md)。 ![exp_llama](https://img.alicdn.com/imgextra/i1/O1CN01D0Nuof1uOPxvBVEq9_!!6000000006027-0-tps-2324-754.jpg)

* **效率增强**:提供高效的数据处理流水线,减少内存占用和CPU开销,提高生产力。 ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)

Expand Down Expand Up @@ -112,11 +112,16 @@ pip install py-data-juicer

### 使用 Docker 安装

- 运行如下命令用我们提供的 [Dockerfile](Dockerfile) 来构建包括最新版本的 `data-juicer` 的 docker 镜像:
- 您可以选择
- 从DockerHub直接拉取我们的预置镜像:
```shell
docker pull datajuicer/data-juicer:<version_tag>
```
- 或者运行如下命令用我们提供的 [Dockerfile](Dockerfile) 来构建包括最新版本的 `data-juicer` 的 docker 镜像:

```shell
docker build -t data-juicer:<version_tag> .
```
```shell
docker build -t data-juicer:<version_tag> .
```

### 安装校验

Expand Down Expand Up @@ -253,7 +258,7 @@ docker exec -it <container_id> bash
* [BLOOM 数据处理菜谱](configs/reproduced_bloom/README_ZH.md)
* [RedPajama 数据处理菜谱](configs/reproduced_redpajama/README_ZH.md)
* [预训练数据增强菜谱](configs/data_juicer_recipes/README_ZH.md)
* [Post-tuning数据增强菜谱](configs/data_juicer_recipes/README_ZH.md#完善前后的alpaca-cot数据集)
* [Fine-tuning数据增强菜谱](configs/data_juicer_recipes/README_ZH.md#完善前后的alpaca-cot数据集)

## 演示样例

Expand Down
2 changes: 1 addition & 1 deletion app.py
Original file line number Diff line number Diff line change
Expand Up @@ -423,7 +423,7 @@ def set_sliders(total_stats, ordered):

@staticmethod
def diversity():
with st.expander('Diversity for sft dataset', expanded=False):
with st.expander('Diversity for CFT dataset', expanded=False):
dataset = st.session_state.get('dataset', None)
cfg = st.session_state.get('cfg', parse_cfg()[2])
text_key = st.session_state.get('text_key', 'text')
Expand Down
2 changes: 1 addition & 1 deletion configs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,4 +29,4 @@ We have reproduced the processing flow of some RedPajama datasets. Please refer
We have reproduced the processing flow of some BLOOM datasets. please refer to the [reproduced_bloom](reproduced_bloom) folder for details.

### Data-Juicer Recipes
We have refined some open source datasets (including SFT datasets) by using Data-Juicer and have provided configuration files for the refined flow. please refer to the [data_juicer_recipes](data_juicer_recipes) folder for details.
We have refined some open source datasets (including CFT datasets) by using Data-Juicer and have provided configuration files for the refined flow. please refer to the [data_juicer_recipes](data_juicer_recipes) folder for details.
2 changes: 1 addition & 1 deletion configs/README_ZH.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,4 +30,4 @@ Demo 配置文件用于帮助用户快速熟悉 Data-Juicer 的基本功能,
我们已经重现了部分 BLOOM 数据集的处理流程,请参阅 [reproduced_bloom](reproduced_bloom) 文件夹以获取详细说明。

### Data-Juicer 菜谱
我们使用 Data-Juicer 更细致地处理了一些开源数据集(包含 SFT 数据集),并提供了处理流程的配置文件。请参阅 [data_juicer_recipes](data_juicer_recipes) 文件夹以获取详细说明。
我们使用 Data-Juicer 更细致地处理了一些开源数据集(包含 CFT 数据集),并提供了处理流程的配置文件。请参阅 [data_juicer_recipes](data_juicer_recipes) 文件夹以获取详细说明。
4 changes: 2 additions & 2 deletions configs/data_juicer_recipes/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,5 +33,5 @@ We use simple 3-σ rule to set the hyperparameters for ops in each recipe.

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|------------------|:-------------------------:|:--------------------------------------:|:----------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
| Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/SFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) | [39 Subsets of Alpaca-CoT](alpaca_cot/README.md#refined-alpaca-cot-dataset-meta-info) |
| Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/SFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) | [28 Subsets of Alpaca-CoT](alpaca_cot/README.md#refined-alpaca-cot-dataset-meta-info) |
| Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) | [39 Subsets of Alpaca-CoT](alpaca_cot/README.md#refined-alpaca-cot-dataset-meta-info) |
| Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) | [28 Subsets of Alpaca-CoT](alpaca_cot/README.md#refined-alpaca-cot-dataset-meta-info) |
4 changes: 2 additions & 2 deletions configs/data_juicer_recipes/README_ZH.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,5 +33,5 @@

| 数据子集 | 完善前的样本数目 | 完善后的样本数目 | 样本保留率 | 配置链接 | 数据链接 | 来源 |
|-------------------|:------------------------:|:----------------------------------:|:---------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------|
| Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/SFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) | [来自Alpaca-CoT的39个子集](alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
| Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/SFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) | [来自Alpaca-CoT的28个子集](alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
| Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) | [来自Alpaca-CoT的39个子集](alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
| Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) | [来自Alpaca-CoT的28个子集](alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
Loading