Skip to content

Commit

Permalink
sync terms according to arXiv paper v2 (#31)
Browse files Browse the repository at this point in the history
* sync terms according to arXiv paper v2:
- post-tuning --> fine-tuning
- sft --> CFT-SR
- multi-round dialog --> CFT-MR
- preference --> CFT-P

* sync terms according to arXiv paper v2:
- post-tuning --> fine-tuning
- sft --> CFT-SR
- multi-round dialog --> CFT-MR
- preference --> CFT-P

* 新增dockerhub安装方式

* sft-->cft in file names
  • Loading branch information
yxdyc authored Oct 11, 2023
1 parent 5bd715d commit 1cb9262
Show file tree
Hide file tree
Showing 19 changed files with 272 additions and 298 deletions.
22 changes: 14 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ Table of Contents

## Features

![Overview](https://img.alicdn.com/imgextra/i4/O1CN01uvLL0T1VIaX28dMLg_!!6000000002630-2-tps-2509-1192.png)
![Overview](https://img.alicdn.com/imgextra/i2/O1CN01IMPeD11xYRUYLmXKO_!!6000000006455-2-tps-3620-1604.png)

- **Systematic & Reusable**:
Empowering users with a systematic library of 20+ reusable [config recipes](configs), 50+ core [OPs](docs/Operators.md), and feature-rich
Expand All @@ -64,8 +64,7 @@ Table of Contents
![Data-in-the-loop](https://img.alicdn.com/imgextra/i1/O1CN011E99C01ndLZ55iCUS_!!6000000005112-0-tps-2701-1050.jpg)

- **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data
processing recipes](configs/data_juicer_recipes/README.md) for
pre-training, post-tuning, en, zh, and more scenarios. Validated on
processing recipes](configs/data_juicer_recipes/README.md) for pre-training, fine-tuning, en, zh, and more scenarios. Validated on
reference LLaMA models.
![exp_llama](https://img.alicdn.com/imgextra/i1/O1CN01D0Nuof1uOPxvBVEq9_!!6000000006027-0-tps-2324-754.jpg)

Expand Down Expand Up @@ -124,11 +123,18 @@ pip install py-data-juicer

### Using Docker

- Run the following command to build the docker image including the latest `data-juicer` with provided [Dockerfile](Dockerfile):
- You can
- either pull our pre-built image from DockerHub:
```shell
docker pull datajuicer/data-juicer:<version_tag>
```

```shell
docker build -t data-juicer:<version_tag> .
```
- or run the following command to build the docker image including the
latest `data-juicer` with provided [Dockerfile](Dockerfile):

```shell
docker build -t data-juicer:<version_tag> .
```

### Installation check

Expand Down Expand Up @@ -271,7 +277,7 @@ docker exec -it <container_id> bash
- [Recipes for data process in BLOOM](configs/reproduced_bloom/README.md)
- [Recipes for data process in RedPajama](configs/redpajama/README.md)
- [Refined recipes for pre-training data](configs/data_juicer_recipes/README.md)
- [Refined recipes for post-tuning data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-alpaca-cot-dataset)
- [Refined recipes for fine-tuning data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-alpaca-cot-dataset)

## Demos
- Introduction to Data-Juicer [[ModelScope](https://modelscope.cn/studios/Data-Juicer/overview_scan/summary)]
Expand Down
19 changes: 12 additions & 7 deletions README_ZH.md
Original file line number Diff line number Diff line change
Expand Up @@ -51,13 +51,13 @@ Data-Juicer 是一个一站式数据处理系统,旨在为大语言模型 (LLM

## 特点

![Overview](https://img.alicdn.com/imgextra/i4/O1CN01uvLL0T1VIaX28dMLg_!!6000000002630-2-tps-2509-1192.png)
![Overview](https://img.alicdn.com/imgextra/i2/O1CN01IMPeD11xYRUYLmXKO_!!6000000006455-2-tps-3620-1604.png)

* **系统化 & 可复用**:为用户提供系统化且可复用的20+[配置菜谱](configs/README_ZH.md),50+核心[算子](docs/Operators_ZH.md)和专用[工具池](#documentation),旨在让数据处理独立于特定的大语言模型数据集和处理流水线。

* **数据反馈回路**:支持详细的数据分析,并提供自动报告生成功能,使您深入了解您的数据集。结合多维度自动评估功能,支持在 LLM 开发过程的多个阶段进行及时反馈循环。 ![Data-in-the-loop](https://img.alicdn.com/imgextra/i1/O1CN011E99C01ndLZ55iCUS_!!6000000005112-0-tps-2701-1050.jpg)

* **全面的数据处理菜谱**:为pre-training、post-tuning、中英文等场景提供数十种[预构建的数据处理菜谱](configs/data_juicer_recipes/README_ZH.md)![exp_llama](https://img.alicdn.com/imgextra/i1/O1CN01D0Nuof1uOPxvBVEq9_!!6000000006027-0-tps-2324-754.jpg)
* **全面的数据处理菜谱**:为pre-training、fine-tuning、中英文等场景提供数十种[预构建的数据处理菜谱](configs/data_juicer_recipes/README_ZH.md)![exp_llama](https://img.alicdn.com/imgextra/i1/O1CN01D0Nuof1uOPxvBVEq9_!!6000000006027-0-tps-2324-754.jpg)

* **效率增强**:提供高效的数据处理流水线,减少内存占用和CPU开销,提高生产力。 ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)

Expand Down Expand Up @@ -112,11 +112,16 @@ pip install py-data-juicer

### 使用 Docker 安装

- 运行如下命令用我们提供的 [Dockerfile](Dockerfile) 来构建包括最新版本的 `data-juicer` 的 docker 镜像:
- 您可以选择
- 从DockerHub直接拉取我们的预置镜像:
```shell
docker pull datajuicer/data-juicer:<version_tag>
```
- 或者运行如下命令用我们提供的 [Dockerfile](Dockerfile) 来构建包括最新版本的 `data-juicer` 的 docker 镜像:

```shell
docker build -t data-juicer:<version_tag> .
```
```shell
docker build -t data-juicer:<version_tag> .
```

### 安装校验

Expand Down Expand Up @@ -253,7 +258,7 @@ docker exec -it <container_id> bash
* [BLOOM 数据处理菜谱](configs/reproduced_bloom/README_ZH.md)
* [RedPajama 数据处理菜谱](configs/reproduced_redpajama/README_ZH.md)
* [预训练数据增强菜谱](configs/data_juicer_recipes/README_ZH.md)
* [Post-tuning数据增强菜谱](configs/data_juicer_recipes/README_ZH.md#完善前后的alpaca-cot数据集)
* [Fine-tuning数据增强菜谱](configs/data_juicer_recipes/README_ZH.md#完善前后的alpaca-cot数据集)

## 演示样例

Expand Down
2 changes: 1 addition & 1 deletion app.py
Original file line number Diff line number Diff line change
Expand Up @@ -423,7 +423,7 @@ def set_sliders(total_stats, ordered):

@staticmethod
def diversity():
with st.expander('Diversity for sft dataset', expanded=False):
with st.expander('Diversity for CFT dataset', expanded=False):
dataset = st.session_state.get('dataset', None)
cfg = st.session_state.get('cfg', parse_cfg()[2])
text_key = st.session_state.get('text_key', 'text')
Expand Down
2 changes: 1 addition & 1 deletion configs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,4 +29,4 @@ We have reproduced the processing flow of some RedPajama datasets. Please refer
We have reproduced the processing flow of some BLOOM datasets. please refer to the [reproduced_bloom](reproduced_bloom) folder for details.

### Data-Juicer Recipes
We have refined some open source datasets (including SFT datasets) by using Data-Juicer and have provided configuration files for the refined flow. please refer to the [data_juicer_recipes](data_juicer_recipes) folder for details.
We have refined some open source datasets (including CFT datasets) by using Data-Juicer and have provided configuration files for the refined flow. please refer to the [data_juicer_recipes](data_juicer_recipes) folder for details.
2 changes: 1 addition & 1 deletion configs/README_ZH.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,4 +30,4 @@ Demo 配置文件用于帮助用户快速熟悉 Data-Juicer 的基本功能,
我们已经重现了部分 BLOOM 数据集的处理流程,请参阅 [reproduced_bloom](reproduced_bloom) 文件夹以获取详细说明。

### Data-Juicer 菜谱
我们使用 Data-Juicer 更细致地处理了一些开源数据集(包含 SFT 数据集),并提供了处理流程的配置文件。请参阅 [data_juicer_recipes](data_juicer_recipes) 文件夹以获取详细说明。
我们使用 Data-Juicer 更细致地处理了一些开源数据集(包含 CFT 数据集),并提供了处理流程的配置文件。请参阅 [data_juicer_recipes](data_juicer_recipes) 文件夹以获取详细说明。
4 changes: 2 additions & 2 deletions configs/data_juicer_recipes/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,5 +33,5 @@ We use simple 3-σ rule to set the hyperparameters for ops in each recipe.

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|------------------|:-------------------------:|:--------------------------------------:|:----------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
| Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/SFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) | [39 Subsets of Alpaca-CoT](alpaca_cot/README.md#refined-alpaca-cot-dataset-meta-info) |
| Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/SFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) | [28 Subsets of Alpaca-CoT](alpaca_cot/README.md#refined-alpaca-cot-dataset-meta-info) |
| Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) | [39 Subsets of Alpaca-CoT](alpaca_cot/README.md#refined-alpaca-cot-dataset-meta-info) |
| Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) | [28 Subsets of Alpaca-CoT](alpaca_cot/README.md#refined-alpaca-cot-dataset-meta-info) |
4 changes: 2 additions & 2 deletions configs/data_juicer_recipes/README_ZH.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,5 +33,5 @@

| 数据子集 | 完善前的样本数目 | 完善后的样本数目 | 样本保留率 | 配置链接 | 数据链接 | 来源 |
|-------------------|:------------------------:|:----------------------------------:|:---------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------|
| Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/SFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) | [来自Alpaca-CoT的39个子集](alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
| Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/SFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) | [来自Alpaca-CoT的28个子集](alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
| Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) | [来自Alpaca-CoT的39个子集](alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
| Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) | [来自Alpaca-CoT的28个子集](alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
Loading

0 comments on commit 1cb9262

Please sign in to comment.