Skip to content

Commit

Permalink
* modify tool dir names to a detailed version
Browse files Browse the repository at this point in the history
  • Loading branch information
HYLcool committed Nov 10, 2023
1 parent d0e69d7 commit 672839d
Show file tree
Hide file tree
Showing 4 changed files with 26 additions and 26 deletions.
26 changes: 13 additions & 13 deletions tools/multimodal/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,20 +10,20 @@ provided several dataset format conversion tools for some popular multimodal
works.

These tools consist of two types:
- Other format to Data-Juicer format: These tools are in `ds2dj` directory. They help to convert datasets in other formats to target datasets in Data-Juicer format.
- Data-Juicer format to other format: These tools are in `dj2ds` directory. They help to convert datasets in Data-Juicer formats to target datasets in target format.
- Other format to Data-Juicer format: These tools are in `source_format_to_data_juicer_format` directory. They help to convert datasets in other formats to target datasets in Data-Juicer format.
- Data-Juicer format to other format: These tools are in `data_juicer_format_to_target_format` directory. They help to convert datasets in Data-Juicer formats to target datasets in target format.

For now, dataset formats that are supported by Data-Juicer are listed in the following table.

| Format | ds2dj | dj2ds | Ref. |
|------------|------------------------------------------------------------------------| --- |------------------------|
| LLaVA-like | `llava2dj.py` | `dj2llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
| Format | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. |
|------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
| LLaVA-like | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |

For all tools, you can run the following command to find out the usage of them:

```shell
# e.g. llava2dj.py
python tools/multimodal/ds2dj/llava2dj.py --help
# e.g. llava_to_dj.py
python tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py --help
```

Before using these tools, you might need to take a glance at the reference
Expand All @@ -46,12 +46,12 @@ and show how these variations influence the dataset format. The table below
shows the number of different samples between the original dataset and the
dataset after processing. There are 665,298 samples in the original dataset.

| process | # of diff. |
|---------------------------------------------------------------------------------------|-------------|
| 1. apply `llava2dj.py` and `dj2llava.py` | 113,501 |
| 2. convert integer ids to string ids in the original dataset | 41,361 |
| 3. strip whitespaces before and after values of conversations in the original dataset | 40,688 |
| 4. add `'model': ''` fields in the converted dataset | 1 |
| process | # of diff. |
|----------------------------------------------------------------------------------------|-------------|
| 1. apply `llava_to_dj.py` and `dj_to_llava.py` | 113,501 |
| 2. convert integer ids to string ids in the original dataset | 41,361 |
| 3. strip whitespaces before and after values of conversations in the original dataset | 40,688 |
| 4. add `'model': ''` fields in the converted dataset | 1 |

It's worth noticing that processes 2-4 won't influence the semantics of sample conversations in the dataset.
Thus we think the dataset after conversion can align with the original dataset.
Expand Down
26 changes: 13 additions & 13 deletions tools/multimodal/README_ZH.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,20 +7,20 @@
由于不同多模态数据集和工作之间的数据集格式差异较大,Data-Juicer 提出了一种新颖的多模态数据集中间格式,并为一些流行的多模态工作提供了若干数据集格式转换工具。

这些工具分为两种类型:
- 其他格式到 Data-Juicer 格式的转换:这些工具在 `ds2dj` 目录中。它们可以帮助将其他格式的数据集转换为 Data-Juicer 格式的目标数据集。
- Data-Juicer 格式到其他格式的转换:这些工具在 `dj2ds` 目录中。它们可以帮助将 Data-Juicer 格式的数据集转换为目标格式的数据集。
- 其他格式到 Data-Juicer 格式的转换:这些工具在 `source_format_to_data_juicer_format` 目录中。它们可以帮助将其他格式的数据集转换为 Data-Juicer 格式的目标数据集。
- Data-Juicer 格式到其他格式的转换:这些工具在 `data_juicer_format_to_target_format` 目录中。它们可以帮助将 Data-Juicer 格式的数据集转换为目标格式的数据集。

目前,Data-Juicer 支持的数据集格式在下面表格中列出。

| 格式 | ds2dj | dj2ds | 格式参考 |
|----------|------------------------------------------------------------------------| --- |----------------------------------------------------------------------------------------------------|
| 类LLaVA格式 | `llava2dj.py` | `dj2llava.py` | [格式描述](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
| 格式 | source_format_to_data_juicer_format | data_juicer_format_to_target_format | 格式参考 |
|-----------|-------------------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------|
| 类LLaVA格式 | `llava_to_dj.py` | `dj_to_llava.py` | [格式描述](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |

对于所有工具,您可以运行以下命令来了解它们的详细用法:

```shell
# 例如:llava2dj.py
python tools/multimodal/ds2dj/llava2dj.py --help
# 例如:llava_to_dj.py
python tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py --help
```
在使用这些工具之前,您可能需要查看上表中每个格式的参考资料,以更好地了解详细的格式信息,并理解每个工具的参数含义。

Expand All @@ -32,12 +32,12 @@ python tools/multimodal/ds2dj/llava2dj.py --help

这里我们以LLaVA的 [视觉指令微调数据集](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json) 为例,展示这些变化如何影响数据集的格式。下表显示了原始数据集和经过若干处理后数据集之间不同样本的数量。原始数据集中有665,298个样本。

| 处理过程 | 不同样本数目 |
|-------------------------------------|---------|
| 1. 运行 `llava2dj.py``dj2llava.py` | 113,501 |
| 2. 将源数据集的id字段由整型转为字符串类型 | 41,361 |
| 3. 将源数据集中对话的所有value字段前后的空格去除 | 40,688 |
| 4. 在转换后的数据集样本中添加 `'model': ''` 字段 | 1 |
| 处理过程 | 不同样本数目 |
|-------------------------------------------|---------|
| 1. 运行 `llava_to_dj.py``dj_to_llava.py` | 113,501 |
| 2. 将源数据集的id字段由整型转为字符串类型 | 41,361 |
| 3. 将源数据集中对话的所有value字段前后的空格去除 | 40,688 |
| 4. 在转换后的数据集样本中添加 `'model': ''` 字段 | 1 |

值得注意的是,处理过程 2-4 并不会影响数据集中样本对话的语义,因此我们可以认为数据集格式转换工具的转换结果能够对齐源数据集。

Expand Down
File renamed without changes.
File renamed without changes.

0 comments on commit 672839d

Please sign in to comment.