Enhance/mmc4 converting tools (#91)
* + Add two tools to convert MMC4-like dataset to Data-Juicer format and reverse.

* + Add docs for tools of MMC4

* * fix wrong format examples
HYLcool authored Nov 29, 2023
1 parent 754720a commit 0da7086
Showing 6 changed files with 589 additions and 7 deletions.
14 changes: 14 additions & 0 deletions tools/multimodal/README.md
@@ -18,6 +18,7 @@ For now, dataset formats that are supported by Data-Juicer are listed in the fol
| Format | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. |
|------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
| LLaVA-like | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
| MMC4-like | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) |
| WavCaps-like | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |

For all tools, you can run the following command to see their usage:
@@ -93,6 +94,19 @@ and converted datasets, so we can regard this sample is aligned with the origina
]
```

#### MMC4-like

The format of MMC4-like datasets is defined [here](https://github.com/allenai/mmc4#documents). Besides `image_info` and `text_list`,
which are used when converting them to Data-Juicer format, there is an important field, `similarity_matrix`. The similarity matrix has
shape `len(image_info) x len(text_list)`, so it depends heavily on the numbers of images and text sentences and on their
order.

However, when such datasets are processed with Data-Juicer, images or sentences might be removed from a sample by Filters, or
modified by some Mappers. Thus, after processing, the similarity matrix might no longer be aligned with `image_info` or `text_list`.
Be cautious about this if you need the matrix in later use.
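
If only removals happened, one possible way to keep the matrix usable is to record which images and sentences survived and keep the matching rows and columns. The following is a minimal sketch under that assumption; `realign_similarity_matrix` and the index lists are hypothetical names for illustration and are not part of the MMC4 conversion tools.

```python
from typing import List, Sequence


def realign_similarity_matrix(
    similarity_matrix: Sequence[Sequence[float]],
    kept_image_indices: Sequence[int],
    kept_text_indices: Sequence[int],
) -> List[List[float]]:
    """Keep only the rows (images) and columns (sentences) that survived.

    Hypothetical helper: the caller is assumed to know the original
    indices of the images and sentences that remain after processing.
    """
    return [
        [similarity_matrix[i][j] for j in kept_text_indices]
        for i in kept_image_indices
    ]


# A 2 x 3 matrix: len(image_info) == 2 rows, len(text_list) == 3 columns.
matrix = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]

# Suppose the sentence at index 1 was removed and both images were kept.
realigned = realign_similarity_matrix(matrix, [0, 1], [0, 2])
print(realigned)  # [[0.1, 0.3], [0.4, 0.6]]
```

Slicing only handles removed items. If a Mapper rewrote a sentence or changed an image, the stored similarity values for it are stale and would need to be recomputed.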

Apart from these extra fields, the tools for MMC4 can perfectly convert MMC4-like datasets to Data-Juicer-format datasets and convert them back.

### WavCaps-like

The [WavCaps](https://github.com/XinhaoMei/WavCaps#dataset) dataset is composed of four sub-datasets: [FreeSound](https://freesound.org/), [BBC Sound Effects](https://sound-effects.bbcrewind.co.uk/), [SoundBible](https://soundbible.com/) and [AudioSet Strongly-labelled Subset](https://research.google.com/audioset/download_strong.html). Each sub-dataset has different fields. For example, the 'description' field is included in SoundBible, but does not exist in AudioSet. To ensure that the different sub-datasets can be properly merged after conversion, the union of all fields from the sub-datasets is used during the wavcaps_to_dj stage, and all fields are fully retained during the dj_to_wavcaps stage.
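
The idea of taking the union of fields can be sketched as follows. This is an illustration only, not the code of `wavcaps_to_dj.py`: the helper `unify_fields`, the sample values, the `id` and `caption` field names, and the choice of `None` as the fill value are all assumptions.

```python
from typing import Any, Dict, List


def unify_fields(
    samples: List[Dict[str, Any]], fill_value: Any = None
) -> List[Dict[str, Any]]:
    """Give every sample the union of all field names seen in any sample."""
    # Collect field names in first-seen order so the output is deterministic.
    all_fields: List[str] = []
    for sample in samples:
        for field in sample:
            if field not in all_fields:
                all_fields.append(field)
    # Missing fields are filled with fill_value (an assumed convention).
    return [
        {field: sample.get(field, fill_value) for field in all_fields}
        for sample in samples
    ]


# Made-up samples: only the SoundBible one carries a 'description' field.
soundbible_sample = {
    "id": "sb_1",
    "caption": "A dog barks.",
    "description": "Dog barking outdoors",
}
audioset_sample = {"id": "as_1", "caption": "Rain falls on a roof."}

unified = unify_fields([soundbible_sample, audioset_sample])
print(unified[1])
```

After unification every sample has the same keys, so samples from different sub-datasets can be stored in one dataset with a single schema.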
15 changes: 12 additions & 3 deletions tools/multimodal/README_ZH.md
@@ -12,9 +12,10 @@

Currently, the dataset formats supported by Data-Juicer are listed in the table below.

| Format | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Format Reference |
|-----------|-------------------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------|
| LLaVA-like format | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
| Format | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Format Reference |
|----------|-------------------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------|
| LLaVA-like format | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
| MMC4-like format | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) |
| WavCaps-like format | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |

For all tools, you can run the following command to see their detailed usage:
@@ -76,6 +77,14 @@ python tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py --hel
]
```

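#### MMC4-like

The format of MMC4-like datasets is defined [here](https://github.com/allenai/mmc4#documents). Besides `image_info` and `text_list`, which are used when converting to Data-Juicer format, there is an important field, `similarity_matrix`. The similarity matrix has shape `len(image_info) x len(text_list)`, so it depends heavily on the numbers of images and text sentences and on their order.

However, when such datasets are processed with Data-Juicer, images or sentences might be removed from a sample by Filter operators, or modified by some Mapper operators. Thus, after processing, the similarity matrix might no longer be aligned with `image_info` or `text_list`. Keep this in mind if you need the matrix in later use.

Apart from these extra fields, the tools for the MMC4-like format can perfectly convert MMC4-like datasets to Data-Juicer-format datasets and convert them back.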

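#### WavCaps-like
The [WavCaps](https://github.com/XinhaoMei/WavCaps#dataset) dataset is composed of four sub-datasets: [FreeSound](https://freesound.org/), [BBC Sound Effects](https://sound-effects.bbcrewind.co.uk/), [SoundBible](https://soundbible.com/) and [AudioSet Strongly-labelled Subset](https://research.google.com/audioset/download_strong.html). Each sub-dataset has different fields. For example, the 'description' field is included in SoundBible but does not exist in AudioSet. To ensure that the different sub-datasets can be merged properly after conversion, the union of all fields from the sub-datasets is used during the wavcaps_to_dj stage, and all fields are fully retained during the dj_to_wavcaps stage.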
```json
@@ -1,5 +1,5 @@
# This tool is used to convert multimodal dataset in Data-Juicer format to a
# target dataset in LLaVA format.
# target dataset in LLaVA-like format.
#
# Corresponding Data-Juicer format:
# - multi-chunk interleaved image-text sequence
@@ -101,7 +101,7 @@ def main(
extra argument original_llava_ds_path is required. When the processed
and converted dataset will be used in another machine, it's better to
set this argument to True. Default: False.
:param original_llava_ds_path: path to the original unprocessed llava
:param original_llava_ds_path: path to the original unprocessed LLaVA
dataset, which is used to help to recover the relative image paths for
better migration. Default: None.
"""