Enhance/mmc4 converting tools (#91)

* + Add two tools to convert MMC4-like dataset to Data-Juicer format and reverse.
* + Add docs for tools of MMC4
* * fix wrong format examples
modelscope · Nov 29, 2023 · 0da7086
1 parent 754720a

Showing 6 changed files with 589 additions and 7 deletions.
diff --git a/tools/multimodal/README.md b/tools/multimodal/README.md
@@ -18,6 +18,7 @@ For now, dataset formats that are supported by Data-Juicer are listed in the fol
 | Format     | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref.                                                                                                             |
 |------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
 | LLaVA-like | `llava_to_dj.py`                    | `dj_to_llava.py`                    | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
+| MMC4-like  | `mmc4_to_dj.py`                     | `dj_to_mmc4.py`                     | [Format Description](https://github.com/allenai/mmc4#documents)                                                  |
 | WavCaps-like  | `wavcaps_to_dj.py`                    | `dj_to_wavcaps.py`                    | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
 
 For all tools, you can run the following command to find out the usage of them:
@@ -93,6 +94,19 @@ and converted datasets, so we can regard this sample is aligned with the origina
 ]
 ```
 
+#### MMC4-like
+
+The format of MMC4-like datasets is defined [here](https://github.com/allenai/mmc4#documents). Besides `image_info` and `text_list`,
+which are used when converting them to Data-Juicer format, there is an important field, `similarity_matrix`. The similarity matrix
+has shape `len(image_info) x len(text_list)`, so it depends directly on the numbers of images and text sentences and on their
+order.
+
+However, when such datasets are processed with Data-Juicer, images or sentences might be removed from a sample by Filters, and they
+could be modified by some Mappers. Thus, after processing, the similarity matrix might no longer be aligned with `image_info` or `text_list`.
+Users who need this matrix in later stages should keep this in mind.
+
+Apart from these extra fields, the tools for MMC4 can perfectly convert MMC4-like datasets to Data-Juicer-format datasets and convert them back.
+
 ### WavCaps-like
 
 The [WavCaps](https://github.com/XinhaoMei/WavCaps#dataset) is composed of four sub-datasets: [FreeSound](https://freesound.org/), [BBC Sound Effects](https://sound-effects.bbcrewind.co.uk/), [SoundBible](https://soundbible.com/) and [AudioSet Strongly-labelled Subset](https://research.google.com/audioset/download_strong.html). Each sub-dataset has different fields. For example, the 'description' field is included in SoundBible, but does not exist in AudioSet. To ensure that the different sub-datasets can be properly merged after conversion, the union of all fields from the sub-datasets is used during the wavcaps_to_dj stage, and all fields are fully retained during the dj_to_wavcaps stage.
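
Note on the MMC4-like section above: the alignment caveat for `similarity_matrix` can be checked mechanically after processing. The following is only a minimal sketch. It assumes each sample is a plain dict carrying the MMC4 fields `image_info`, `text_list` and `similarity_matrix`; the helper name `similarity_matrix_is_aligned` is hypothetical and is not part of Data-Juicer or of this commit. A shape check like this can reveal removed images or sentences, but it cannot detect reordering or content changed by Mappers.

```python
from typing import Any, Dict, List


def similarity_matrix_is_aligned(sample: Dict[str, Any]) -> bool:
    """Return True if `similarity_matrix` still has the shape
    len(image_info) x len(text_list) for an MMC4-like sample.

    Hypothetical helper for illustration only.
    """
    matrix: List[List[float]] = sample.get("similarity_matrix", [])
    n_images = len(sample.get("image_info", []))
    n_texts = len(sample.get("text_list", []))
    # One row per image ...
    if len(matrix) != n_images:
        return False
    # ... and one column per text sentence.
    return all(len(row) == n_texts for row in matrix)


if __name__ == "__main__":
    sample = {
        "image_info": [{"image_name": "a.jpg"}, {"image_name": "b.jpg"}],
        "text_list": ["s0", "s1", "s2"],
        "similarity_matrix": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    }
    print(similarity_matrix_is_aligned(sample))  # True

    # Simulate a Filter removing one sentence without touching the matrix.
    sample["text_list"].pop(1)
    print(similarity_matrix_is_aligned(sample))  # False
```

If the check fails, the matrix would need to be recomputed or dropped before it is used downstream; which of the two is appropriate depends on the use case.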

diff --git a/tools/multimodal/README_ZH.md b/tools/multimodal/README_ZH.md
@@ -12,9 +12,10 @@
 
 For now, the dataset formats supported by Data-Juicer are listed in the following table.
 
-| Format      | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref.                                                                                                             |
-|-------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
-| LLaVA-like  | `llava_to_dj.py`                    | `dj_to_llava.py`                    | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
+| Format     | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref.                                                                                                             |
+|------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
+| LLaVA-like | `llava_to_dj.py`                    | `dj_to_llava.py`                    | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
+| MMC4-like  | `mmc4_to_dj.py`                     | `dj_to_mmc4.py`                     | [Format Description](https://github.com/allenai/mmc4#documents) |
 | WavCaps-like  | `wavcaps_to_dj.py`                    | `dj_to_wavcaps.py`                    | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
 
 For all tools, you can run the following command to see their detailed usage:
@@ -76,6 +77,14 @@ python tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py --hel
 ]
 ```
 
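+#### MMC4-like
+
+The format of MMC4-like datasets is defined [here](https://github.com/allenai/mmc4#documents). Besides `image_info` and `text_list`, which are used when converting to Data-Juicer format, there is an important field, `similarity_matrix`. The similarity matrix has shape `len(image_info) x len(text_list)`, which means it depends heavily on the numbers of images and text sentences and on their order.
+
+However, when these datasets are processed with Data-Juicer, images or sentences might be removed from a sample by Filter operators, and they might be modified by some Mapper operators. As a result, after processing, the similarity matrix might no longer be aligned with `image_info` or `text_list`. Users who need this matrix in later stages should be aware of this.
+
+Apart from these extra fields, the tools for the MMC4-like format can perfectly convert MMC4-like datasets to Data-Juicer-format datasets and convert them back.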
+
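 #### WavCaps-like
 The [WavCaps](https://github.com/XinhaoMei/WavCaps#dataset) dataset is composed of four sub-datasets: [FreeSound](https://freesound.org/), [BBC Sound Effects](https://sound-effects.bbcrewind.co.uk/), [SoundBible](https://soundbible.com/) and [AudioSet Strongly-labelled Subset](https://research.google.com/audioset/download_strong.html). Each sub-dataset has different fields. For example, the 'description' field is included in SoundBible but does not exist in AudioSet. To ensure that the different sub-datasets can be merged properly after conversion, the union of all sub-dataset fields is used in the wavcaps_to_dj stage, and all fields are fully retained in the dj_to_wavcaps stage.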
 ```json

diff --git a/tools/multimodal/data_juicer_format_to_target_format/dj_to_llava.py b/tools/multimodal/data_juicer_format_to_target_format/dj_to_llava.py
@@ -1,5 +1,5 @@
 # This tool is used to convert multimodal dataset in Data-Juicer format to a
-# target dataset in LLaVA format.
+# target dataset in LLaVA-like format.
 #
 # Corresponding Data-Juicer format:
 #   - multi-chunk interleaved image-text sequence
@@ -101,7 +101,7 @@ def main(
         extra argument original_llava_ds_path is required. When the processed
         and converted dataset will be used in another machine, it's better to
         set this argument to True. Default: False.
-    :param original_llava_ds_path: path to the original unprocessed llava
+    :param original_llava_ds_path: path to the original unprocessed LLaVA
         dataset, which is used to help to recover the relative image paths for
         better migration. Default: None.
     """