diff --git a/tools/multimodal/README.md b/tools/multimodal/README.md index b950ea08b..d4559f689 100644 --- a/tools/multimodal/README.md +++ b/tools/multimodal/README.md @@ -18,6 +18,7 @@ For now, dataset formats that are supported by Data-Juicer are listed in the fol | Format | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. | |------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------| | LLaVA-like | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) | +| MMC4-like | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) | | WavCaps-like | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) | For all tools, you can run the following command to find out the usage of them: @@ -93,6 +94,19 @@ and converted datasets, so we can regard this sample is aligned with the origina ] ``` +#### MMC4-like + +The format of MMC4-like datasets is defined [here](https://github.com/allenai/mmc4#documents). Besides `image_info` and `text_list`, +which are used when converting them to Data-Juicer format, there is another important field, `similarity_matrix`. The similarity matrix +is a matrix of shape `len(image_info) x len(text_list)`, which means it depends heavily on the numbers of images and text sentences and +on their order. + +However, when processing such datasets with Data-Juicer, images or sentences might be removed from a sample by Filters, and they could be +modified by some Mappers. Thus, after processing, this similarity matrix might no longer be aligned with `image_info` or `text_list`. +Users should be cautious about this point if they need this matrix in later usage.
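If the matrix is still needed after processing, one workaround is to record which images and sentences survive and slice the matrix to match. Below is a minimal sketch of such a repair step; `prune_similarity_matrix` is a hypothetical helper shown for illustration, not a function shipped with these tools:

```python
def prune_similarity_matrix(sample, kept_image_names, kept_text_indices):
    # Slice `similarity_matrix` so its rows/columns stay aligned with the
    # images and sentences that survived Data-Juicer processing.
    name_to_row = {
        info['image_name']: i for i, info in enumerate(sample['image_info'])
    }
    rows = [name_to_row[n] for n in kept_image_names if n in name_to_row]
    return [[sample['similarity_matrix'][r][c] for c in kept_text_indices]
            for r in rows]


sample = {
    'image_info': [{'image_name': 'a.jpg'}, {'image_name': 'b.jpg'}],
    'text_list': ['s0', 's1', 's2'],
    'similarity_matrix': [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
}
# keep only b.jpg and sentences 0 and 2 -> a 1 x 2 matrix
print(prune_similarity_matrix(sample, ['b.jpg'], [0, 2]))  # [[0.4, 0.6]]
```

The kept image names and sentence indices have to be collected by the user from the processed sample, since the tools themselves do not touch the matrix.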
+ +Apart from these extra fields, the tools for MMC4 can losslessly convert MMC4-like datasets to Data-Juicer-format datasets and convert them back. + ### WavCaps-like The [WavCaps](https://github.com/XinhaoMei/WavCaps#dataset) is composed of four sub-datasets: [FreeSound](https://freesound.org/), [BBC Sound Effects](https://sound-effects.bbcrewind.co.uk/),[SoundBible](https://soundbible.com/) and [AudioSet Strongly-labelled Subset](https://research.google.com/audioset/download_strong.html). Each sub-dataset has different fields. For example, the 'description' field is included in SoundBible, but does not exist in AudioSet. To ensure that the different sub-datasets can be properly merged after conversion, the union of all fields from the sub-datasets is used during the wavcaps_to_dj stage, and all fields are fully retained during the dj_to_wavcaps stage. diff --git a/tools/multimodal/README_ZH.md b/tools/multimodal/README_ZH.md index be63af955..55671e09b 100644 --- a/tools/multimodal/README_ZH.md +++ b/tools/multimodal/README_ZH.md @@ -12,9 +12,10 @@ 目前,Data-Juicer 支持的数据集格式在下面表格中列出。 -| 格式 | source_format_to_data_juicer_format | data_juicer_format_to_target_format | 格式参考 | |-----------|-------------------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------| -| 类LLaVA格式 | `llava_to_dj.py` | `dj_to_llava.py` | [格式描述](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) | +| 格式 | source_format_to_data_juicer_format | data_juicer_format_to_target_format | 格式参考 | +|----------|-------------------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------| +| 类LLaVA格式 | `llava_to_dj.py` | `dj_to_llava.py` | [格式描述](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) | +| 类MMC4格式 | `mmc4_to_dj.py` | 
`dj_to_mmc4.py` | [格式描述](https://github.com/allenai/mmc4#documents) | | 类WavCaps格式 | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [格式描述](https://github.com/XinhaoMei/WavCaps#table-of-contents) | 对于所有工具,您可以运行以下命令来了解它们的详细用法: @@ -76,6 +77,14 @@ python tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py --hel ] ``` +#### 类MMC4格式 + +类MMC4数据集的格式在 [这里](https://github.com/allenai/mmc4#documents) 定义。除了在转换为Data-Juicer格式时使用的`image_info`和`text_list`之外,还有一个重要的字段`similarity_matrix`,即相似度矩阵。相似度矩阵是一个形状为`len(image_info) x len(text_list)`的矩阵,这意味着它高度依赖于图像和文本句子的数量及其顺序。 + +然而,当使用Data-Juicer处理这些数据集时,图像或句子可能会被Filter算子从样本中移除,并且它们可能会被一些Mapper算子修改。因此,在处理后,这个相似度矩阵可能无法与`image_info`或`text_list`对齐。如果用户在后续使用中需要这个矩阵,那您应该注意到这一点。 + +除了这些额外字段外,针对类MMC4格式的工具可以完美地将类MMC4格式的数据集转换为Data-Juicer格式的数据集,并将它们转换回去~ + #### 类WavCaps格式 [WavCaps](https://github.com/XinhaoMei/WavCaps#dataset) 数据集由 [FreeSound](https://freesound.org/),[BBC Sound Effects](https://sound-effects.bbcrewind.co.uk/),[SoundBible](https://soundbible.com/),[AudioSet Strongly-labelled Subset](https://research.google.com/audioset/download_strong.html) 四个子数据集组成,每个数据集里都有不同的字段。例如SoundBible里包含了‘description’字段,而该字段在AudioSet里并不存在。为了保证不同子数据集在转换后能够正常合并,在wavcaps_to_dj阶段使用了所有子数据集字段的并集,并在dj_to_wavcaps阶段完整保留了所有字段。 ```json diff --git a/tools/multimodal/data_juicer_format_to_target_format/dj_to_llava.py b/tools/multimodal/data_juicer_format_to_target_format/dj_to_llava.py index c58a06604..cb9cf7a42 100644 --- a/tools/multimodal/data_juicer_format_to_target_format/dj_to_llava.py +++ b/tools/multimodal/data_juicer_format_to_target_format/dj_to_llava.py @@ -1,5 +1,5 @@ # This tool is used to convert multimodal dataset in Data-Juicer format to a -# target dataset in LLaVA format. +# target dataset in LLaVA-like format. # # Corresponding Data-Juicer format: # - multi-chunk interleaved image-text sequence @@ -101,7 +101,7 @@ def main( extra argument original_llava_ds_path is required. 
When the processed and converted dataset will be used in another machine, it's better to set this argument to True. Default: False. - :param original_llava_ds_path: path to the original unprocessed llava + :param original_llava_ds_path: path to the original unprocessed LLaVA dataset, which is used to help to recover the relative image paths for better migration. Default: None. """ diff --git a/tools/multimodal/data_juicer_format_to_target_format/dj_to_mmc4.py b/tools/multimodal/data_juicer_format_to_target_format/dj_to_mmc4.py new file mode 100644 index 000000000..aba76dfa3 --- /dev/null +++ b/tools/multimodal/data_juicer_format_to_target_format/dj_to_mmc4.py @@ -0,0 +1,304 @@ +# This tool is used to convert a multimodal dataset in Data-Juicer format to a +# target dataset in MMC4 format. Notice: if the similarity matrix is included +# in the dataset, it might not be restored to its original correlations and +# could end up with a wrong shape, because some images or text sentences +# might have been removed. So this tool does nothing to the similarity +# matrix. +# +# MMC4 in Data-Juicer format: +# - two new fields are added: +# - text: multi-chunk interleaved image-text sequence in one string. Each +# sentence in the original dataset is a chunk in this text string. +# - images: image paths list +# - other fields in the original format can be kept or not +# - in jsonl +# {'text': 'When you lock the door using the lock tab on the driver’s door, ' +# 'all of the other doors and tailgate lock at the same time. ' +# '<|__dj__eoc|> <__dj__image> Press the master door lock switch in ' +# 'as shown to lock or unlock all doors and the tailgate. ' +# '<|__dj__eoc|> <__dj__image> When you lock/unlock the driver’s ' +# 'door and tailgate using the master lock switch, all the other ' +# 'doors lock/ unlock at the same time. 
<|__dj__eoc|>', +# 'images': ['db1c21bc8474.jpg', 'b9040a0dbb22.jpg'], +# 'image_info': [{'face_detections': None, +# 'image_name': 'b9040a0dbb22.jpg', +# 'matched_sim': 0.27694183588027954, +# 'matched_text_index': 2, +# 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.90.jpg'}, # noqa: E501 +# {'face_detections': None, +# 'image_name': 'db1c21bc8474.jpg', +# 'matched_sim': 0.3234919607639313, +# 'matched_text_index': 1, +# 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.91.jpg'}], # noqa: E501 +# 'similarity_matrix': [[0.24363446235656738, +# 0.31758785247802734, +# 0.27694183588027954], +# [0.2233106791973114, +# 0.3234919607639313, +# 0.26118797063827515]], +# 'text_list': ['When you lock the door using the lock tab on the driver’s ' +# 'door, all of the other doors and tailgate lock at the same ' +# 'time.', +# 'Press the master door lock switch in as shown to lock or ' +# 'unlock all doors and the tailgate.', +# 'When you lock/unlock the driver’s door and tailgate using the ' # noqa: E501 +# 'master lock switch, all the other doors lock/ unlock at the ' +# 'same time.'], +# 'url': 'http://www.hfitinfo.com/hofi-48.html', +# 'could_have_url_duplicate': 0 } +# +# MMC4 format: +# - interleaved image-text sequence +# - in jsonl +# - extra information except "image_name", "matched_text_index", "text_list" +# will be included only if they are kept when converting the original +# MMC4 format to Data-Juicer format. 
(keep_other_fields is True) +# {'image_info': [{'face_detections': None, +# 'image_name': 'b9040a0dbb22.jpg', +# 'matched_sim': 0.27694183588027954, +# 'matched_text_index': 2, +# 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.90.jpg'}, # noqa: E501 +# {'face_detections': None, +# 'image_name': 'db1c21bc8474.jpg', +# 'matched_sim': 0.3234919607639313, +# 'matched_text_index': 1, +# 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.91.jpg'}], # noqa: E501 +# 'similarity_matrix': [[0.24363446235656738, +# 0.31758785247802734, +# 0.27694183588027954], +# [0.2233106791973114, +# 0.3234919607639313, +# 0.26118797063827515]], +# 'text_list': ['When you lock the door using the lock tab on the driver’s ' +# 'door, all of the other doors and tailgate lock at the same ' +# 'time.', +# 'Press the master door lock switch in as shown to lock or ' +# 'unlock all doors and the tailgate.', +# 'When you lock/unlock the driver’s door and tailgate using the ' # noqa: E501 +# 'master lock switch, all the other doors lock/ unlock at the ' +# 'same time.'], +# 'url': 'http://www.hfitinfo.com/hofi-48.html', +# 'could_have_url_duplicate': 0 } +# +# Reference: +# https://github.com/allenai/mmc4#documents + +import os +from copy import deepcopy + +import fire +import jsonlines as jl +from loguru import logger +from tqdm import tqdm + +from data_juicer.utils.mm_utils import SpecialTokens + + +@logger.catch +def main( + dj_ds_path: str, + target_mmc4_ds_path: str, + eoc_special_token: str = SpecialTokens.eoc, + image_special_token: str = SpecialTokens.image, + sent_seperator: str = ' ', + keep_dj_fields: bool = False, + convert_to_relative_paths: bool = False, + original_mmc4_ds_path: str = None, +): + """ + Convert a Data-Juicer-format dataset to an MMC4-like format. 
Notice: if + the similarity matrix is included in the dataset, it might not be restored + to its original correlations and could end up with a wrong shape, because + some images or text sentences might have been removed. So this tool does + nothing to the similarity matrix. + + :param dj_ds_path: path to the input dataset in Data-Juicer format. + :param target_mmc4_ds_path: path to store the converted dataset in MMC4 + format. + :param eoc_special_token: the special token for "end of a chunk". It's used + to split sentence chunks explicitly. Default: <|__dj__eoc|> (from + Data-Juicer). + :param image_special_token: the special token for images. It's used to + locate the images in the conversation. In typical MMC4-like datasets, + this special token is not specified. So we simply use the default image + special token from our Data-Juicer. Default: <__dj__image> (from + Data-Juicer). + :param sent_seperator: separator used to split different sentences. Default: " " + :param keep_dj_fields: whether to keep intermediate fields from + Data-Juicer, such as "images", "text", ... Default: False. + :param convert_to_relative_paths: whether to convert the image paths in this + dataset to relative paths to the original dataset. If it's True, an + extra argument original_mmc4_ds_path is required. When the processed + and converted dataset will be used in another machine, it's better to + set this argument to True. Default: False. + :param original_mmc4_ds_path: path to the original unprocessed MMC4 + dataset, which is used to help recover the relative image paths for + better migration. Default: None. + """ + # ----- Constant settings. Better not to change them. ----- + # default key of field to store the sample text + text_key = 'text' + # default key of field to store the image list + image_key = 'images' + # ----- Constant settings. Better not to change them. ----- + + # check arguments + # check paths + if not os.path.exists(dj_ds_path): + raise FileNotFoundError( + f'Input dataset [{dj_ds_path}] cannot be found.') + if not target_mmc4_ds_path.endswith('.jsonl'): + raise ValueError( + 'Only support "jsonl" target dataset file for MMC4 now.') + if os.path.dirname(target_mmc4_ds_path) \ + and not os.path.exists(os.path.dirname(target_mmc4_ds_path)): + logger.info( + f'Create directory [{os.path.dirname(target_mmc4_ds_path)}] for ' + f'the target dataset.') + os.makedirs(os.path.dirname(target_mmc4_ds_path)) + + # if convert_to_relative_paths is True, check if the original_mmc4_ds_path + # is provided as well. + if convert_to_relative_paths: + if not original_mmc4_ds_path: + raise ValueError('When convert_to_relative_paths is set to True, ' + 'the original_mmc4_ds_path must be provided ' + 'for recovering the relative paths. Please ' + 'check and retry.') + original_mmc4_ds_path = os.path.abspath(original_mmc4_ds_path) + # if provided original_mmc4_ds_path is the dataset file path, only + # keep the directory path. + if os.path.isfile(original_mmc4_ds_path): + original_mmc4_ds_path = os.path.dirname(original_mmc4_ds_path) + + # whether to keep dj fields + if keep_dj_fields: + logger.warning('You choose to keep intermediate fields added when ' + 'converting to Data-Juicer format, which are usually ' + 'useless in the final dataset but will increase the ' + 'size of the whole dataset file.') + + # load the processed Data-Juicer dataset + logger.info('Start converting the input dataset to MMC4 format...') + with jl.open(dj_ds_path, 'r') as reader: + with jl.open(target_mmc4_ds_path, 'w') as writer: + for line_num, sample in enumerate(tqdm(reader)): + text = sample[text_key] + images = sample[image_key] + + # skip empty samples + if len(text) == 0: + continue + + # keep only the image_infos whose images survived processing
+ image_infos = [] + ori_image_infos = [] + if 'image_info' in sample: + ori_image_infos = sample['image_info'] + + # Only keep those image_infos that are still contained by + # processed images. + for processed_img in images: + found = False + for img in ori_image_infos: + img_name = img['image_name'] + if processed_img.endswith(img_name): + found = True + # update to new image name + img['image_name'] = processed_img + image_infos.append(img) + break + if not found: + image_infos.append({ + 'image_name': processed_img, + }) + + # split text into a list of several sentences (chunks) + # remove empty chunks (e.g. the last chunk '' after eoc) + chunks = [ + sent.strip() for sent in text.split(eoc_special_token) + if sent.strip() + ] + + # construct text_list and update matched_text_index for the + # final image_infos + sentences = [] + curr_image_idx = 0 + for text_idx, sent in enumerate(chunks): + # remove possible sentence seperator + if sent.endswith(sent_seperator): + sent = sent[:-len(sent_seperator)].strip() + if sent.startswith(sent_seperator): + sent = sent[len(sent_seperator):].strip() + + # remove possible image_special_token and update + # matched_text_index for corresponding image_info + found_image = False + if sent.startswith(image_special_token): + sent = sent[len(image_special_token):].strip() + found_image = True + if sent.startswith(sent_seperator): + sent = sent[len(sent_seperator):].strip() + elif sent.endswith(image_special_token): + sent = sent[:-len(image_special_token)].strip() + found_image = True + if sent.endswith(sent_seperator): + sent = sent[:-len(sent_seperator)].strip() + sentences.append(sent) + if found_image: + if curr_image_idx < len(image_infos): + image_infos[curr_image_idx][ + 'matched_text_index'] = text_idx + curr_image_idx += 1 + else: + # if there are extra images, just skip them and + # report a warning + logger.warning(f'Sample with line number ' + f'[{line_num}] contains unaligned ' + f'numbers of images and image ' + 
f'tokens. Please check and retry ' + f'if needed.') + + # convert image_name to relative paths + if convert_to_relative_paths: + for idx in range(len(image_infos)): + img_name = image_infos[idx]['image_name'] + if img_name.startswith(original_mmc4_ds_path): + image_infos[idx]['image_name'] = os.path.relpath( + img_name, original_mmc4_ds_path) + else: + raise ValueError( + f'The original_mmc4_ds_path ' + f'[{original_mmc4_ds_path}] is not the ' + f'directory that contains the image ' + f'[{img_name}] in the sample of line number ' + f'[{line_num}]. Please check if the correct ' + f'original_mmc4_ds_path is provided or ' + f'something is wrong with this sample, and try ' + f'again later.') + + # reorder image_info to the same order as the original dataset + final_image_info = [] + for img in ori_image_infos: + img_name = img['image_name'] + for processed_img in image_infos: + processed_img_name = processed_img['image_name'] + if processed_img_name.endswith(img_name): + final_image_info.append(processed_img) + break + + # construct the new sample structure + new_sample = deepcopy(sample) + new_sample['image_info'] = final_image_info + new_sample['text_list'] = sentences + if not keep_dj_fields: + _ = new_sample.pop(image_key) + _ = new_sample.pop(text_key) + + writer.write(new_sample) + + logger.info(f'Store the target dataset into [{target_mmc4_ds_path}].') + + +if __name__ == '__main__': + fire.Fire(main) diff --git a/tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py b/tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py index 58007abf2..a911702f1 100644 --- a/tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py +++ b/tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py @@ -136,7 +136,7 @@ def main( if image_broadcast_pos not in ['random', 'before', 'after', 'follow']: raise ValueError(f'Arg image_broadcast_pos should be one of [' f'"random", "before", "after", "follow"], but ' - f'given [{image_broadcast_pos}]') + f'given [{image_broadcast_pos}].') # check if the default image special token is changed if image_special_token != '<image>': logger.warning('The image_special_token used in the original LLaVA ' @@ -254,7 +254,7 @@ def main( join_sep = sent_seperator if split_chunk: # split (human, robot) pairs into several chunks - join_sep = f' {eoc_special_token} ' + join_sep + join_sep = f' {eoc_special_token}' + join_sep text = join_sep.join(formatted_conversations) if add_eoc_at_last: # add an extra eoc token after the whole sample text diff --git a/tools/multimodal/source_format_to_data_juicer_format/mmc4_to_dj.py b/tools/multimodal/source_format_to_data_juicer_format/mmc4_to_dj.py new file mode 100644 index 000000000..139f62431 --- /dev/null +++ b/tools/multimodal/source_format_to_data_juicer_format/mmc4_to_dj.py @@ -0,0 +1,255 @@ +# This tool is used to convert a multimodal dataset in MMC4 format to a target +# dataset in Data-Juicer format. +# +# MMC4 format: +# - interleaved image-text sequence +# - in jsonl +# {'image_info': [{'face_detections': None, +# 'image_name': 'b9040a0dbb22.jpg', +# 'matched_sim': 0.27694183588027954, +# 'matched_text_index': 2, +# 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.90.jpg'}, # noqa: E501 +# {'face_detections': None, +# 'image_name': 'db1c21bc8474.jpg', +# 'matched_sim': 0.3234919607639313, +# 'matched_text_index': 1, +# 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.91.jpg'}], # noqa: E501 +# 'similarity_matrix': [[0.24363446235656738, +# 0.31758785247802734, +# 0.27694183588027954], +# [0.2233106791973114, +# 0.3234919607639313, +# 0.26118797063827515]], +# 'text_list': ['When you lock the door using the lock tab on the driver’s ' +# 'door, all of the other doors and tailgate lock at the same ' +# 'time.', +# 'Press the master door lock switch in as shown to lock or ' +# 'unlock all doors and the tailgate.', +# 'When you lock/unlock the driver’s door and tailgate using the ' # noqa:
E501 +# 'master lock switch, all the other doors lock/ unlock at the ' +# 'same time.'], +# 'url': 'http://www.hfitinfo.com/hofi-48.html', +# 'could_have_url_duplicate': 0 } +# +# Corresponding Data-Juicer format: +# - two new fields are added: +# - text: multi-chunk interleaved image-text sequence in one string. Each +# sentence in the original dataset is a chunk in this text string. +# - images: image paths list +# - other fields in the original format can be kept or not +# - in jsonl +# {'text': 'When you lock the door using the lock tab on the driver’s door, ' +# 'all of the other doors and tailgate lock at the same time. ' +# '<|__dj__eoc|> <__dj__image> Press the master door lock switch in ' +# 'as shown to lock or unlock all doors and the tailgate. ' +# '<|__dj__eoc|> <__dj__image> When you lock/unlock the driver’s ' +# 'door and tailgate using the master lock switch, all the other ' +# 'doors lock/ unlock at the same time. <|__dj__eoc|>', +# 'images': ['db1c21bc8474.jpg', 'b9040a0dbb22.jpg'], +# 'image_info': [{'face_detections': None, +# 'image_name': 'b9040a0dbb22.jpg', +# 'matched_sim': 0.27694183588027954, +# 'matched_text_index': 2, +# 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.90.jpg'}, # noqa: E501 +# {'face_detections': None, +# 'image_name': 'db1c21bc8474.jpg', +# 'matched_sim': 0.3234919607639313, +# 'matched_text_index': 1, +# 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.91.jpg'}], # noqa: E501 +# 'similarity_matrix': [[0.24363446235656738, +# 0.31758785247802734, +# 0.27694183588027954], +# [0.2233106791973114, +# 0.3234919607639313, +# 0.26118797063827515]], +# 'text_list': ['When you lock the door using the lock tab on the driver’s ' +# 'door, all of the other doors and tailgate lock at the same ' +# 'time.', +# 'Press the master door lock switch in as shown to lock or ' +# 'unlock all doors and the tailgate.', +# 'When you lock/unlock the driver’s door and tailgate using the ' # noqa: E501 +# 'master lock 
switch, all the other doors lock/ unlock at the ' +# 'same time.'], +# 'url': 'http://www.hfitinfo.com/hofi-48.html', +# 'could_have_url_duplicate': 0 } +# +# Reference: +# https://github.com/allenai/mmc4#documents + +import os +import random +from copy import deepcopy + +import fire +import jsonlines as jl +from loguru import logger +from tqdm import tqdm + +from data_juicer.utils.mm_utils import SpecialTokens + + +@logger.catch +def main( + mmc4_ds_path: str, + target_ds_path: str, + image_dir: str = None, + eoc_special_token: str = SpecialTokens.eoc, + image_special_token: str = SpecialTokens.image, + image_special_token_insert_pos: str = 'before', + add_eoc_at_last: bool = True, + sent_seperator: str = ' ', + keep_other_fields: bool = True, +): + """ + Convert a MMC4-like dataset to the Data-Juicer format. + + :param mmc4_ds_path: path to the input MMC4-like dataset. + :param target_ds_path: path to store the converted dataset in Data-Juicer + format. + :param image_dir: directory to store images. If it's None, it means the + "image_name" for each image includes this information already. Default: + None. + :param eoc_special_token: the special token for "end of a chunk". It's used + to split sentence chunks explicitly. Default: <|__dj__eoc|> (from + Data-Juicer). + :param image_special_token: the special token for images. It's used to + locate the images in the conversation. In typical MMC4-like datasets, + this special token is not specified. So we simply use the default image + special token from our Data-Juicer. Default: <__dj__image> (from + Data-Juicer). + :param image_special_token_insert_pos: the position in the sentence to + insert the corresponding image special token. Should be one of: [ + "before", "after", "random"]. Default: "before", which is aligned with + Flamingo format. + :param add_eoc_at_last: whether to add an extra eoc_special_token at the + end of text. Default: True. + :param sent_seperator: seperator to split different sentences. 
Default: " " + :param keep_other_fields: whether to keep other fields in the original + datasets. Default: True. + """ + # ----- Constant settings. Better not to change them. ----- + # default key of field to store the sample text + text_key = 'text' + # default key of field to store the image list + image_key = 'images' + # required fields in the original dataset + REQUIRED_FIELDS = {'image_info', 'text_list'} + # ----- Constant settings. Better not to change them. ----- + + # check arguments + # check paths + if not os.path.exists(mmc4_ds_path): + raise FileNotFoundError(f'Input MMC4 dataset [{mmc4_ds_path}] ' + f'cannot be found.') + if not target_ds_path.endswith('.jsonl'): + raise ValueError('Only support "jsonl" target dataset file now.') + if os.path.dirname(target_ds_path) \ + and not os.path.exists(os.path.dirname(target_ds_path)): + logger.info(f'Create directory [{os.path.dirname(target_ds_path)}] ' + f'for the target dataset.') + os.makedirs(os.path.dirname(target_ds_path)) + # check image dir + if not image_dir: + image_dir = '' + # check insert position + if image_special_token_insert_pos not in ['random', 'before', 'after']: + raise ValueError(f'Arg image_special_token_insert_pos should be one ' + f'of ["before", "after", "random"], but given ' + f'[{image_special_token_insert_pos}].') + # check whether to add the eoc special token at last + if not add_eoc_at_last: + logger.warning('You choose not to add the special eoc token at the ' + 'end, which might cause some compatibility problems ' + 'for other types of datasets (e.g. OpenFlamingo).') + if not keep_other_fields: + logger.warning('You choose not to keep other fields in the original ' + 'dataset. Thus some information might be lost in the ' + 'processed and converted-back dataset!') + + # load MMC4 dataset + logger.info('Start converting the original MMC4 dataset...') + # record the failed samples: (line_number, fail_reason_info) + failed_samples = [] + with jl.open(mmc4_ds_path, 'r') as reader: + with jl.open(target_ds_path, 'w') as writer: + for line_num, sample in enumerate(tqdm(reader)): + # check required fields + fields_ok = True + for key in REQUIRED_FIELDS: + if key not in sample: + failed_samples.append(( + line_num, + f'There is no key [{key}] in the sample whose line' + f' number is [{line_num}], which is required for ' + f'MMC4-like dataset conversion.')) + fields_ok = False + break + if not fields_ok: + continue + + new_sample = {} + if keep_other_fields: + # if other fields need to be kept, initialize the new + # sample with the original sample + new_sample = deepcopy(sample) + + # convert text_list and image_info to text and images + image_infos = sample['image_info'] + sentences = sample['text_list'] + + # sort image infos by their matched_text_index + image_infos.sort(key=lambda s: s['matched_text_index']) + + # get the image path list directly + images = [ + os.path.join(image_dir, s['image_name']) + for s in image_infos + ] + + # construct the text string in Data-Juicer format + img_idx = 0 + new_sents = [] + for sent_idx, sent in enumerate(sentences): + if img_idx < len(image_infos) and image_infos[img_idx][ + 'matched_text_index'] == sent_idx: + # found the sentence matched by the current image; + # insert an image_special_token at the specified + # position. + if image_special_token_insert_pos == 'before': + sent = image_special_token + sent_seperator + sent + elif image_special_token_insert_pos == 'after': + sent += sent_seperator + image_special_token + else: + if random.random() < 0.5: + # before + sent = image_special_token + sent_seperator \ + + sent + else: + # after + sent += sent_seperator + image_special_token + # move on to the next image + img_idx += 1 + new_sents.append(sent) + + join_sep = f' {eoc_special_token}{sent_seperator}' + text = join_sep.join(new_sents) + if add_eoc_at_last: + text += f' {eoc_special_token}' + + # construct the new sample + new_sample[image_key] = images + new_sample[text_key] = text + + writer.write(new_sample) + logger.info(f'Store the target dataset into [{target_ds_path}].') + if len(failed_samples) > 0: + failed_samples_path = target_ds_path + '_failed.txt' + logger.warning(f'{len(failed_samples)} samples failed to be converted; ' + f'their line numbers and failure reasons are stored in ' + f'[{failed_samples_path}].') + with open(failed_samples_path, 'w') as fout: + for line_num, reason in failed_samples: + fout.write(f'{line_num}\t{reason}\n') + + +if __name__ == '__main__': + fire.Fire(main)
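To make the text-construction rules of `mmc4_to_dj.py` concrete, the following sketch reproduces how the Data-Juicer `text` string is assembled from MMC4's `text_list` with the default "before" insert position, default special tokens, and default sentence separator; `to_dj_text` is an illustrative name, not a function exposed by these tools:

```python
EOC = '<|__dj__eoc|>'   # Data-Juicer end-of-chunk special token
IMG = '<__dj__image>'   # Data-Juicer image special token


def to_dj_text(text_list, matched_text_indices, sep=' '):
    # Prepend an image token to each matched sentence ("before" position),
    # join the chunks with the eoc token, and append a final eoc token.
    chunks = []
    for idx, sent in enumerate(text_list):
        if idx in matched_text_indices:
            sent = IMG + sep + sent
        chunks.append(sent)
    return f' {EOC}{sep}'.join(chunks) + f' {EOC}'


print(to_dj_text(['first sentence.', 'second sentence.'], [1]))
# first sentence. <|__dj__eoc|> <__dj__image> second sentence. <|__dj__eoc|>
```

Like the tool itself, this inserts at most one image token per matched sentence, since the main loop advances one image per sentence match.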