diff --git a/tools/multimodal/README.md b/tools/multimodal/README.md index b950ea08b..d4559f689 100644 --- a/tools/multimodal/README.md +++ b/tools/multimodal/README.md @@ -18,6 +18,7 @@ For now, dataset formats that are supported by Data-Juicer are listed in the fol | Format | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. | |------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------| | LLaVA-like | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) | +| MMC4-like | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) | | WavCaps-like | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) | For all tools, you can run the following command to find out the usage of them: @@ -93,6 +94,19 @@ and converted datasets, so we can regard this sample is aligned with the origina ] ``` +#### MMC4-like + +The format of MMC4-like datasets is defined [here](https://github.com/allenai/mmc4#documents). Besides `image_info` and `text_list`, +which are used when converting them to Data-Juicer format, there is another important field, `similarity_matrix`. The similarity matrix +is a matrix of shape `len(image_info) x len(text_list)`, which means it depends heavily on the numbers of images and text sentences and +on their order. + +However, when processing such datasets with Data-Juicer, images or sentences might be removed from a sample by Filters, and they could be +modified by some Mappers. Thus, after processing, this similarity matrix might no longer be aligned with `image_info` or `text_list`. +Users should be cautious about this point if they need this matrix in later usage.
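If the matrix is still needed after processing, one workaround is to record which images and sentences survive and slice the matrix to match. Below is a minimal sketch of such a repair step; `prune_similarity_matrix` is a hypothetical helper shown for illustration, not a function shipped with these tools:

```python
def prune_similarity_matrix(sample, kept_image_names, kept_text_indices):
    # Slice `similarity_matrix` so its rows/columns stay aligned with the
    # images and sentences that survived Data-Juicer processing.
    name_to_row = {
        info['image_name']: i for i, info in enumerate(sample['image_info'])
    }
    rows = [name_to_row[n] for n in kept_image_names if n in name_to_row]
    return [[sample['similarity_matrix'][r][c] for c in kept_text_indices]
            for r in rows]


sample = {
    'image_info': [{'image_name': 'a.jpg'}, {'image_name': 'b.jpg'}],
    'text_list': ['s0', 's1', 's2'],
    'similarity_matrix': [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
}
# keep only b.jpg and sentences 0 and 2 -> a 1 x 2 matrix
print(prune_similarity_matrix(sample, ['b.jpg'], [0, 2]))  # [[0.4, 0.6]]
```

The kept image names and sentence indices have to be collected by the user from the processed sample, since the tools themselves do not touch the matrix.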
+ +Apart from these extra fields, the tools for MMC4 can losslessly convert MMC4-like datasets to Data-Juicer-format datasets and convert them back. + ### WavCaps-like The [WavCaps](https://github.com/XinhaoMei/WavCaps#dataset) is composed of four sub-datasets: [FreeSound](https://freesound.org/), [BBC Sound Effects](https://sound-effects.bbcrewind.co.uk/),[SoundBible](https://soundbible.com/) and [AudioSet Strongly-labelled Subset](https://research.google.com/audioset/download_strong.html). Each sub-dataset has different fields. For example, the 'description' field is included in SoundBible, but does not exist in AudioSet. To ensure that the different sub-datasets can be properly merged after conversion, the union of all fields from the sub-datasets is used during the wavcaps_to_dj stage, and all fields are fully retained during the dj_to_wavcaps stage. diff --git a/tools/multimodal/README_ZH.md b/tools/multimodal/README_ZH.md index be63af955..55671e09b 100644 --- a/tools/multimodal/README_ZH.md +++ b/tools/multimodal/README_ZH.md @@ -12,9 +12,10 @@ 目前,Data-Juicer 支持的数据集格式在下面表格中列出。 -| 格式 | source_format_to_data_juicer_format | data_juicer_format_to_target_format | 格式参考 | |-----------|-------------------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------| -| 类LLaVA格式 | `llava_to_dj.py` | `dj_to_llava.py` | [格式描述](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) | +| 格式 | source_format_to_data_juicer_format | data_juicer_format_to_target_format | 格式参考 | +|----------|-------------------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------| +| 类LLaVA格式 | `llava_to_dj.py` | `dj_to_llava.py` | [格式描述](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) | +| 类MMC4格式 | `mmc4_to_dj.py` | 
`dj_to_mmc4.py` | [格式描述](https://github.com/allenai/mmc4#documents) | | 类WavCaps格式 | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [格式描述](https://github.com/XinhaoMei/WavCaps#table-of-contents) | 对于所有工具,您可以运行以下命令来了解它们的详细用法: @@ -76,6 +77,14 @@ python tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py --hel ] ``` +#### 类MMC4格式 + +类MMC4数据集的格式在 [这里](https://github.com/allenai/mmc4#documents) 定义。除了在转换为Data-Juicer格式时使用的`image_info`和`text_list`之外,还有一个重要的字段`similarity_matrix`,即相似度矩阵。相似度矩阵是一个形状为`len(image_info) x len(text_list)`的矩阵,这意味着它高度依赖于图像和文本句子的数量及其顺序。 + +然而,当使用Data-Juicer处理这些数据集时,图像或句子可能会被Filter算子从样本中移除,并且它们可能会被一些Mapper算子修改。因此,在处理后,这个相似度矩阵可能无法与`image_info`或`text_list`对齐。如果用户在后续使用中需要这个矩阵,那您应该注意到这一点。 + +除了这些额外字段外,针对类MMC4格式的工具可以完美地将类MMC4格式的数据集转换为Data-Juicer格式的数据集,并将它们转换回去~ + #### 类WavCaps格式 [WavCaps](https://github.com/XinhaoMei/WavCaps#dataset) 数据集由 [FreeSound](https://freesound.org/),[BBC Sound Effects](https://sound-effects.bbcrewind.co.uk/),[SoundBible](https://soundbible.com/),[AudioSet Strongly-labelled Subset](https://research.google.com/audioset/download_strong.html) 四个子数据集组成,每个数据集里都有不同的字段。例如SoundBible里包含了‘description’字段,而该字段在AudioSet里并不存在。为了保证不同子数据集在转换后能够正常合并,在wavcaps_to_dj阶段使用了所有子数据集字段的并集,并在dj_to_wavcaps阶段完整保留了所有字段。 ```json diff --git a/tools/multimodal/data_juicer_format_to_target_format/dj_to_llava.py b/tools/multimodal/data_juicer_format_to_target_format/dj_to_llava.py index c58a06604..cb9cf7a42 100644 --- a/tools/multimodal/data_juicer_format_to_target_format/dj_to_llava.py +++ b/tools/multimodal/data_juicer_format_to_target_format/dj_to_llava.py @@ -1,5 +1,5 @@ # This tool is used to convert multimodal dataset in Data-Juicer format to a -# target dataset in LLaVA format. +# target dataset in LLaVA-like format. # # Corresponding Data-Juicer format: # - multi-chunk interleaved image-text sequence @@ -101,7 +101,7 @@ def main( extra argument original_llava_ds_path is required. 
When the processed and converted dataset will be used in another machine, it's better to set this argument to True. Default: False. - :param original_llava_ds_path: path to the original unprocessed llava + :param original_llava_ds_path: path to the original unprocessed LLaVA dataset, which is used to help to recover the relative image paths for better migration. Default: None. """ diff --git a/tools/multimodal/data_juicer_format_to_target_format/dj_to_mmc4.py b/tools/multimodal/data_juicer_format_to_target_format/dj_to_mmc4.py new file mode 100644 index 000000000..aba76dfa3 --- /dev/null +++ b/tools/multimodal/data_juicer_format_to_target_format/dj_to_mmc4.py @@ -0,0 +1,304 @@ +# This tool is used to convert a multimodal dataset in Data-Juicer format to a +# target dataset in MMC4 format. Notice: if the similarity matrix is included +# in the dataset, it might not be restored to its original correlations and +# could end up with a wrong shape, because some images or text sentences +# might have been removed. So this tool does nothing to the similarity +# matrix. +# +# MMC4 in Data-Juicer format: +# - two new fields are added: +# - text: multi-chunk interleaved image-text sequence in one string. Each +# sentence in the original dataset is a chunk in this text string. +# - images: image paths list +# - other fields in the original format can be kept or not +# - in jsonl +# {'text': 'When you lock the door using the lock tab on the driver’s door, ' +# 'all of the other doors and tailgate lock at the same time. ' +# '<|__dj__eoc|> <__dj__image> Press the master door lock switch in ' +# 'as shown to lock or unlock all doors and the tailgate. ' +# '<|__dj__eoc|> <__dj__image> When you lock/unlock the driver’s ' +# 'door and tailgate using the master lock switch, all the other ' +# 'doors lock/ unlock at the same time. 
<|__dj__eoc|>', +# 'images': ['db1c21bc8474.jpg', 'b9040a0dbb22.jpg'], +# 'image_info': [{'face_detections': None, +# 'image_name': 'b9040a0dbb22.jpg', +# 'matched_sim': 0.27694183588027954, +# 'matched_text_index': 2, +# 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.90.jpg'}, # noqa: E501 +# {'face_detections': None, +# 'image_name': 'db1c21bc8474.jpg', +# 'matched_sim': 0.3234919607639313, +# 'matched_text_index': 1, +# 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.91.jpg'}], # noqa: E501 +# 'similarity_matrix': [[0.24363446235656738, +# 0.31758785247802734, +# 0.27694183588027954], +# [0.2233106791973114, +# 0.3234919607639313, +# 0.26118797063827515]], +# 'text_list': ['When you lock the door using the lock tab on the driver’s ' +# 'door, all of the other doors and tailgate lock at the same ' +# 'time.', +# 'Press the master door lock switch in as shown to lock or ' +# 'unlock all doors and the tailgate.', +# 'When you lock/unlock the driver’s door and tailgate using the ' # noqa: E501 +# 'master lock switch, all the other doors lock/ unlock at the ' +# 'same time.'], +# 'url': 'http://www.hfitinfo.com/hofi-48.html', +# 'could_have_url_duplicate': 0 } +# +# MMC4 format: +# - interleaved image-text sequence +# - in jsonl +# - extra information except "image_name", "matched_text_index", "text_list" +# will be included only if they are kept when converting the original +# MMC4 format to Data-Juicer format. 
(keep_other_fields is True) +# {'image_info': [{'face_detections': None, +# 'image_name': 'b9040a0dbb22.jpg', +# 'matched_sim': 0.27694183588027954, +# 'matched_text_index': 2, +# 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.90.jpg'}, # noqa: E501 +# {'face_detections': None, +# 'image_name': 'db1c21bc8474.jpg', +# 'matched_sim': 0.3234919607639313, +# 'matched_text_index': 1, +# 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.91.jpg'}], # noqa: E501 +# 'similarity_matrix': [[0.24363446235656738, +# 0.31758785247802734, +# 0.27694183588027954], +# [0.2233106791973114, +# 0.3234919607639313, +# 0.26118797063827515]], +# 'text_list': ['When you lock the door using the lock tab on the driver’s ' +# 'door, all of the other doors and tailgate lock at the same ' +# 'time.', +# 'Press the master door lock switch in as shown to lock or ' +# 'unlock all doors and the tailgate.', +# 'When you lock/unlock the driver’s door and tailgate using the ' # noqa: E501 +# 'master lock switch, all the other doors lock/ unlock at the ' +# 'same time.'], +# 'url': 'http://www.hfitinfo.com/hofi-48.html', +# 'could_have_url_duplicate': 0 } +# +# Reference: +# https://github.com/allenai/mmc4#documents + +import os +from copy import deepcopy + +import fire +import jsonlines as jl +from loguru import logger +from tqdm import tqdm + +from data_juicer.utils.mm_utils import SpecialTokens + + +@logger.catch +def main( + dj_ds_path: str, + target_mmc4_ds_path: str, + eoc_special_token: str = SpecialTokens.eoc, + image_special_token: str = SpecialTokens.image, + sent_seperator: str = ' ', + keep_dj_fields: bool = False, + convert_to_relative_paths: bool = False, + original_mmc4_ds_path: str = None, +): + """ + Convert a Data-Juicer-format dataset to an MMC4-like format. 
Notice: if + the similarity matrix is included in the dataset, it might not be restored + to its original correlations and could end up with a wrong shape, because + some images or text sentences might have been removed. So this tool does + nothing to the similarity matrix. + + :param dj_ds_path: path to the input dataset in Data-Juicer format. + :param target_mmc4_ds_path: path to store the converted dataset in MMC4 + format. + :param eoc_special_token: the special token for "end of a chunk". It's used + to split sentence chunks explicitly. Default: <|__dj__eoc|> (from + Data-Juicer). + :param image_special_token: the special token for images. It's used to + locate the images in the conversation. In typical MMC4-like datasets, + this special token is not specified. So we simply use the default image + special token from our Data-Juicer. Default: <__dj__image> (from + Data-Juicer). + :param sent_seperator: separator used to split different sentences. Default: " " + :param keep_dj_fields: whether to keep intermediate fields from + Data-Juicer, such as "images", "text", ... Default: False. + :param convert_to_relative_paths: whether to convert the image paths in this + dataset to relative paths to the original dataset. If it's True, an + extra argument original_mmc4_ds_path is required. When the processed + and converted dataset will be used in another machine, it's better to + set this argument to True. Default: False. + :param original_mmc4_ds_path: path to the original unprocessed MMC4 + dataset, which is used to help recover the relative image paths for + better migration. Default: None. + """ + # ----- Constant settings. Better not to change them. ----- + # default key of field to store the sample text + text_key = 'text' + # default key of field to store the image list + image_key = 'images' + # ----- Constant settings. Better not to change them. ----- + + # check arguments + # check paths + if not os.path.exists(dj_ds_path): + raise FileNotFoundError( + f'Input dataset [{dj_ds_path}] cannot be found.') + if not target_mmc4_ds_path.endswith('.jsonl'): + raise ValueError( + 'Only support "jsonl" target dataset file for MMC4 now.') + if os.path.dirname(target_mmc4_ds_path) \ + and not os.path.exists(os.path.dirname(target_mmc4_ds_path)): + logger.info( + f'Create directory [{os.path.dirname(target_mmc4_ds_path)}] for ' + f'the target dataset.') + os.makedirs(os.path.dirname(target_mmc4_ds_path)) + + # if convert_to_relative_paths is True, check if the original_mmc4_ds_path + # is provided as well. + if convert_to_relative_paths: + if not original_mmc4_ds_path: + raise ValueError('When convert_to_relative_paths is set to True, ' + 'the original_mmc4_ds_path must be provided ' + 'for recovering the relative paths. Please ' + 'check and retry.') + original_mmc4_ds_path = os.path.abspath(original_mmc4_ds_path) + # if provided original_mmc4_ds_path is the dataset file path, only + # keep the directory path. + if os.path.isfile(original_mmc4_ds_path): + original_mmc4_ds_path = os.path.dirname(original_mmc4_ds_path) + + # whether to keep dj fields + if keep_dj_fields: + logger.warning('You choose to keep intermediate fields added when ' + 'converting to Data-Juicer format, which are usually ' + 'useless in the final dataset but will increase the ' + 'size of the whole dataset file.') + + # load the processed Data-Juicer dataset + logger.info('Start converting the input dataset to MMC4 format...') + with jl.open(dj_ds_path, 'r') as reader: + with jl.open(target_mmc4_ds_path, 'w') as writer: + for line_num, sample in enumerate(tqdm(reader)): + text = sample[text_key] + images = sample[image_key] + + # skip empty samples + if len(text) == 0: + continue + + # keep only the image_infos whose images survived processing
+ image_infos = [] + ori_image_infos = [] + if 'image_info' in sample: + ori_image_infos = sample['image_info'] + + # Only keep those image_infos that are still contained by + # processed images. + for processed_img in images: + found = False + for img in ori_image_infos: + img_name = img['image_name'] + if processed_img.endswith(img_name): + found = True + # update to new image name + img['image_name'] = processed_img + image_infos.append(img) + break + if not found: + image_infos.append({ + 'image_name': processed_img, + }) + + # split text into a list of several sentences (chunks) + # remove empty chunks (e.g. the last chunk '' after eoc) + chunks = [ + sent.strip() for sent in text.split(eoc_special_token) + if sent.strip() + ] + + # construct text_list and update matched_text_index for the + # final image_infos + sentences = [] + curr_image_idx = 0 + for text_idx, sent in enumerate(chunks): + # remove possible sentence seperator + if sent.endswith(sent_seperator): + sent = sent[:-len(sent_seperator)].strip() + if sent.startswith(sent_seperator): + sent = sent[len(sent_seperator):].strip() + + # remove possible image_special_token and update + # matched_text_index for corresponding image_info + found_image = False + if sent.startswith(image_special_token): + sent = sent[len(image_special_token):].strip() + found_image = True + if sent.startswith(sent_seperator): + sent = sent[len(sent_seperator):].strip() + elif sent.endswith(image_special_token): + sent = sent[:-len(image_special_token)].strip() + found_image = True + if sent.endswith(sent_seperator): + sent = sent[:-len(sent_seperator)].strip() + sentences.append(sent) + if found_image: + if curr_image_idx < len(image_infos): + image_infos[curr_image_idx][ + 'matched_text_index'] = text_idx + curr_image_idx += 1 + else: + # if there are extra images, just skip them and + # report a warning + logger.warning(f'Sample with line number ' + f'[{line_num}] contains unaligned ' + f'numbers of images and image ' + 
f'tokens. Please check and retry ' + f'if needed.') + + # convert image_name to relative paths + if convert_to_relative_paths: + for idx in range(len(image_infos)): + img_name = image_infos[idx]['image_name'] + if img_name.startswith(original_mmc4_ds_path): + image_infos[idx]['image_name'] = os.path.relpath( + img_name, original_mmc4_ds_path) + else: + raise ValueError( + f'The original_mmc4_ds_path ' + f'[{original_mmc4_ds_path}] is not the ' + f'directory that contains the image ' + f'[{img_name}] in the sample of line number ' + f'[{line_num}]. Please check if the correct ' + f'original_mmc4_ds_path is provided or ' + f'something is wrong with this sample, and try ' + f'again later.') + + # reorder image_info to the same order as the original dataset + final_image_info = [] + for img in ori_image_infos: + img_name = img['image_name'] + for processed_img in image_infos: + processed_img_name = processed_img['image_name'] + if processed_img_name.endswith(img_name): + final_image_info.append(processed_img) + break + + # construct the new sample structure + new_sample = deepcopy(sample) + new_sample['image_info'] = final_image_info + new_sample['text_list'] = sentences + if not keep_dj_fields: + _ = new_sample.pop(image_key) + _ = new_sample.pop(text_key) + + writer.write(new_sample) + + logger.info(f'Store the target dataset into [{target_mmc4_ds_path}].') + + +if __name__ == '__main__': + fire.Fire(main) diff --git a/tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py b/tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py index 58007abf2..a911702f1 100644 --- a/tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py +++ b/tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py @@ -136,7 +136,7 @@ def main( if image_broadcast_pos not in ['random', 'before', 'after', 'follow']: raise ValueError(f'Arg image_broadcast_pos should be one of [' f'"random", "before", "after", "follow"], but ' - f'given [{image_broadcast_pos}]') + f'given [{image_broadcast_pos}].') # check if the default image special token is changed if image_special_token != '<image>': logger.warning('The image_special_token used in the original LLaVA ' @@ -254,7 +254,7 @@ def main( join_sep = sent_seperator if split_chunk: # split (human, robot) pairs into several chunks - join_sep = f' {eoc_special_token} ' + join_sep + join_sep = f' {eoc_special_token}' + join_sep text = join_sep.join(formatted_conversations) if add_eoc_at_last: # add an extra eoc token after the whole sample text diff --git a/tools/multimodal/source_format_to_data_juicer_format/mmc4_to_dj.py b/tools/multimodal/source_format_to_data_juicer_format/mmc4_to_dj.py new file mode 100644 index 000000000..139f62431 --- /dev/null +++ b/tools/multimodal/source_format_to_data_juicer_format/mmc4_to_dj.py @@ -0,0 +1,255 @@ +# This tool is used to convert a multimodal dataset in MMC4 format to a target +# dataset in Data-Juicer format. +# +# MMC4 format: +# - interleaved image-text sequence +# - in jsonl +# {'image_info': [{'face_detections': None, +# 'image_name': 'b9040a0dbb22.jpg', +# 'matched_sim': 0.27694183588027954, +# 'matched_text_index': 2, +# 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.90.jpg'}, # noqa: E501 +# {'face_detections': None, +# 'image_name': 'db1c21bc8474.jpg', +# 'matched_sim': 0.3234919607639313, +# 'matched_text_index': 1, +# 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.91.jpg'}], # noqa: E501 +# 'similarity_matrix': [[0.24363446235656738, +# 0.31758785247802734, +# 0.27694183588027954], +# [0.2233106791973114, +# 0.3234919607639313, +# 0.26118797063827515]], +# 'text_list': ['When you lock the door using the lock tab on the driver’s ' +# 'door, all of the other doors and tailgate lock at the same ' +# 'time.', +# 'Press the master door lock switch in as shown to lock or ' +# 'unlock all doors and the tailgate.', +# 'When you lock/unlock the driver’s door and tailgate using the ' # noqa:
E501 +# 'master lock switch, all the other doors lock/ unlock at the ' +# 'same time.'], +# 'url': 'http://www.hfitinfo.com/hofi-48.html', +# 'could_have_url_duplicate': 0 } +# +# Corresponding Data-Juicer format: +# - two new fields are added: +# - text: multi-chunk interleaved image-text sequence in one string. Each +# sentence in the original dataset is a chunk in this text string. +# - images: image paths list +# - other fields in the original format can be kept or not +# - in jsonl +# {'text': 'When you lock the door using the lock tab on the driver’s door, ' +# 'all of the other doors and tailgate lock at the same time. ' +# '<|__dj__eoc|> <__dj__image> Press the master door lock switch in ' +# 'as shown to lock or unlock all doors and the tailgate. ' +# '<|__dj__eoc|> <__dj__image> When you lock/unlock the driver’s ' +# 'door and tailgate using the master lock switch, all the other ' +# 'doors lock/ unlock at the same time. <|__dj__eoc|>', +# 'images': ['db1c21bc8474.jpg', 'b9040a0dbb22.jpg'], +# 'image_info': [{'face_detections': None, +# 'image_name': 'b9040a0dbb22.jpg', +# 'matched_sim': 0.27694183588027954, +# 'matched_text_index': 2, +# 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.90.jpg'}, # noqa: E501 +# {'face_detections': None, +# 'image_name': 'db1c21bc8474.jpg', +# 'matched_sim': 0.3234919607639313, +# 'matched_text_index': 1, +# 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.91.jpg'}], # noqa: E501 +# 'similarity_matrix': [[0.24363446235656738, +# 0.31758785247802734, +# 0.27694183588027954], +# [0.2233106791973114, +# 0.3234919607639313, +# 0.26118797063827515]], +# 'text_list': ['When you lock the door using the lock tab on the driver’s ' +# 'door, all of the other doors and tailgate lock at the same ' +# 'time.', +# 'Press the master door lock switch in as shown to lock or ' +# 'unlock all doors and the tailgate.', +# 'When you lock/unlock the driver’s door and tailgate using the ' # noqa: E501 +# 'master lock 
switch, all the other doors lock/ unlock at the ' +# 'same time.'], +# 'url': 'http://www.hfitinfo.com/hofi-48.html', +# 'could_have_url_duplicate': 0 } +# +# Reference: +# https://github.com/allenai/mmc4#documents + +import os +import random +from copy import deepcopy + +import fire +import jsonlines as jl +from loguru import logger +from tqdm import tqdm + +from data_juicer.utils.mm_utils import SpecialTokens + + +@logger.catch +def main( + mmc4_ds_path: str, + target_ds_path: str, + image_dir: str = None, + eoc_special_token: str = SpecialTokens.eoc, + image_special_token: str = SpecialTokens.image, + image_special_token_insert_pos: str = 'before', + add_eoc_at_last: bool = True, + sent_seperator: str = ' ', + keep_other_fields: bool = True, +): + """ + Convert a MMC4-like dataset to the Data-Juicer format. + + :param mmc4_ds_path: path to the input MMC4-like dataset. + :param target_ds_path: path to store the converted dataset in Data-Juicer + format. + :param image_dir: directory to store images. If it's None, it means the + "image_name" for each image includes this information already. Default: + None. + :param eoc_special_token: the special token for "end of a chunk". It's used + to split sentence chunks explicitly. Default: <|__dj__eoc|> (from + Data-Juicer). + :param image_special_token: the special token for images. It's used to + locate the images in the conversation. In typical MMC4-like datasets, + this special token is not specified. So we simply use the default image + special token from our Data-Juicer. Default: <__dj__image> (from + Data-Juicer). + :param image_special_token_insert_pos: the position in the sentence to + insert the corresponding image special token. Should be one of: [ + "before", "after", "random"]. Default: "before", which is aligned with + Flamingo format. + :param add_eoc_at_last: whether to add an extra eoc_special_token at the + end of text. Default: True. + :param sent_seperator: seperator to split different sentences. 
Default: " " + :param keep_other_fields: whether to keep other fields in the original + datasets. Default: True. + """ + # ----- Constant settings. Better not to change them. ----- + # default key of field to store the sample text + text_key = 'text' + # default key of field to store the image list + image_key = 'images' + # required fields in the original dataset + REQUIRED_FIELDS = {'image_info', 'text_list'} + # ----- Constant settings. Better not to change them. ----- + + # check arguments + # check paths + if not os.path.exists(mmc4_ds_path): + raise FileNotFoundError(f'Input MMC4 dataset [{mmc4_ds_path}] ' + f'cannot be found.') + if not target_ds_path.endswith('.jsonl'): + raise ValueError('Only support "jsonl" target dataset file now.') + if os.path.dirname(target_ds_path) \ + and not os.path.exists(os.path.dirname(target_ds_path)): + logger.info(f'Create directory [{os.path.dirname(target_ds_path)}] ' + f'for the target dataset.') + os.makedirs(os.path.dirname(target_ds_path)) + # check image dir + if not image_dir: + image_dir = '' + # check insert position + if image_special_token_insert_pos not in ['random', 'before', 'after']: + raise ValueError(f'Arg image_special_token_insert_pos should be one ' + f'of ["before", "after", "random"], but given ' + f'[{image_special_token_insert_pos}].') + # check whether to add the eoc special token at last + if not add_eoc_at_last: + logger.warning('You choose not to add the special eoc token at the ' + 'end, which might cause some compatibility problems ' + 'for other types of datasets (e.g. OpenFlamingo).') + if not keep_other_fields: + logger.warning('You choose not to keep other fields in the original ' + 'dataset. Thus some information might be lost in the ' + 'processed and converted-back dataset!') + + # load MMC4 dataset + logger.info('Start converting the original MMC4 dataset...') + # record the failed samples: (line_number, fail_reason_info) + failed_samples = [] + with jl.open(mmc4_ds_path, 'r') as reader: + with jl.open(target_ds_path, 'w') as writer: + for line_num, sample in enumerate(tqdm(reader)): + # check required fields + fields_ok = True + for key in REQUIRED_FIELDS: + if key not in sample: + failed_samples.append(( + line_num, + f'There is no key [{key}] in the sample whose line' + f' number is [{line_num}], which is required for ' + f'MMC4-like dataset conversion.')) + fields_ok = False + break + if not fields_ok: + continue + + new_sample = {} + if keep_other_fields: + # if other fields need to be kept, initialize the new + # sample with the original sample + new_sample = deepcopy(sample) + + # convert text_list and image_info to text and images + image_infos = sample['image_info'] + sentences = sample['text_list'] + + # sort image infos by their matched_text_index + image_infos.sort(key=lambda s: s['matched_text_index']) + + # get the image path list directly + images = [ + os.path.join(image_dir, s['image_name']) + for s in image_infos + ] + + # construct the text string in Data-Juicer format + img_idx = 0 + new_sents = [] + for sent_idx, sent in enumerate(sentences): + if img_idx < len(image_infos) and image_infos[img_idx][ + 'matched_text_index'] == sent_idx: + # found the sentence matched by the current image; + # insert an image_special_token at the specified + # position. + if image_special_token_insert_pos == 'before': + sent = image_special_token + sent_seperator + sent + elif image_special_token_insert_pos == 'after': + sent += sent_seperator + image_special_token + else: + if random.random() < 0.5: + # before + sent = image_special_token + sent_seperator \ + + sent + else: + # after + sent += sent_seperator + image_special_token + # move on to the next image + img_idx += 1 + new_sents.append(sent) + + join_sep = f' {eoc_special_token}{sent_seperator}' + text = join_sep.join(new_sents) + if add_eoc_at_last: + text += f' {eoc_special_token}' + + # construct the new sample + new_sample[image_key] = images + new_sample[text_key] = text + + writer.write(new_sample) + logger.info(f'Store the target dataset into [{target_ds_path}].') + if len(failed_samples) > 0: + failed_samples_path = target_ds_path + '_failed.txt' + logger.warning(f'{len(failed_samples)} samples failed to be converted; ' + f'their line numbers and failure reasons are stored in ' + f'[{failed_samples_path}].') + with open(failed_samples_path, 'w') as fout: + for line_num, reason in failed_samples: + fout.write(f'{line_num}\t{reason}\n') + + +if __name__ == '__main__': + fire.Fire(main)
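To make the text-construction rules of `mmc4_to_dj.py` concrete, the following sketch reproduces how the Data-Juicer `text` string is assembled from MMC4's `text_list` with the default "before" insert position, default special tokens, and default sentence separator; `to_dj_text` is an illustrative name, not a function exposed by these tools:

```python
EOC = '<|__dj__eoc|>'   # Data-Juicer end-of-chunk special token
IMG = '<__dj__image>'   # Data-Juicer image special token


def to_dj_text(text_list, matched_text_indices, sep=' '):
    # Prepend an image token to each matched sentence ("before" position),
    # join the chunks with the eoc token, and append a final eoc token.
    chunks = []
    for idx, sent in enumerate(text_list):
        if idx in matched_text_indices:
            sent = IMG + sep + sent
        chunks.append(sent)
    return f' {EOC}{sep}'.join(chunks) + f' {EOC}'


print(to_dj_text(['first sentence.', 'second sentence.'], [1]))
# first sentence. <|__dj__eoc|> <__dj__image> second sentence. <|__dj__eoc|>
```

Like the tool itself, this inserts at most one image token per matched sentence, since the main loop advances one image per sentence match.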