New Features

🚀 We introduce DJ-SORA to provide open large-scale, high-quality datasets for SORA-like models. #227
🚀 We introduce hundreds of dedicated video, image, audio, text, and other multi-modal data processing operators and tools.
💥 Our paper has been accepted by SIGMOD'24 industrial track! #211
💥 "BetterMixture": our second data-centric LLM competition has kicked off and will end soon. #174

New OPs

video_frames_text_similarity_filter: keeps samples whose similarities between sampled video frame images and text are within a specific range. #227
video_tagging_from_frames_mapper: generates video tags from frames extracted from the video. #227
video_tagging_from_audio_mapper: generates video tags from audio streams extracted from videos. #227
video_captioning_from_video_mapper: generates captions from frame images extracted from videos to augment datasets. #227
video_captioning_from_audio_mapper: captions a video according to its audio streams. #227
image_captioning_mapper: generates captions based on a language model and the image. This OP will increase the number of samples in the dataset. #131 #191 #227
image_captioning_from_gpt4v_mapper: generates captions based on GPT-4-Vision and the image. This OP will increase the number of samples in the dataset. #214 #227
image_diffusion_mapper: generates and augments the images based on the Stable Diffusion model and their original images and texts. This OP will increase the number of samples in the dataset. #200

video_duration_filter: keeps samples whose videos' durations are within a specified range. #227
video_aspect_ratio_filter: filters samples according to the aspect ratios of the videos in them (the ratio of width to height, r = w/h). #227
video_resolution_filter: filters samples according to the resolution of videos in them. #227
video_ocr_area_ratio_filter: keeps samples whose detected text area ratios for specified frames in the video are within a specified range. #227
video_aesthetics_filter: filters samples according to the aesthetics score of frame images extracted from videos. #227
video_motion_score_filter: keeps samples with video motion scores within a specific range. #227
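
The range-based video filters above follow one pattern: compute a statistic per video, then keep the sample only if that statistic falls inside a configured range. Below is a minimal Python sketch of this pattern for the aspect ratio r = w/h. It is an illustration only, not the Data-Juicer implementation: the function names, the `require_all` switch, and the choice to keep samples without videos are all assumptions.

```python
from fractions import Fraction


def aspect_ratio(width: int, height: int) -> Fraction:
    """Return r = w / h as an exact fraction."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return Fraction(width, height)


def keep_sample(video_sizes, min_ratio, max_ratio, require_all=True):
    """Decide whether a sample passes a range filter on aspect ratio.

    video_sizes: list of (width, height) pairs, one per video in the sample.
    require_all: hypothetical switch. True keeps the sample only if every
    video is in range; False keeps it if at least one video is in range.
    """
    in_range = [
        min_ratio <= aspect_ratio(w, h) <= max_ratio for w, h in video_sizes
    ]
    if not in_range:
        return True  # assumption: a sample without videos is kept
    return all(in_range) if require_all else any(in_range)
```

The same structure would apply to duration, resolution, or motion score by swapping the statistic that is computed per video.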

video_split_by_scene_mapper: splits videos into scene clips. #227
video_split_by_duration_mapper: splits videos by specified duration interval. #227
video_split_by_key_frame_mapper: splits videos by their keyframes. #227
video_resize_aspect_ratio_mapper: resizes videos so that their aspect ratios (the ratio of width to height, r = w/h) fall within a specified range. #227
video_resize_resolution_mapper: maps videos to ones with a given resolution range. #227
video_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to video data more conveniently. #227
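
To illustrate what splitting a video by a fixed duration interval involves, here is a minimal Python sketch that computes only the clip boundaries. It makes no claim about how video_split_by_duration_mapper is implemented: `split_segments` and its `min_last` parameter are hypothetical, and the actual cutting of the video stream is left out.

```python
def split_segments(total_duration, split_duration, min_last=0.0):
    """Return (start, end) pairs, in seconds, covering [0, total_duration).

    min_last: hypothetical threshold. A trailing clip shorter than this
    is dropped, unless it is the only clip.
    """
    if split_duration <= 0:
        raise ValueError("split_duration must be positive")
    segments = []
    index = 0
    # Multiply by the index instead of accumulating to avoid float drift.
    while index * split_duration < total_duration:
        start = index * split_duration
        end = min(start + split_duration, total_duration)
        segments.append((start, end))
        index += 1
    if len(segments) > 1 and segments[-1][1] - segments[-1][0] < min_last:
        segments.pop()
    return segments
```

For a 25-second video and a 10-second interval, this yields three clips, the last one 5 seconds long.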

video_deduplicator: deduplicates samples at the document level using exact matching of videos between documents. #227
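
Exact-match deduplication is commonly done by hashing the content and keeping only the first sample for each hash. The Python sketch below shows that idea under stated assumptions: the sample layout (a dict with a `videos` list of raw bytes), the function names, and the use of SHA-256 are illustrative choices, not details taken from the Data-Juicer code.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash raw video bytes; identical bytes give identical digests."""
    return hashlib.sha256(data).hexdigest()


def deduplicate(samples):
    """Keep the first sample for each distinct tuple of video hashes.

    samples: list of dicts with a 'videos' key holding a list of bytes
    objects (a hypothetical layout used only for this sketch).
    """
    seen = set()
    kept = []
    for sample in samples:
        key = tuple(content_hash(video) for video in sample["videos"])
        if key in seen:
            continue  # an earlier sample already has exactly these videos
        seen.add(key)
        kept.append(sample)
    return kept
```

Because matching is exact, two videos that differ by a single byte (for example after re-encoding) are treated as distinct.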

audio_duration_filter: keeps samples whose audios' durations are within a specified range. #177
audio_size_filter: keeps samples whose audios' sizes are within a specified range. #184
audio_nmf_snr_filter: keeps samples whose audios' Signal-to-Noise Ratios (SNRs, computed with the Non-Negative Matrix Factorization algorithm) are within a specified range. #189
audio_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to audio data more conveniently. #227

image_blur_mapper: adds random noise to images to blur them. #180
image_aesthetics_filter: filters samples according to the aesthetics scores of images. #227

"Bad" Data Exhibition EN ZH: shows how Data-Juicer finds "bad" data and what it looks like.
Awesome LLM Data EN: a collection of awesome LLM datasets with fine-grained tags.
Developer Guide enhancement EN ZH: adds guides on how to accelerate the models in your OP with GPUs and how to implement a batched OP for sample augmentation. #203 #220
OP Insight Visualization Demo code: adds a demo to visualize how each OP works.

Fix stats computation error in the ray mode due to the inappropriate initialization method. #173
Fix the bug that some images will be lost when converting their paths to absolute paths. #178
Fix the dependency problems of OPs that depend on other OPs. #181
Fix the bug that the predict.py tool gets stuck on the help page. #183
Fix face_area_filter: constrains the detection coordinates within the image. #202
Fix MMC4 conversion tools: resolves the situation where multiple images match the same sentence. #195
Fix or update invalid links in Data-Juicer. #201 #219
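
As an illustration of the face_area_filter fix listed above, the sketch below shows one way to constrain detection coordinates to the image before computing an area ratio. It is a hedged example, not the patch from #202: `clamp_box`, `face_area_ratio`, and the `(x1, y1, x2, y2)` box layout are assumptions.

```python
def clamp_box(box, image_width, image_height):
    """Clip an (x1, y1, x2, y2) box so it lies inside the image."""
    x1, y1, x2, y2 = box
    x1 = min(max(x1, 0), image_width)
    x2 = min(max(x2, 0), image_width)
    y1 = min(max(y1, 0), image_height)
    y2 = min(max(y2, 0), image_height)
    return (x1, y1, x2, y2)


def face_area_ratio(box, image_width, image_height):
    """Return the clipped box area divided by the image area."""
    x1, y1, x2, y2 = clamp_box(box, image_width, image_height)
    return (x2 - x1) * (y2 - y1) / (image_width * image_height)
```

Without clipping, a detection that extends past the image border could produce an area ratio greater than 1.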

Optimize the model management module. #196 #227
Optimize the unit test actions. #195 #196 #216 #227
Optimize the multiprocessing strategy; model inference efficiency can now be improved with GPU support. #203 #217 #222 #227
Update the docker image with JDK. #208
Support more multimodal (video) dataset conversion tools: #227
- InternVid: 234M video-caption data
- Youku-mPLUG: 36TB video-caption data
- Video-ChatGPT: 100k video-instruction data
Optimize the generated multimodal data storage. #227
Support running data-juicer process jobs on Aliyun PAI-DLC. #227
Better support for multi-machine distributed data processing in Ray mode. #227

Here we thank public contributors for their PRs to make Data-Juicer better!

@liuyanyi helps to fix a bug in quality classifier tools. #183
@co63oc helps to fix some typos. #215
@liuyanyi helps to provide the solution to add JDK in the docker image. #182 #208
@zhenqincn helps to add more papers to the Awesome LLM Data doc. #226