Release v0.2.0: Multimodal Support & DJ-SORA
New Features
- We introduce DJ-SORA to provide open large-scale, high-quality datasets for SORA-like models. #227
- We introduce hundreds of dedicated video, image, audio, text, and other multi-modal data processing operators and tools.
- Our paper has been accepted to the SIGMOD'24 industrial track! #211
- "BetterMixture", our second data-centric LLM competition, has kicked off and will end soon. #174
New OPs
Multimodal
- video_frames_text_similarity_filter: keeps samples whose similarities between sampled video frame images and text are within a specified range. #227
- video_tagging_from_frames_mapper: generates video tags from frames extracted from the video. #227
- video_tagging_from_audio_mapper: generates video tags from audio streams extracted from videos. #227
- video_captioning_from_video_mapper: generates captions from frame images extracted from the video to augment datasets. #227
- video_captioning_from_audio_mapper: captions a video according to its audio streams. #227
- image_captioning_mapper: generates captions based on a language model and the image. This OP will increase the number of samples in the dataset. #131 #191 #227
- image_captioning_from_gpt4v_mapper: generates captions based on GPT-4-Vision and the image. This OP will increase the number of samples in the dataset. #214 #227
- image_diffusion_mapper: generates and augments images based on the Stable Diffusion model and the original images and texts. This OP will increase the number of samples in the dataset. #200
Video
Filter
- video_duration_filter: keeps samples whose videos' durations are within a specified range. #227
- video_aspect_ratio_filter: filters samples according to the aspect ratios of their videos (width divided by height, r = w/h). #227
- video_resolution_filter: filters samples according to the resolution of their videos. #227
- video_ocr_area_ratio_filter: keeps samples whose detected text area ratios for specified frames in the video are within a specified range. #227
- video_aesthetics_filter: filters samples according to the aesthetics scores of frame images extracted from videos. #227
- video_motion_score_filter: keeps samples with video motion scores within a specified range. #227
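Most of the filters above share the same shape: compute one statistic per sample and keep the sample only if the value falls inside a configured range. As a rough illustration, here is a minimal sketch of that check for the aspect ratio r = w/h. The function name, parameters, and default bounds are hypothetical and are not taken from Data-Juicer's API.

```python
from fractions import Fraction


def keep_by_aspect_ratio(width, height, min_ratio="9/21", max_ratio="21/9"):
    """Return True if r = width / height lies within [min_ratio, max_ratio].

    Hypothetical helper for illustration only; the name and the default
    bounds are assumptions, not Data-Juicer's actual implementation.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive integers")
    # Exact rational arithmetic avoids floating-point edge cases at the bounds.
    ratio = Fraction(width, height)
    return Fraction(min_ratio) <= ratio <= Fraction(max_ratio)
```

With these assumed bounds, a 1920x1080 video (r = 16/9) would be kept, while a 3000x1000 video (r = 3) would be dropped.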
Mapper
- video_split_by_scene_mapper: splits videos into scene clips. #227
- video_split_by_duration_mapper: splits videos by a specified duration interval. #227
- video_split_by_key_frame_mapper: splits videos by their keyframes. #227
- video_resize_aspect_ratio_mapper: resizes the aspect ratios of videos (width divided by height, r = w/h) to a specified range. #227
- video_resize_resolution_mapper: resizes videos to fall within a given resolution range. #227
- video_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to video data more conveniently. #227
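To make the idea of splitting by a duration interval concrete, the sketch below computes the (start, end) boundaries of the resulting clips. It is only an illustration: the function name and the min_last_duration parameter are hypothetical, and the actual OP may handle the trailing clip differently.

```python
def duration_segments(total_duration, split_duration, min_last_duration=0.0):
    """Return (start, end) pairs, in seconds, covering a video of total_duration.

    Hypothetical helper for illustration only. A trailing clip shorter than
    min_last_duration is dropped (an assumed policy, not Data-Juicer's).
    """
    if split_duration <= 0:
        raise ValueError("split_duration must be positive")
    segments = []
    index = 0
    # Multiply by the index instead of accumulating to limit float drift.
    while index * split_duration < total_duration:
        start = index * split_duration
        end = min(start + split_duration, total_duration)
        segments.append((start, end))
        index += 1
    # Drop a too-short final clip, but never drop the only clip.
    if len(segments) > 1 and segments[-1][1] - segments[-1][0] < min_last_duration:
        segments.pop()
    return segments
```

Each pair could then be handed to a cutting tool such as ffmpeg; that step is omitted here.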
Deduplicator
- video_deduplicator: deduplicates samples at the document level using exact matching of videos between documents. #227
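Exact-match deduplication is commonly done by hashing the raw bytes of each item and keeping only the first sample seen for each digest. The sketch below shows that general technique; the function name, the input layout, and the choice of SHA-256 are assumptions made for illustration and do not describe how video_deduplicator is implemented.

```python
import hashlib


def deduplicate_exact(samples):
    """Keep the first sample for each distinct video content.

    `samples` is an iterable of (sample_id, video_bytes) pairs; this layout
    is hypothetical and chosen only to keep the example self-contained.
    Returns the ids of the kept samples in their original order.
    """
    seen_digests = set()
    kept_ids = []
    for sample_id, video_bytes in samples:
        # Identical bytes always produce identical digests.
        digest = hashlib.sha256(video_bytes).hexdigest()
        if digest not in seen_digests:
            seen_digests.add(digest)
            kept_ids.append(sample_id)
    return kept_ids
```

Note that this only catches byte-identical videos; re-encoded or trimmed copies hash differently and would need a similarity-based method.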
Audio
- audio_duration_filter: keeps samples whose audios' durations are within a specified range. #177
- audio_size_filter: keeps samples whose audios' sizes are within a specified range. #184
- audio_nmf_snr_filter: keeps samples whose audios' signal-to-noise ratios (SNRs, computed with a Non-Negative Matrix Factorization algorithm) are within a specified range. #189
- audio_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to audio data more conveniently. #227
Image
- image_blur_mapper: adds random noise to images to blur them. #180
- image_aesthetics_filter: filters samples according to the aesthetics scores of images. #227
Document Updates
- "Bad" Data Exhibition EN ZH: shows how Data-Juicer finds "bad" data and what such data looks like.
- Awesome LLM Data EN: a collection of awesome LLM datasets with fine-grained tags.
- Developer Guide enhancement EN ZH: adds guides on how to accelerate the models in your OP with GPUs and how to implement a batched OP for sample augmentation. #203 #220
- OP Insight Visualization Demo code: adds a demo to visualize how each OP works.
Bugs Fixed
- Fix a stats computation error in Ray mode caused by an inappropriate initialization method. #173
- Fix the bug that some images will be lost when converting their paths to absolute paths. #178
- Fix the dependency problems of OPs that depend on other OPs. #181
- Fix the bug that the predict.py tool gets stuck on the help page. #183
- Fix face_area_filter: constrains the detection coordinates within the image. #202
- Fix MMC4 conversion tools: resolves the situation where multiple images match the same sentence. #195
- Fix or update invalid links in Data-Juicer. #201 #219
Others
- Optimize the model management module. #196 #227
- Optimize the unit test actions. #195 #196 #216 #227
- Optimize the multiprocessing strategy; model inference efficiency can now be improved through GPU support. #203 #217 #222 #227
- Update the docker image with JDK. #208
- Support more multimodal (video) dataset conversion tools: #227
  - InternVid: 234M video-caption data
  - Youku-mPLUG: 36TB video-caption data
  - Video-ChatGPT: 100k video-instruction data
- Optimize the generated multimodal data storage. #227
- Support running data-juicer process jobs on Aliyun PAI-DLC. #227
- Better support for multi-machine distributed data processing in Ray mode. #227
Acknowledgment
Here we thank public contributors for their PRs to make Data-Juicer better!