20 Dec 12:15

yxdyc

a26dcc7

Release v1.0.2

Major Updates

Added more mapper/grouper/aggregator OPs for post-tuning scenarios.
Optimized the distributed mode performance and usability with more automatic features.

DJ-Operators

extract_support_text_mapper, relation_identity_mapper, python_file_mapper, #500
naive_grouper, key_value_grouper, #500
nested_aggregator, entity_attribute_aggregator, most_relavant_entities_aggregator, #500
video_extract_frames_mapper, #507

Performance

Optimize ray mode performance, #442
Patch for Performance Benchmark in CI/CD workflows, #506
DJ Ray mode supports streaming loading of jsonl files, #515
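
Streaming loading means records are parsed lazily rather than reading the whole file into memory first. The following is a minimal plain-Python sketch of that idea only; it is not the Ray-based implementation from #515, and the function name iter_jsonl is hypothetical.

```python
import json
from typing import Any, Dict, Iterator


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of a jsonl file.

    Hypothetical helper for illustration: only the current line is held
    in memory, so the file size does not bound memory usage.
    """
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

A consumer can iterate over iter_jsonl(path) and process each sample as it arrives, without waiting for the full file to load.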

Usability and Analysis

Support dj-install at the recipe level, #508
Support dj-analyze with --auto mode, #512
Support op-wise automatic insight mining, #516

Acknowledgment

Thanks to Data-Juicer users and contributors for their helpful feedback, issues and PRs!

06 Dec 09:09

BeachWang

v1.0.1

9f1b0c8

Release v1.0.1

Major Updates

🚀 Supports automatically arranging operators from fastest to slowest based on their execution speed, and also supports automating the operator batch size according to the execution speed. #464
🚀 [UnitTest] Performance benchmark for efficiency tests of 4 modalities. Reports will be uploaded to internal wandb server. #483
💥 Added some useful OPs, including the construction of DPO training data and a lightweight user-customizable OP interface. See more details below~ #491 #492 #493
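
The operator reordering described in the first item above can be pictured as a sort over measured speeds. This is a rough sketch under stated assumptions, not the logic from #464: the function name and the samples-per-second measurements are hypothetical.

```python
from typing import Dict, List


def order_ops_by_speed(op_names: List[str],
                       samples_per_second: Dict[str, float]) -> List[str]:
    """Return the operators sorted from fastest to slowest.

    `samples_per_second` is a hypothetical mapping from OP name to a
    measured throughput; higher means faster.
    """
    return sorted(op_names,
                  key=lambda name: samples_per_second[name],
                  reverse=True)


measured = {"op_a": 10.0, "op_b": 500.0, "op_c": 50.0}
ordered = order_ops_by_speed(["op_a", "op_b", "op_c"], measured)
# ordered == ["op_b", "op_c", "op_a"]
```

Running cheap operators first is useful when they discard samples, since the slower operators then see less data.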

OPs

Text OPs

pair_preference_mapper: Mapper to construct preference answers for QA pairs. #491

Script OPs

python_lambda_mapper: Mapper for executing customized Python lambda functions on data samples. #492
python_file_mapper: Mapper for executing customized Python functions on data samples. #493
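
To illustrate what executing a user-supplied lambda on a sample involves, here is a minimal sketch. It is not the python_lambda_mapper implementation; the helper name build_lambda and its validation rules are assumptions made for the example.

```python
import ast
from typing import Any, Callable, Dict

Sample = Dict[str, Any]


def build_lambda(lambda_str: str) -> Callable[[Sample], Sample]:
    """Turn a string such as "lambda s: {...}" into a callable.

    Hypothetical helper: the string is parsed first so that anything
    other than a one-argument lambda expression is rejected before it
    is evaluated.
    """
    tree = ast.parse(lambda_str, mode="eval")
    if not isinstance(tree.body, ast.Lambda):
        raise ValueError("expected a lambda expression")
    if len(tree.body.args.args) != 1:
        raise ValueError("the lambda must take exactly one argument")
    # Evaluating user code is only acceptable for trusted input.
    return eval(compile(tree, "<lambda_str>", "eval"))


upper = build_lambda("lambda s: {'text': s['text'].upper()}")
result = upper({"text": "hello"})
# result == {"text": "HELLO"}
```

Parsing before evaluating gives a clear error for malformed input, but it does not make untrusted code safe to run.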

Bugs Fixed

Add an argument to control whether to open Monitor for data processing. It's True by default. #483
For the mp start method of monitor, set it to "spawn" for Windows systems and "fork" for others. #483
Update transformers version to >=4.47.0 to avoid "shape mismatch" bug from older version 4.46.3. #483
Fix the logic errors in Turbo acceleration and batch processing, and ensure that map and filter are consistent in this part of the logic. #504
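
The platform-dependent start method for the monitor mentioned above can be selected as in the sketch below. It uses only the standard multiprocessing module; the function name monitor_start_method is hypothetical and the actual code in #483 may differ.

```python
import multiprocessing as mp
import sys


def monitor_start_method() -> str:
    """Pick "spawn" on Windows (where "fork" is unavailable) and "fork" elsewhere."""
    return "spawn" if sys.platform.startswith("win") else "fork"


# A context keeps the choice local instead of changing the global default.
ctx = mp.get_context(monitor_start_method())
```

Processes for the monitor would then be created through ctx (for example ctx.Process), leaving the start method of the rest of the program untouched.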

Others

Pin the PyAV version to prevent inconsistent updates. #504
Skip some unit tests for audio OPs to avoid lazy_loader failure during multiprocessing. #503
Remove unnecessary UNFORKABLE marks for some OPs. #491
Refine the docker image building. Add a new self-hosted runner for docker image building, optimize the building logic for auto docker image building on release, change the default full image to a GPU-version image. #494 #501

Acknowledgment

Here we thank public contributors for their PRs and issues to make Data-Juicer better!

22 Nov 02:50

yxdyc

v1.0.0

9caaaa9

Release v1.0.0: Refactor DJ-Dataset & DJ-Operator, Sandbox, and more exciting features!

Major Updates

🚀 Refactor Data-Juicer Operator & Dataset for better usability! We combine our two backends, HuggingFace Dataset and Ray Dataset, into a unified DJ-Dataset, and unify and introduce new invoking interfaces. Based on this, we add a fault-tolerant strategy during the data processing, helping users to know the actual reasons for processing failure. #359 #366
🧪 [Experimental] Data-Juicer Sandbox toolkit is now available! Users are allowed to develop datasets and models in a co-development way with the highly customizable Sandbox to obtain better performance. For more details, please refer to the docs. #273 #291 #312 #332 #364
🚀 Basic API server based on FastAPI is now available in Data-Juicer! Now users can make use of the capabilities of OPs with API service. #468
🚀 Support adaptive resource management:
- Adaptive number of processors for model-based OPs according to the GPU memory and other types of resource utilization. #270 #329 #354
- Adaptive batch size for batched OPs according to their resource utilization to maximize the OP speed. #429
💥 We presented a tutorial, Multi-modal Data Processing for Foundation Models: Practical Guidance and Use Cases, at KDD'24. #310
💥 A lot of additions and improvements were made to OPs, DJ-Engine, and CI/CD. See more details below~
🛝 A playground for Data-Juicer is opened for user trial. #277 #368

OPs

Text

ray_document_deduplicator: supports Ray-based distributed exact-match deduplication for text-only datasets. #263
Support sentencepiece tokenizer for MinHash deduplicators. #269
generate_qa_from_text_mapper: generates question and answer pairs from input texts. #333 #454
generate_qa_from_examples_mapper: generates question and answer pairs based on examples. #338 #454
optimize_qa_mapper: optimizes the question-answer pairs in question-answering samples. #338 #454
optimize_query_mapper: optimizes the query in question-answering samples. #338 #454
optimize_response_mapper: optimizes the response in question-answering samples. #454
calibrate_qa_mapper: calibrates question-answer pairs based on reference text. #463
calibrate_query_mapper: calibrates query in question-answer pairs based on reference text. #463
calibrate_response_mapper: calibrates response in question-answer pairs based on reference text. #463
text_chunk_mapper: splits input text to chunks. #481
extract_entity_attribute_mapper: extracts attributes for given entities from the text. #481
extract_entity_relation_mapper: extracts entities and relations in the text for knowledge graph. #481
extract_event_mapper: extracts events and relevant characters in the text. #481
extract_keyword_mapper: generates keywords for the text. #481
extract_nickname_mapper: extracts nickname relationships in the text. #481

Image

image_face_blur_mapper: blurs faces detected in images. #249
image_nsfw_filter: keeps samples containing images with NSFW scores below the threshold. #252
image_watermark_filter: keeps samples containing images with predicted watermark probabilities below the threshold. #256
ray_image_deduplicator: supports Ray-based distributed exact-match deduplication for image or image-text datasets. #263
image_pair_similarity_filter: keeps image pairs with image feature cosine similarity within the specified range based on a CLIP model. #393
image_tagging_mapper: generates image tags from the input images. #423
image_face_count_filter: keeps samples containing images with face counts within the specified range. #446

Video

video_face_blur_mapper: blurs faces detected in videos. #253
video_remove_watermark_mapper: removes the watermarks in given regions from the videos. #236
video_nsfw_filter: keeps samples containing videos with NSFW scores below the threshold. #252
video_watermark_filter: keeps samples containing videos with predicted watermark probabilities below the threshold. #256
ray_video_deduplicator: supports Ray-based distributed exact-match deduplication for video or video-text datasets. #263
video_tagging_from_frames_filter: keeps samples containing videos with given tags. #260
video_captioning_from_frames_mapper: generates samples whose captions are generated based on an image-to-text model and sampled video frames. Captions from different frames will be concatenated into a single string. #257
video_captioning_from_summarizer_mapper: generates video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, ...). #250
video_motion_score_raft_filter: keeps samples with video motion scores (based on RAFT model) within a specific range. #478
Enhance the video_motion_score_filter to support float sampling FPS, frame resizing, optical flow magnitude normalization, and so on. #361

Misc.

Switch face detection used in 3 OPs (image_face_ratio_filter, image_face_blur_mapper, video_face_blur_mapper) from dlib to OpenCV to avoid dependency problems. #320
Deduplicators for multimodal datasets are allowed to consider text information as well. #313
Support batched processing for some OPs. #406 #435

Others (Engine, Job Control and Tools)

Support more multimodal (video) dataset conversion tools: MSR-VTT #248
Support distributed processing script for Slurm. #242
Support Minhash-LSH deduplication tools based on Spark. #290
Enable GPU usage for Ray executor. #274
Add debug mode for Data-Juicer. #303
Add video generation tools for several metrics. #273 #312
Deploy a self-hosted runner for unit tests and enable unit tests for Ray mode. #304
Add sampled frames from videos for video OPs to support OP fusion. #271
Allow to save stats for each OP respectively by specifying the exporting paths for them. #309
Add a new field to record the source files of multimodal data when they are augmented or regenerated by some OPs, so it's convenient to trace back. #317
Support turbo mode to disable some processing-unrelated functions to maximize the processing speed and save resource utilization. #402
Update type annotations from jsonargparse to Pydantic. #422
Add a Monitor module to monitor the resource utilization during data processing for each OP. #429
Allow lazy importing for third-party libraries and installing dependencies if they are not installed. #414 #443
Allow batched processing for all OPs based on the single-sample version of compute_stats/process methods to avoid modifying them to a batched version manually. #448
Enable unit test coverage report. #460
Support invoking API models for interaction with OpenAI-compatible APIs. #463 #479
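
The item above about deriving batched processing from single-sample methods (#448) can be pictured as a wrapper that splits a columnar batch into rows, applies the single-sample function, and reassembles the columns. This is a sketch of the general technique; the wrapper name and the dict-of-lists batch layout are assumptions, not Data-Juicer's actual interface.

```python
from typing import Any, Callable, Dict, List

Sample = Dict[str, Any]
Batch = Dict[str, List[Any]]


def batched_from_single(process_single: Callable[[Sample], Sample]
                        ) -> Callable[[Batch], Batch]:
    """Wrap a single-sample function so it accepts a columnar batch."""

    def process_batched(batch: Batch) -> Batch:
        keys = list(batch.keys())
        size = len(batch[keys[0]]) if keys else 0
        # Columns -> rows
        rows = [{k: batch[k][i] for k in keys} for i in range(size)]
        out_rows = [process_single(row) for row in rows]
        # Rows -> columns (an empty batch keeps its original keys)
        out_keys = list(out_rows[0].keys()) if out_rows else keys
        return {k: [row[k] for row in out_rows] for k in out_keys}

    return process_batched


strip_text = batched_from_single(lambda s: {**s, "text": s["text"].strip()})
cleaned = strip_text({"text": [" a ", "b "], "id": [1, 2]})
# cleaned == {"text": ["a", "b"], "id": [1, 2]}
```

The benefit is that an OP author writes only the single-sample logic, while the framework still calls it batch by batch.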

Document Updates

Refine documentation system based on Sphinx. #245
Regular document updates. #234 #246
Update the class importing and document building logics for better automation. #299
Reorganize the operator documents for better reading. #472

Bugs Fixed

Fix the bug of non-existent videos returned by the video splitting function given a short duration. #243
Fix the bug that the produced multimodal data would be stored in nested dirs in different ops. #247
Fix some problems in demos. #244
Fix "Undefined punctuation_pattern" error in two OPs. #301
Exceptions and errors can be reraised to the upper level and the status code can be returned to the system correctly. #287
Fix the bug that type hint checking did not work for config files. #302
Fix the bug that parameters in the base classes cannot be parsed in some OPs. #311
Fix the memory leak in video OPs. #374
Fix the bug that two OPs (video_aesthetics_filter and image_diffusion_mapper) cannot make use of GPUs. #389
Fix the bug of checkpoints not being restored correctly when the current process list has fewer OPs than the previous one. #391

Acknowledgment

Here we thank public contributors for their PRs to make Data-Juicer better!

@chg0901 helps to fix typos in documents. #237
@lingzhq helps to update the paper list in Awesome Data-Model Co-Development of MLLMs. #289
@shiweijiezero helps fix the bugs in updating the data keys. #300
@seanzhang-zhichen helps to support multiple patterns for replace_content_mapper. #319
@simplaj helps to fix a bug of a non-predefined attribute for video_captioning_from_summarizer_mapper. #343
@zhenqincn helps to reorganize the paper list and add more papers from our survey in Awesome Data-Model Co-Development of MLLMs. #352 #381 #456 #461
@2108038773 helps to add trust_remote_code argument for some public models on HuggingFace. #382 #385
@TobyJasper helps to fix typos in documents and contribute a new OP image_face_count_filter. #392 #452
@co63oc helps to fix some typos in documents and code. #427

Contributors

co63oc, chg0901, and 7 other contributors

07 Mar 12:24

HYLcool

v0.2.0

156ed20

Release v0.2.0: Multimodal Support & DJ-SORA

New Features

🚀 We introduce DJ-SORA to provide open large-scale, high-quality datasets for SORA-like models. #227
🚀 We introduce hundreds of dedicated video, image, audio, text, and other multi-modal data processing operators and tools.
💥 Our paper has been accepted by SIGMOD'24 industrial track! #211
💥 "BetterMixture" — Our second data-centric LLM competition has kicked off and is about to end soon. #174

New OPs

Multimodal

video_frames_text_similarity_filter: keeps samples whose similarities between sampled video frame images and text are within a specific range. #227
video_tagging_from_frames_mapper: generates video tags from frames extracted from the video. #227
video_tagging_from_audio_mapper: generates video tags from audio streams extracted from videos. #227
video_captioning_from_video_mapper: generates captions from frame images extracted from video to augment datasets. #227
video_captioning_from_audio_mapper: captions a video according to its audio streams. #227
image_captioning_mapper: generates captions based on a language model and the image. This OP will increase the number of samples in the dataset. #131 #191 #227
image_captioning_from_gpt4v_mapper: generates captions based on GPT-4-Vision and the image. This OP will increase the number of samples in the dataset. #214 #227
image_diffusion_mapper: generates and augments the images based on the Stable Diffusion model and their original images and texts. This OP will increase the number of samples in the dataset. #200

Video

Filter

video_duration_filter: keeps samples whose videos' durations are within a specified range. #227
video_aspect_ratio_filter: filters samples according to the aspect ratios of videos (a fraction of width by height, r=w/h) in them. #227
video_resolution_filter: filters samples according to the resolution of videos in them. #227
video_ocr_area_ratio_filter: keeps samples whose detected text area ratios for specified frames in the video are within a specified range. #227
video_aesthetics_filter: filters samples according to the aesthetics score of frame images extracted from videos. #227
video_motion_score_filter: keeps samples with video motion scores within a specific range. #227

Mapper

video_split_by_scene_mapper: splits videos into scene clips. #227
video_split_by_duration_mapper: splits videos by specified duration interval. #227
video_split_by_key_frame_mapper: splits videos by their keyframes. #227
video_resize_aspect_ratio_mapper: resizes aspect ratios of videos (a fraction of width by height, r=w/h) to a specified range. #227
video_resize_resolution_mapper: maps videos to ones with a given resolution range. #227
video_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to video data more conveniently. #227

Deduplicator

video_deduplicator: deduplicates samples at document-level using exact matching of videos between documents. #227
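
Exact-match deduplication is commonly done by hashing the raw bytes of each item and keeping only the first sample per digest. The sketch below shows that general technique; the function name and the choice of SHA-256 are assumptions, and the hash used by video_deduplicator is not stated in these notes.

```python
import hashlib
from typing import List


def dedup_exact(items: List[bytes]) -> List[bytes]:
    """Keep the first occurrence of each byte-identical item, preserving order."""
    seen = set()
    kept = []
    for data in items:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(data)
    return kept


unique = dedup_exact([b"clip-1", b"clip-2", b"clip-1"])
# unique == [b"clip-1", b"clip-2"]
```

Storing digests rather than the content itself keeps the memory footprint small even when the videos are large.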

Audio

audio_duration_filter: keeps samples whose audios' durations are within a specified range. #177
audio_size_filter: keeps samples whose audios' sizes are within a specified range. #184
audio_nmf_snr_filter: keeps samples whose audios' Signal Noise Ratios (computed based on Non-Negative Matrix Factorization algorithm) are within a specified range. #189
audio_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to audio data more conveniently. #227

Image

image_blur_mapper: adds random noises to images to blur them. #180
image_aesthetics_filter: filters samples according to the aesthetics scores of images. #227

Document Updates

"Bad" Data Exhibition (EN, ZH): shows how Data-Juicer finds "bad" data and what it looks like.
Awesome LLM Data (EN): a collection of awesome LLM datasets with fine-grained tags.
Developer Guide enhancement (EN, ZH): adds guides on how to accelerate the models in your OP with GPUs and how to implement a batched OP for sample augmentation. #203 #220
OP Insight Visualization Demo (code): adds a demo to visualize how each OP works.

Bugs Fixed

Fix stats computation error in the ray mode due to the inappropriate initialization method. #173
Fix the bug that some images will be lost when converting their paths to absolute paths. #178
Fix the dependency problems of OPs that depend on other OPs. #181
Fix the bug that the predict.py tool gets stuck on the help page. #183
Fix face_area_filter: constrains the detection coordinates within the image. #202
Fix MMC4 conversion tools: resolves the situation where multiple images match the same sentence. #195
Fix or update invalid links in Data-Juicer. #201 #219

Others

Optimize the model management module. #196 #227
Optimize the unit test actions. #195 #196 #216 #227
Optimize the multiprocessing strategy; model inference efficiency can also be improved thanks to GPU support. #203 #217 #222 #227
Update the docker image with JDK. #208
Support more multimodal (video) dataset conversion tools: #227
- InternVid: 234M video-caption data
- Youku-mPLUG: 36TB video-caption data
- Video-ChatGPT: 100k video-instruction data
Optimize the generated multimodal data storage. #227
Support running data-juicer process jobs on Aliyun PAI-DLC. #227
Better support for multi-machine distributed data processing in Ray mode. #227

Acknowledgment

Here we thank public contributors for their PRs to make Data-Juicer better!

@liuyanyi helps to fix a bug in quality classifier tools. #183
@co63oc helps to fix some typos. #215
@liuyanyi helps to provide the solution to add JDK in the docker image. #182 #208
@zhenqincn helps to add more papers to the Awesome LLM Data doc. #226

Contributors

co63oc, liuyanyi, and zhenqincn

05 Jan 09:31

HYLcool

v0.1.3

a3c8310

Release v0.1.3: support more Python versions; support multimodal data; more OPs; bugs fixed

New Features

Data-Juicer now supports Python 3.7-3.10!
- We released a pybind version of the simhash-py library, named simhash-pybind, to solve the Python version limitation problem.
- We tested several version-dependent third-party libraries (e.g. dill, kenlm, ...) and validated their availability on different Python versions.
Multimodal dataset analysis and processing are now supported. #64 #91 #95 #106
- A novel intermediate multimodal sample format: using some special tokens to split text chunks and represent non-text information.
- Several dataset format conversion tools for popular multimodal datasets: LLaVA, MMC4, WavCaps, ...
- Lots of multimodal OPs are also released: see categories Image and Multimodal in the section New OPs below.
Auto-HPO tools are now available, which can help users find better hyperparameters for OPs according to specified objective functions or with simple 3-sigma rules only. #65 #140
Some content cleaning mappers (e.g. email, IP, ...) now support replacing regex patterns with specified strings, not just with empty ones. Additionally, a general version OP is implemented as a new OP replace_content_mapper. #143
Some collectors, metrics, and drawing functions are added to the analysis module to help users measure the token distribution of a single dataset or distribution difference between different datasets. #160
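
The 3-sigma rule mentioned for the Auto-HPO tools can be illustrated as follows: a filter's lower and upper thresholds are set to the mean of a statistic minus and plus three standard deviations. This is a sketch of the rule itself; the function name is hypothetical, and the use of the population standard deviation is an assumption rather than a detail from #65 or #140.

```python
import statistics
from typing import Sequence, Tuple


def three_sigma_range(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean - 3*std, mean + 3*std) for a list of per-sample stats."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    return mean - 3 * std, mean + 3 * std


low, high = three_sigma_range([0.0, 2.0])
# mean is 1.0 and std is 1.0, so (low, high) == (-2.0, 4.0)
```

Samples whose statistic falls outside the returned range would be treated as outliers and filtered out.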

New OPs

Text

chinese_convert_mapper: converts Chinese between Traditional Chinese, Simplified Chinese, and Japanese Kanji (by opencc) #51
remove_non_chinese_character_mapper: removes non-Chinese characters in text samples. #51
text_action_filter: keeps samples containing action verbs in their texts. #122
text_entity_dependency_filter: keeps samples containing entity nouns related to other tokens in the dependency tree of the texts. #122
replace_content_mapper: replaces all content in the text that matches a specific regular expression pattern with a designated replacement string. #143
remove_repeat_sentences_mapper: removes repeated sentences in the text. #149

Image

image_shape_filter: keeps samples containing images with widths and heights within the specified ranges. #74
image_aspect_ratio_filter: keeps samples containing images with aspect ratios (w/h) within the specified range. #64
image_size_filter: keeps samples containing images whose sizes in bytes are within the specified range. #73
face_area_filter: keeps samples containing images with face area ratios within the specified range. #110
image_deduplicator: deduplicates samples at document-level using exact matching of images between documents. #72

Multimodal

image_text_similarity_filter: keeps samples with image-text feature cosine similarity within the specified range based on a CLIP model. #69
image_text_matching_filter: keeps samples with image-text classification matching scores within the specified range based on a BLIP model. #100
phrase_grounding_recall_filter: keeps samples whose locating/grounding recalls of phrases extracted from text in the images are within a specified range. #139

Bugs fixed

Pin pandas==2.0.0 and fsspec==2023.3.0 to avoid unexpected errors from third-party dependencies. #38 #42
Fix the bug where OPs nlpaug_en_mapper and nlpcda_zh_mapper generate indefinite numbers of augmented samples. #76
Fix the bug where maximum_line_length_filter might generate stats of inconsistent types (int vs. float), which leads to an error when processing datasets. #147
Fix the bug of missing attribute dataset_dir when the input dataset path is remote or a mixture of several datasets. #155 #157
Fix the bug of commandline arguments parsing error in some cases. #108 #165
Store simhash value as string type to avoid errors from PyArrow. #168 #170

Others

Dependency importing optimization: only require and import some dependencies when using. #35 #82
Release demos and datasets on HuggingFace, and release models trained with our refined datasets on both ModelScope and HuggingFace. #42 #54
Optimize the cache directory selection logic. #43
Support limiting the number of samples when mixing datasets. #86
Avoid extra unnecessary model preparation when enabling tokenization in some OPs. #99
OP language_id_score_filter supports keeping samples in multiple languages now. #125 #151

Acknowledgement

Here we thank public contributors for their PRs to make Data-Juicer better!

@JONGSKY helps to remove some unnecessary code. #85
@xuruidong helps to fix several broken links in the README doc. #142

Contributors

xuruidong and JONGSKY

28 Sep 06:32

HYLcool

v0.1.2

5bd715d

Release v0.1.2: more core functions are available now.

New OPs

nlpaug_en_mapper: simple data augmentation using nlpaug library for English corpus. #17
nlpcda_zh_mapper: simple data augmentation using nlpcda library for Chinese corpus. #17
token_num_filter: filter out samples by the number of tokens in them. HF tokenizers are supported. #24

New features

OP Fusion #14
- Now Filters that share the same contextual variables can be fused into one OP, saving at most 25% time when processing datasets.
Cache management #19
- Cache management now works in Data-Juicer thanks to a new serialization method.
- Cache compression is supported: it automatically compresses caches when they are not in use and decompresses them when needed, which saves at most 50% disk space.
Distributed data processing with Ray is supported now. #21
Config sys optimization:
- Only keep text_keys and remove previous misleading arg text_key(s)_to_process/load. #13
- A new argument export_in_parallel is added to control whether to export the result datasets in parallel. #17
- Display the config table after config parsing is ready. #17
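
The OP Fusion item above rests on a simple idea: when several Filters need the same intermediate result (for example the tokenized words of a text), that result is computed once and shared. The sketch below illustrates the idea only; all names are hypothetical and it is not the implementation from #14.

```python
from typing import Callable, Dict, List

Context = Dict[str, List[str]]
Check = Callable[[Context], bool]


def min_words(n: int) -> Check:
    """Keep texts with at least n words."""
    return lambda ctx: len(ctx["words"]) >= n


def max_word_length(n: int) -> Check:
    """Keep texts whose words are all at most n characters long."""
    return lambda ctx: all(len(w) <= n for w in ctx["words"])


def fuse_filters(checks: List[Check]) -> Callable[[str], bool]:
    """Build one filter that computes the shared context once per sample."""

    def fused(text: str) -> bool:
        context = {"words": text.split()}  # shared work, done a single time
        return all(check(context) for check in checks)

    return fused


keep = fuse_filters([min_words(2), max_word_length(5)])
# keep("hello world") is True; keep("hello") is False
```

Without fusion each Filter would split the text again, so the saving grows with the cost of the shared step and the number of Filters that use it.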

Others

Replace original string constants with constant enums. #13
Expand the checkpoint protection range to cover the exporting process. #14
Remove extra intermediate variables storage in document_simhash_deduplicator to save more memory. #14
Docs updates. #15 #16
The PyPI package is available. You can now install data-juicer with pip install py-data-juicer. #23
Docker building is available now. The official docker image for Docker Hub is in progress. #23
Deploy the unit tests for Data-Juicer. #29

11 Aug 05:31

yxdyc

v0.1.0

d4ab729

Release v0.1.0, the first internal version for open-source

Summarization - Table of Contents

Data-Juicer: A Data-Centric Text Processing System for Large Language Models
Table of Contents
- Features
- Prerequisites
- Installation
- Quick Start
  - Data Processing
  - Data Analysis
  - Data Visualization
  - Build Up Config Files
  - Preprocess raw data (Optional)
- Documentation | 文档
- Data Recipes
- Demos
- License
- Contributing
- References

Features

Broad Range of Operators: Equipped with 50+ core operators (OPs), including Formatters, Mappers, Filters, Deduplicators, and beyond.
Specialized Toolkits: Feature-rich specialized toolkits such as Text Quality Classifier, Dataset Splitter, Analysers, Evaluators, and more that elevate your dataset handling capabilities.
Systematic & Reusable: Empowering users with a systematic library of reusable config recipes and OPs, designed to function independently of specific datasets, models, or tasks.
Data-in-the-loop: Allowing detailed data analyses with an automated report generation feature for a deeper understanding of your dataset. Coupled with real-time multi-dimension automatic evaluation capabilities, it supports a feedback loop at multiple stages in the LLM development process.
Comprehensive Processing Recipes: Offering tens of pre-built data processing recipes for pre-training, SFT, en, zh, and more scenarios.
User-Friendly Experience: Designed for simplicity, with comprehensive documentation, easy start guides and demo configs, and intuitive configuration with simple adding/removing OPs from existing configs.
Flexible & Extensible: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to implement your own OPs for customizable data processing.
Enhanced Efficiency: Providing a speedy data processing pipeline requiring less memory, optimized for maximum productivity.

Releases: modelscope/data-juicer
