Release v0.1.3: support more Python versions; support multimodal data; more OPs; bugs fixed

@HYLcool HYLcool released this 05 Jan 09:31
· 192 commits to main since this release
a3c8310

New Features

  • Data-Juicer now supports Python 3.7-3.10!
    • We released a pybind version of the simhash-py library, named simhash-pybind, to remove the Python version limitation.
    • We tested several version-dependent third-party libraries (e.g. dill, kenlm, ...) and validated their availability on different Python versions.
  • Multimodal dataset analysis and processing are now supported. #64 #91 #95 #106
    • A novel intermediate multimodal sample format that uses special tokens to split text chunks and represent non-text information.
    • Several dataset format conversion tools for popular multimodal datasets: LLaVA, MMC4, WavCaps, and more.
    • Lots of multimodal OPs are also released: see categories Image and Multimodal in the section New OPs below.
  • Auto-HPO tools are now available. They help users find better hyperparameters for OPs, either according to specified objective functions or with simple 3-sigma rules alone. #65 #140
  • Some content cleaning mappers (e.g. email, IP, ...) now support replacing regex matches with specified strings, not just with empty ones. Additionally, a general-purpose version is implemented as the new OP replace_content_mapper. #143
  • Some collectors, metrics, and drawing functions are added to the analysis module to help users measure the token distribution of a single dataset or the distribution difference between different datasets. #160
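
As a rough illustration of the 3-sigma idea mentioned above, the sketch below derives a keep range from per-sample statistics as mean ± 3 standard deviations. It is not the Auto-HPO tool's code: the function name `three_sigma_bounds` and the sample values are hypothetical, and only the Python standard library is used.

```python
import statistics


def three_sigma_bounds(values):
    """Return (lower, upper) as mean -/+ 3 population standard deviations."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std


# Hypothetical per-sample stats (e.g. text lengths) with one extreme outlier.
stats = [100] * 10 + [90] * 5 + [110] * 5 + [5000]

lower, upper = three_sigma_bounds(stats)

# Values inside the range would be kept by a filter using these bounds.
kept = [v for v in stats if lower <= v <= upper]
print(f"range=({lower:.1f}, {upper:.1f}), kept {len(kept)} of {len(stats)}")
```

Note that with very small datasets a single outlier can inflate the standard deviation enough to stay inside the range, so bounds derived this way are only a starting point.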

New OPs

Text

  • chinese_convert_mapper: converts Chinese text between Traditional Chinese, Simplified Chinese, and Japanese Kanji (via opencc). #51
  • remove_non_chinese_character_mapper: removes non-Chinese characters in text samples. #51
  • text_action_filter: keeps samples containing action verbs in their texts. #122
  • text_entity_dependency_filter: keeps samples containing entity nouns related to other tokens in the dependency tree of the texts. #122
  • replace_content_mapper: replaces all content in the text that matches a specific regular expression pattern with a designated replacement string. #143
  • remove_repeat_sentences_mapper: removes repeated sentences in text samples. #149
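
The behavior described for replace_content_mapper (replace every regex match with a designated string, or remove it when the replacement is empty) can be sketched with Python's `re.sub`. The helper `replace_content`, the email pattern, and the `<EMAIL>` placeholder are assumptions for illustration, not the OP's actual implementation or parameters.

```python
import re


def replace_content(text, pattern, repl=""):
    """Replace every match of `pattern` in `text` with `repl`."""
    return re.sub(pattern, repl, text)


# Hypothetical, simplified email pattern for demonstration only.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

sample = "Contact alice@example.com or bob@example.org for details."

masked = replace_content(sample, EMAIL_PATTERN, "<EMAIL>")  # replace with a token
removed = replace_content(sample, EMAIL_PATTERN)            # replace with empty string
print(masked)
print(removed)
```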

Image

  • image_shape_filter: keeps samples containing images with widths and heights within the specified ranges. #74
  • image_aspect_ratio_filter: keeps samples containing images with aspect ratios (w/h) within the specified range. #64
  • image_size_filter: keeps samples containing images whose sizes in bytes are within the specified range. #73
  • face_area_filter: keeps samples containing images with face area ratios within the specified range. #110
  • image_deduplicator: deduplicates samples at document-level using exact matching of images between documents. #72
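
Exact-match image deduplication, as described for image_deduplicator, can be sketched by hashing the raw image bytes and keeping only the first sample per digest. The sample layout (an `image_bytes` field) and the choice of SHA-256 are assumptions made for this example, not details taken from the OP.

```python
import hashlib


def deduplicate_by_image(samples):
    """Keep the first sample for each distinct image byte content."""
    seen = set()
    kept = []
    for sample in samples:
        digest = hashlib.sha256(sample["image_bytes"]).hexdigest()
        if digest in seen:
            continue  # identical bytes already seen: drop this sample
        seen.add(digest)
        kept.append(sample)
    return kept


# Hypothetical samples; real ones would hold bytes read from image files.
samples = [
    {"id": 0, "image_bytes": b"\x89PNG-aaa"},
    {"id": 1, "image_bytes": b"\x89PNG-bbb"},
    {"id": 2, "image_bytes": b"\x89PNG-aaa"},  # exact duplicate of id 0
]

unique = deduplicate_by_image(samples)
print([s["id"] for s in unique])
```

Exact matching only catches byte-identical images; resized or re-encoded copies produce different digests and would not be removed by this approach.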

Multimodal

  • image_text_similarity_filter: keeps samples with image-text feature cosine similarity within the specified range based on a CLIP model. #69
  • image_text_matching_filter: keeps samples with image-text classification matching scores within the specified range based on a BLIP model. #100
  • phrase_grounding_recall_filter: keeps samples whose locating/grounding recalls of phrases extracted from text in the images are within a specified range. #139
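
The range check behind image_text_similarity_filter can be illustrated as follows, assuming the image and text features have already been extracted. The feature vectors here are toy values rather than CLIP outputs, and the names `keep_sample`, `min_score`, and `max_score` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keep_sample(image_feat, text_feat, min_score=0.1, max_score=1.0):
    """Keep the sample only if its similarity falls within [min_score, max_score]."""
    return min_score <= cosine_similarity(image_feat, text_feat) <= max_score


# Toy feature vectors standing in for model embeddings.
print(keep_sample([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(keep_sample([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```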

Bugs fixed

  • Pin pandas==2.0.0 and fsspec==2023.3.0 to avoid unexpected errors from third-party dependencies. #38 #42
  • Fix the bug where OPs nlpaug_en_mapper and nlpcda_zh_mapper generated an indefinite number of augmented samples. #76
  • Fix the bug where maximum_line_length_filter might generate stats of inconsistent types (int vs. float), which led to an error when processing datasets. #147
  • Fix the bug of missing attribute dataset_dir when the input dataset path is remote or a mixture of several datasets. #155 #157
  • Fix command-line argument parsing errors in some cases. #108 #165
  • Store simhash value as string type to avoid errors from PyArrow. #168 #170
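
The release notes do not state what caused the PyArrow errors. One plausible cause, assumed here purely for illustration, is that a 64-bit unsigned fingerprint can exceed the signed 64-bit integer range. The sketch below uses a made-up fingerprint value and the standard `struct` module (not PyArrow itself) to show the overflow and the lossless round trip through a string.

```python
import struct

# Hypothetical 64-bit unsigned fingerprint with the top bit set.
simhash_value = 0xF000000000000001


def fits_signed_int64(value):
    """Return True if `value` can be packed as a signed 64-bit integer."""
    try:
        struct.pack("<q", value)
        return True
    except struct.error:
        return False


print(fits_signed_int64(simhash_value))  # too large for signed int64

# Storing the value as a string sidesteps integer-width limits entirely.
as_string = str(simhash_value)
restored = int(as_string)
print(restored == simhash_value)
```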

Others

  • Dependency importing optimization: some dependencies are now required and imported only when they are actually used. #35 #82
  • Release demos and datasets on HuggingFace, and release models trained with our refined datasets on both ModelScope and HuggingFace. #42 #54
  • Optimize the cache directory selection logic. #43
  • Support limiting the number of samples when mixing datasets. #86
  • Avoid extra unnecessary model preparation when enabling tokenization in some OPs. #99
  • OP language_id_score_filter supports keeping samples in multiple languages now. #125 #151
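
The on-demand importing mentioned above can be approximated with a small lazy-loading wrapper. This is a generic pattern, not Data-Juicer's actual mechanism; the class name `LazyModule` is hypothetical, and the standard `json` module stands in for a heavy optional dependency.

```python
import importlib


class LazyModule:
    """Defer importing a module until one of its attributes is first accessed."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Only called for attributes not found on the wrapper itself.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


# `json` stands in for a heavy optional dependency.
lazy_json = LazyModule("json")
print(lazy_json._module is None)    # nothing imported by the wrapper yet
print(lazy_json.dumps({"a": 1}))    # first use triggers the import
```

With this pattern, a missing optional package raises `ImportError` only when the OP that needs it is actually used, rather than at program start.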

Acknowledgement

Here we thank public contributors for their PRs to make Data-Juicer better!