Doc/video recipe example (#308)
* + add a recipe example for video processing
+ add more details in docs for Docker image

* + add docs about this example recipe
HYLcool authored Apr 30, 2024
1 parent f142c2e commit 449475b
Showing 6 changed files with 89 additions and 0 deletions.
3 changes: 3 additions & 0 deletions Dockerfile
@@ -16,6 +16,9 @@ ENV JAVA_HOME=/opt/jdk

WORKDIR /data-juicer

# install requirements that need to be installed from source
RUN pip install git+https://github.com/xinyu1205/recognize-anything.git --default-timeout 1000

# install requirements first to better reuse installed library cache
COPY environments/ environments/
RUN cat environments/* | xargs pip install --default-timeout 1000
2 changes: 2 additions & 0 deletions README.md
@@ -214,6 +214,8 @@ pip install py-data-juicer
```shell
docker build -t datajuicer/data-juicer:<version_tag> .
```

- The format of `<version_tag>` is like `v0.2.0`, the same as the release version tag (see the build sketch below).
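
For example, a concrete build might look like the following (a minimal sketch; the tag is illustrative):

```shell
# build the image with a concrete release tag, e.g. v0.2.0
docker build -t datajuicer/data-juicer:v0.2.0 .

# confirm the image was built
docker images datajuicer/data-juicer
```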

### Installation check

2 changes: 2 additions & 0 deletions README_ZH.md
@@ -193,6 +193,8 @@ pip install py-data-juicer
```shell
docker build -t datajuicer/data-juicer:<version_tag> .
```

- The format of `<version_tag>` is like `v0.2.0`, the same as the release version tag (see the run sketch below).
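
Once built, a container can be started from the image, for example (a minimal sketch; the host data path is illustrative, mounted under the image's `/data-juicer` working directory):

```shell
# start an interactive container, mounting a local data directory
docker run -it --rm -v /path/to/your/data:/data-juicer/data datajuicer/data-juicer:v0.2.0 /bin/bash
```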

### Installation check

8 changes: 8 additions & 0 deletions configs/data_juicer_recipes/README.md
@@ -49,3 +49,11 @@ We use a simple 3-σ rule to set the hyperparameters for ops in each recipe.
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |

## For Video Datasets

We provide an example recipe for video dataset processing, [general-video-refine-example.yaml](general-video-refine-example.yaml), to help users make better use of video-related OPs. It applies three types of OPs:
- Text-Only: improve dataset quality according to the video captions.
- Video-Only: improve dataset quality according to the video features.
- Text-Video: improve dataset quality according to the alignment between text and videos.

Users can start processing their own video datasets based on this recipe, e.g. with the launch sketch below.
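
A minimal launch sketch (the `tools/process_data.py` entry point follows Data-Juicer's documented usage; set `dataset_path` and `export_path` in the config to your own paths first):

```shell
# run the whole video-refine recipe over your video-text dataset
python tools/process_data.py --config configs/data_juicer_recipes/general-video-refine-example.yaml
```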
9 changes: 9 additions & 0 deletions configs/data_juicer_recipes/README_ZH.md
@@ -49,3 +49,12 @@
|---------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |

## For Video Datasets

We provide an example recipe for video dataset processing to help users make better use of video-related OPs: [general-video-refine-example.yaml](general-video-refine-example.yaml). It applies three types of OPs:
- Text-Only: improve dataset quality according to the video captions.
- Video-Only: improve dataset quality according to the video properties.
- Text-Video: improve dataset quality according to the alignment between text and videos.

Users can start their own video dataset processing pipelines based on this recipe, e.g. with the CLI sketch below.
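The same recipe can also be launched through the `dj-process` command-line entry point (assuming a Data-Juicer installation that provides it):

```shell
# equivalent launch via the installed CLI entry point
dj-process --config configs/data_juicer_recipes/general-video-refine-example.yaml
```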
65 changes: 65 additions & 0 deletions configs/data_juicer_recipes/general-video-refine-example.yaml
@@ -0,0 +1,65 @@
# Process config example including:
# - all global arguments
# - all ops and their arguments

# global parameters
project_name: 'all' # project name to distinguish your configs
dataset_path: '/path/to/a/video-text/dataset.jsonl'
# accepted format: 'weight1(optional) dataset1-path weight2(optional) dataset2-path'
export_path: '/path/to/store/refined/dataset.jsonl'
np: 48 # number of subprocesses to process your dataset
# Note: currently, we support specifying only ONE key for each op; for cases requiring multiple keys, users can specify the op multiple times. Only the first key of `text_keys` will be used when multiple keys are set.
open_tracer: true # whether to enable the tracer to trace changes during processing. Processing might take more time when the tracer is enabled

# for multimodal data processing
video_key: 'videos' # key name of field to store the list of sample video paths.
video_special_token: '<__dj__video>' # the special token that represents a video in the text. By default, it's "<__dj__video>". You can specify your own special token according to your input dataset.

eoc_special_token: '<|__dj__eoc|>' # the special token that represents the end of a chunk in the text. By default, it's "<|__dj__eoc|>". You can specify your own special token according to your input dataset.
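# A sample in the input jsonl is expected to look roughly like the line below
# (an illustrative example following the multimodal convention sketched by the
# keys above; actual field names are controlled by video_key / text_keys):
# {"videos": ["videos/cooking.mp4"], "text": "<__dj__video> a man cooks in a kitchen <|__dj__eoc|>"}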

# process schedule: a list of several process operators with their arguments
# hyperparameters are set according to the 3-sigma stats on the MSR-VTT dataset
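# (3-sigma rule sketch: assuming each stat is roughly normally distributed on
#  the reference dataset, keep-thresholds are set near mean ± 3 * std, so only
#  extreme outliers are filtered out; an interpretation of the recipes README.)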
process:
  - language_id_score_filter:  # keep text in the target language with a language score larger than a specific min value
      lang: en  # keep text in this language
      min_score: 0.26311219  # the min language score to keep samples
  - perplexity_filter:  # filter text with a perplexity score above a specific max value
      lang: en  # compute perplexity in this language
      max_ppl: 7376.81378  # the max perplexity score to keep samples
  - video_aesthetics_filter:  # filter samples according to the aesthetics scores of frame images extracted from videos
      hf_scorer_model: shunk031/aesthetics-predictor-v2-sac-logos-ava1-l14-linearMSE  # Huggingface model name for the aesthetics predictor
      min_score: 0.31767486  # the min aesthetics score of the filter range
      max_score: 1.0  # the max aesthetics score of the filter range
      frame_sampling_method: 'uniform'  # sampling method of extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes and the latter extracts a specified number of frames uniformly from the video. Default: "uniform" with frame_num=3, considering that the number of keyframes can be large while their differences in aesthetics are usually small.
      frame_num: 3  # the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
      reduce_mode: avg  # reduce mode over all frames extracted from videos, must be one of ['avg', 'max', 'min']
      any_or_all: any  # keep this sample when any/all videos meet the filter condition
  - video_frames_text_similarity_filter:  # keep samples whose similarities between sampled video frame images and text are within a specific range
      hf_clip: openai/clip-vit-base-patch32  # CLIP model name on Huggingface to compute the similarity between frame images and text. Note that the choice is language-related; for Chinese datasets, for example, ChineseCLIP might be a better choice.
      min_score: 0.16571071  # the min similarity to keep samples
      max_score: 1.0  # the max similarity to keep samples
      frame_sampling_method: all_keyframes  # sampling method of extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".
      frame_num: 3  # the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
      horizontal_flip: false  # flip frame images horizontally (left to right)
      vertical_flip: false  # flip frame images vertically (top to bottom)
      reduce_mode: avg  # reduce mode when one text corresponds to multiple videos in a chunk, must be one of ['avg', 'max', 'min']
      any_or_all: any  # keep this sample when any/all videos meet the filter condition
  - video_motion_score_filter:  # keep samples with video motion scores within a specific range
      min_score: 0.25  # the min motion score to keep samples
      max_score: 10000.0  # the max motion score to keep samples
      sampling_fps: 2  # the sampling rate in frames per second at which to compute optical flow
      any_or_all: any  # keep this sample when any/all videos meet the filter condition
  - video_nsfw_filter:  # filter samples according to the NSFW scores of the videos in them
      hf_nsfw_model: Falconsai/nsfw_image_detection  # Huggingface model name for NSFW classification
      score_threshold: 0.34847191  # the NSFW score threshold for samples, ranging from 0 to 1
      frame_sampling_method: all_keyframes  # sampling method of extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".
      frame_num: 3  # the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
      reduce_mode: avg  # reduce mode over multiple sampled video frames to compute the NSFW scores of videos, must be one of ['avg', 'max', 'min']
      any_or_all: any  # keep this sample when any/all videos meet the filter condition
  - video_watermark_filter:  # filter samples according to the predicted watermark probabilities of the videos in them
      hf_watermark_model: amrul-hzz/watermark_detector  # Huggingface model name for watermark classification
      prob_threshold: 0.96510297  # the predicted watermark probability threshold for samples, ranging from 0 to 1
      frame_sampling_method: all_keyframes  # sampling method of extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".
      frame_num: 3  # the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
      reduce_mode: avg  # reduce mode over multiple sampled video frames to compute the final predicted watermark probabilities of videos, must be one of ['avg', 'max', 'min']
      any_or_all: any  # keep this sample when any/all videos meet the filter condition
