Doc/video recipe example (#308)
* + add a recipe example for video processing
+ add more details in docs for Docker image

* + add docs about this example recipe
HYLcool authored Apr 30, 2024
1 parent f142c2e commit 449475b
Showing 6 changed files with 89 additions and 0 deletions.
3 changes: 3 additions & 0 deletions Dockerfile
@@ -16,6 +16,9 @@ ENV JAVA_HOME=/opt/jdk

WORKDIR /data-juicer

# install requirements that need to be installed from source
RUN pip install git+https://github.com/xinyu1205/recognize-anything.git --default-timeout 1000

# install requirements first to better reuse installed library cache
COPY environments/ environments/
RUN cat environments/* | xargs pip install --default-timeout 1000
2 changes: 2 additions & 0 deletions README.md
@@ -214,6 +214,8 @@ pip install py-data-juicer
```shell
docker build -t datajuicer/data-juicer:<version_tag> .
```

- The format of `<version_tag>` is like `v0.2.0`, the same as the release version tag (see the build sketch below).
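
For example, a concrete build might look like the following (a minimal sketch; the tag is illustrative):

```shell
# build the image with a concrete release tag, e.g. v0.2.0
docker build -t datajuicer/data-juicer:v0.2.0 .

# confirm the image was built
docker images datajuicer/data-juicer
```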

### Installation check

2 changes: 2 additions & 0 deletions README_ZH.md
@@ -193,6 +193,8 @@ pip install py-data-juicer
```shell
docker build -t datajuicer/data-juicer:<version_tag> .
```

- The format of `<version_tag>` is like `v0.2.0`, the same as the release version tag (see the run sketch below).
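
Once built, a container can be started from the image, for example (a minimal sketch; the host data path is illustrative, mounted under the image's `/data-juicer` working directory):

```shell
# start an interactive container, mounting a local data directory
docker run -it --rm -v /path/to/your/data:/data-juicer/data datajuicer/data-juicer:v0.2.0 /bin/bash
```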

### Installation check

8 changes: 8 additions & 0 deletions configs/data_juicer_recipes/README.md
@@ -49,3 +49,11 @@ We use a simple 3-σ rule to set the hyperparameters for ops in each recipe.
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |

## For Video Datasets

We provide an example recipe for video dataset processing, [general-video-refine-example.yaml](general-video-refine-example.yaml), to help users make better use of video-related OPs. It applies three types of OPs:
- Text-Only: improve dataset quality according to the video captions.
- Video-Only: improve dataset quality according to the video features.
- Text-Video: improve dataset quality according to the alignment between text and videos.

Users can start processing their own video datasets based on this recipe, e.g. with the launch sketch below.
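
A minimal launch sketch (the `tools/process_data.py` entry point follows Data-Juicer's documented usage; set `dataset_path` and `export_path` in the config to your own paths first):

```shell
# run the whole video-refine recipe over your video-text dataset
python tools/process_data.py --config configs/data_juicer_recipes/general-video-refine-example.yaml
```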
9 changes: 9 additions & 0 deletions configs/data_juicer_recipes/README_ZH.md
@@ -49,3 +49,12 @@
|---------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |

## For Video Datasets

We provide an example recipe for video dataset processing to help users make better use of video-related OPs: [general-video-refine-example.yaml](general-video-refine-example.yaml). It applies three types of OPs:
- Text-Only: improve dataset quality according to the video captions.
- Video-Only: improve dataset quality according to the video properties.
- Text-Video: improve dataset quality according to the alignment between text and videos.

Users can start their own video dataset processing pipelines based on this recipe, e.g. with the CLI sketch below.
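The same recipe can also be launched through the `dj-process` command-line entry point (assuming a Data-Juicer installation that provides it):

```shell
# equivalent launch via the installed CLI entry point
dj-process --config configs/data_juicer_recipes/general-video-refine-example.yaml
```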
65 changes: 65 additions & 0 deletions configs/data_juicer_recipes/general-video-refine-example.yaml
@@ -0,0 +1,65 @@
# Process config example including:
# - all global arguments
# - all ops and their arguments

# global parameters
project_name: 'all' # project name to distinguish your configs
dataset_path: '/path/to/a/video-text/dataset.jsonl'
# accepted format: 'weight1(optional) dataset1-path weight2(optional) dataset2-path'
export_path: '/path/to/store/refined/dataset.jsonl'
np: 48 # number of subprocesses to process your dataset
# Note: currently, we support specifying only ONE key for each op; for cases requiring multiple keys, users can specify the op multiple times. Only the first key of `text_keys` will be used when multiple keys are set.
open_tracer: true # whether to enable the tracer to trace changes during processing. Processing might take more time when the tracer is enabled

# for multimodal data processing
video_key: 'videos' # key name of field to store the list of sample video paths.
video_special_token: '<__dj__video>' # the special token that represents a video in the text. By default, it's "<__dj__video>". You can specify your own special token according to your input dataset.

eoc_special_token: '<|__dj__eoc|>' # the special token that represents the end of a chunk in the text. By default, it's "<|__dj__eoc|>". You can specify your own special token according to your input dataset.
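# A sample in the input jsonl is expected to look roughly like the line below
# (an illustrative example following the multimodal convention sketched by the
# keys above; actual field names are controlled by video_key / text_keys):
# {"videos": ["videos/cooking.mp4"], "text": "<__dj__video> a man cooks in a kitchen <|__dj__eoc|>"}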

# process schedule: a list of several process operators with their arguments
# hyperparameters are set according to the 3-sigma stats on the MSR-VTT dataset
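# (3-sigma rule sketch: assuming each stat is roughly normally distributed on
#  the reference dataset, keep-thresholds are set near mean ± 3 * std, so only
#  extreme outliers are filtered out; an interpretation of the recipes README.)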
process:
  - language_id_score_filter:  # keep text in the target language with a language score larger than a specific min value
      lang: en  # keep text in this language
      min_score: 0.26311219  # the min language score to keep samples
  - perplexity_filter:  # filter text with a perplexity score above a specific max value
      lang: en  # compute perplexity in this language
      max_ppl: 7376.81378  # the max perplexity score to keep samples
  - video_aesthetics_filter:  # filter samples according to the aesthetics scores of frame images extracted from videos
      hf_scorer_model: shunk031/aesthetics-predictor-v2-sac-logos-ava1-l14-linearMSE  # Huggingface model name for the aesthetics predictor
      min_score: 0.31767486  # the min aesthetics score of the filter range
      max_score: 1.0  # the max aesthetics score of the filter range
      frame_sampling_method: 'uniform'  # sampling method of extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes and the latter extracts a specified number of frames uniformly from the video. Default: "uniform" with frame_num=3, considering that the number of keyframes can be large while their differences in aesthetics are usually small.
      frame_num: 3  # the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
      reduce_mode: avg  # reduce mode over all frames extracted from videos, must be one of ['avg', 'max', 'min']
      any_or_all: any  # keep this sample when any/all videos meet the filter condition
  - video_frames_text_similarity_filter:  # keep samples whose similarities between sampled video frame images and text are within a specific range
      hf_clip: openai/clip-vit-base-patch32  # CLIP model name on Huggingface to compute the similarity between frame images and text. Note that the choice is language-related; for Chinese datasets, for example, ChineseCLIP might be a better choice.
      min_score: 0.16571071  # the min similarity to keep samples
      max_score: 1.0  # the max similarity to keep samples
      frame_sampling_method: all_keyframes  # sampling method of extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".
      frame_num: 3  # the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
      horizontal_flip: false  # flip frame images horizontally (left to right)
      vertical_flip: false  # flip frame images vertically (top to bottom)
      reduce_mode: avg  # reduce mode when one text corresponds to multiple videos in a chunk, must be one of ['avg', 'max', 'min']
      any_or_all: any  # keep this sample when any/all videos meet the filter condition
  - video_motion_score_filter:  # keep samples with video motion scores within a specific range
      min_score: 0.25  # the min motion score to keep samples
      max_score: 10000.0  # the max motion score to keep samples
      sampling_fps: 2  # the sampling rate in frames per second at which to compute optical flow
      any_or_all: any  # keep this sample when any/all videos meet the filter condition
  - video_nsfw_filter:  # filter samples according to the NSFW scores of the videos in them
      hf_nsfw_model: Falconsai/nsfw_image_detection  # Huggingface model name for NSFW classification
      score_threshold: 0.34847191  # the NSFW score threshold for samples, ranging from 0 to 1
      frame_sampling_method: all_keyframes  # sampling method of extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".
      frame_num: 3  # the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
      reduce_mode: avg  # reduce mode over multiple sampled video frames to compute the NSFW scores of videos, must be one of ['avg', 'max', 'min']
      any_or_all: any  # keep this sample when any/all videos meet the filter condition
  - video_watermark_filter:  # filter samples according to the predicted watermark probabilities of the videos in them
      hf_watermark_model: amrul-hzz/watermark_detector  # Huggingface model name for watermark classification
      prob_threshold: 0.96510297  # the predicted watermark probability threshold for samples, ranging from 0 to 1
      frame_sampling_method: all_keyframes  # sampling method of extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".
      frame_num: 3  # the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
      reduce_mode: avg  # reduce mode over multiple sampled video frames to compute the final predicted watermark probabilities of videos, must be one of ['avg', 'max', 'min']
      any_or_all: any  # keep this sample when any/all videos meet the filter condition
