Commit f6838e9: vbench-test done

BeachWang committed May 7, 2024
2 parents 17d325a + 453db43
Showing 29 changed files with 394 additions and 126 deletions.
3 changes: 3 additions & 0 deletions Dockerfile
@@ -16,6 +16,9 @@ ENV JAVA_HOME=/opt/jdk
 
 WORKDIR /data-juicer
 
+# install requirements which need to be installed from source
+RUN pip install git+https://github.com/xinyu1205/recognize-anything.git --default-timeout 1000
+
 # install requirements first to better reuse installed library cache
 COPY environments/ environments/
 RUN cat environments/* | xargs pip install --default-timeout 1000
18 changes: 10 additions & 8 deletions README.md
@@ -175,15 +175,15 @@ pip install -v -e .[tools] # install a subset of tools dependencies
 
 The dependency options are listed below:
 
-| Tag          | Description |
-|--------------|----------------------------------------------------------------------------------------------|
+| Tag              | Description |
+|------------------|----------------------------------------------------------------------------------------------|
 | `.` or `.[mini]` | Install minimal dependencies for basic Data-Juicer. |
-| `.[all]` | Install all optional dependencies (including minimal dependencies and all of the following). |
-| `.[sci]` | Install all dependencies for all OPs. |
-| `.[dist]` | Install dependencies for distributed data processing. (Experimental) |
-| `.[dev]` | Install dependencies for developing the package as contributors. |
-| `.[tools]` | Install dependencies for dedicated tools, such as quality classifiers. |
-| `.[sandbox]` | Install dependencies for sandbox, such as VBench for video evaluation. |
+| `.[all]` | Install all optional dependencies (including minimal dependencies and all of the following). |
+| `.[sci]` | Install all dependencies for all OPs. |
+| `.[sandbox]` | Install all dependencies for sandbox. |
+| `.[dist]` | Install dependencies for distributed data processing. (Experimental) |
+| `.[dev]` | Install dependencies for developing the package as contributors. |
+| `.[tools]` | Install dependencies for dedicated tools, such as quality classifiers. |
 
 ### Using pip
 
@@ -214,6 +214,8 @@ pip install py-data-juicer
 ```shell
 docker build -t datajuicer/data-juicer:<version_tag> .
 ```
+
+- The format of `<version_tag>` is like `v0.2.0`, which is the same as the release version tag.
 
 ### Installation check
 
18 changes: 10 additions & 8 deletions README_ZH.md
@@ -158,15 +158,15 @@ pip install -v -e .[tools] # install dependencies for a subset of tools
 
 The dependency options are listed below:
 
-| Tag          | Description |
-|--------------|------------------------------|
+| Tag              | Description |
+|------------------|------------------------------|
 | `.` or `.[mini]` | Install minimal dependencies for basic Data-Juicer functions. |
-| `.[all]` | Install all optional dependencies (including minimal dependencies and all of the following). |
-| `.[sci]` | Install full dependencies for all OPs. |
-| `.[dist]` | Install dependencies for distributed data processing. (Experimental) |
-| `.[dev]` | Install dependencies for developing Data-Juicer as a contributor. |
-| `.[tools]` | Install dependencies for dedicated tools such as quality classifiers. |
-| `.[sandbox]` | Install dependencies for sandbox experiments, such as VBench for video evaluation. |
+| `.[all]` | Install all optional dependencies (including minimal dependencies and all of the following). |
+| `.[sci]` | Install full dependencies for all OPs. |
+| `.[sandbox]` | Install basic dependencies for the sandbox. |
+| `.[dist]` | Install dependencies for distributed data processing. (Experimental) |
+| `.[dev]` | Install dependencies for developing Data-Juicer as a contributor. |
+| `.[tools]` | Install dependencies for dedicated tools such as quality classifiers. |
 
 ### Install with pip
 
@@ -193,6 +193,8 @@ pip install py-data-juicer
 ```shell
 docker build -t datajuicer/data-juicer:<version_tag> .
 ```
+
+- The format of `<version_tag>` is like `v0.2.0`, which is the same as the release version tag.
 
 ### Installation check
 
8 changes: 8 additions & 0 deletions configs/data_juicer_recipes/README.md
@@ -49,3 +49,11 @@ We use simple 3-σ rule to set the hyperparameters for ops in each recipe.
 |-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
 | LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |
+
+## For Video Dataset
+
+We provide a video dataset processing recipe example in [general-video-refine-example.yaml](general-video-refine-example.yaml) to help users make better use of the video-related OPs. It applies three types of OPs:
+- Text-Only: improve the dataset quality according to the video captions.
+- Video-Only: improve the dataset quality according to the video features.
+- Text-Video: improve the dataset quality according to the alignment between text and videos.
+Users can start processing their video datasets based on this recipe, as shown in the sketch below.
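
As a quick orientation (not part of this diff), a recipe like this is typically driven by Data-Juicer's standard entry point. Below is a minimal launch sketch, assuming the `init_configs`/`Executor` API exposed by `data_juicer.config` and `data_juicer.core`:

```python
# run_video_recipe.py: minimal launch sketch, assuming the standard
# Data-Juicer entry-point API (init_configs + Executor).
from data_juicer.config import init_configs
from data_juicer.core import Executor


def main():
    # parse global args and the op schedule from the recipe YAML
    cfg = init_configs(args=[
        '--config',
        'configs/data_juicer_recipes/general-video-refine-example.yaml',
    ])
    # run every configured op over the dataset and export the result
    executor = Executor(cfg)
    executor.run()


if __name__ == '__main__':
    main()
```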
9 changes: 9 additions & 0 deletions configs/data_juicer_recipes/README_ZH.md
@@ -49,3 +49,12 @@
 |---------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
 | LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |
+
+## For Video Datasets
+
+We provide a video dataset processing recipe example to help users make better use of the video-related OPs: [general-video-refine-example.yaml](general-video-refine-example.yaml). It applies three types of OPs:
+- Text-Only: improve the dataset quality according to the video captions
+- Video-Only: improve the dataset quality according to the video properties
+- Text-Video: improve the dataset quality according to the alignment between text and videos
+Users can start their video dataset processing pipelines based on this recipe.
65 changes: 65 additions & 0 deletions configs/data_juicer_recipes/general-video-refine-example.yaml
@@ -0,0 +1,65 @@
# Process config example for a general video dataset, including:
#   - the global arguments
#   - the selected ops and their arguments

# global parameters
project_name: 'all' # project name to distinguish your configs
dataset_path: '/path/to/a/video-text/dataset.jsonl'
# accepted format: 'weight1(optional) dataset1-path weight2(optional) dataset2-path'
export_path: '/path/to/store/refined/dataset.jsonl'
np: 48 # number of subprocesses to process your dataset
# Note: currently, we support specifying only ONE key for each op. For cases requiring multiple keys, users can specify the op multiple times. Only the first key of `text_keys` is used when multiple keys are set.
open_tracer: true # whether to open the tracer to trace changes during processing. It might take more time when the tracer is open

# for multimodal data processing
video_key: 'videos' # key name of field to store the list of sample video paths.
video_special_token: '<__dj__video>' # the special token that represents a video in the text. By default, it's "<__dj__video>". You can specify your own special token according to your input dataset.

eoc_special_token: '<|__dj__eoc|>' # the special token that represents the end of a chunk in the text. By default, it's "<|__dj__eoc|>". You can specify your own special token according to your input dataset.

# process schedule: a list of several process operators with their arguments
# hyperparameters are set according to the 3-sigma stats on the MSR-VTT dataset
process:
  - language_id_score_filter:                 # keep text in a specific language with language scores larger than a specific min value
      lang: en                                # keep text in what language
      min_score: 0.26311219                   # the min language score to keep samples
  - perplexity_filter:                        # filter text with perplexity scores out of a specific range
      lang: en                                # compute perplexity in what language
      max_ppl: 7376.81378                     # the max perplexity score to keep samples
  - video_aesthetics_filter:                  # filter samples according to the aesthetics scores of frame images extracted from videos
      hf_scorer_model: shunk031/aesthetics-predictor-v2-sac-logos-ava1-l14-linearMSE  # Huggingface model name for the aesthetics predictor
      min_score: 0.31767486                   # the min aesthetics score of the filter range
      max_score: 1.0                          # the max aesthetics score of the filter range
      frame_sampling_method: 'uniform'        # sampling method for extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes and the latter extracts a specified number of frames uniformly from the video. Default: "uniform" with frame_num=3, considering that the number of keyframes can be large while their differences are usually small in terms of aesthetics.
      frame_num: 3                            # the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
      reduce_mode: avg                        # reduce mode over all frames extracted from videos, must be one of ['avg', 'max', 'min']
      any_or_all: any                         # keep this sample when any/all videos meet the filter condition
  - video_frames_text_similarity_filter:      # keep samples whose similarities between sampled video frame images and text are within a specific range
      hf_clip: openai/clip-vit-base-patch32   # clip model name on huggingface to compute the similarity between frame image and text. It's kind of language-related. For example, for Chinese datasets, ChineseCLIP might be a better choice.
      min_score: 0.16571071                   # the min similarity to keep samples
      max_score: 1.0                          # the max similarity to keep samples
      frame_sampling_method: all_keyframes    # sampling method for extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".
      frame_num: 3                            # the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
      horizontal_flip: false                  # flip frame images horizontally (left to right)
      vertical_flip: false                    # flip frame images vertically (top to bottom)
      reduce_mode: avg                        # reduce mode when one text corresponds to multiple videos in a chunk, must be one of ['avg', 'max', 'min']
      any_or_all: any                         # keep this sample when any/all videos meet the filter condition
  - video_motion_score_filter:                # keep samples with video motion scores within a specific range
      min_score: 0.25                         # the minimum motion score to keep samples
      max_score: 10000.0                      # the maximum motion score to keep samples
      sampling_fps: 2                         # the sampling rate in frames per second for computing optical flow
      any_or_all: any                         # keep this sample when any/all videos meet the filter condition
  - video_nsfw_filter:                        # filter samples according to the nsfw scores of the videos in them
      hf_nsfw_model: Falconsai/nsfw_image_detection  # Huggingface model name for nsfw classification
      score_threshold: 0.34847191             # the nsfw score threshold for samples, ranging from 0 to 1
      frame_sampling_method: all_keyframes    # sampling method for extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".
      frame_num: 3                            # the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
      reduce_mode: avg                        # reduce mode over multiple sampled video frames to compute the nsfw scores of videos, must be one of ['avg', 'max', 'min']
      any_or_all: any                         # keep this sample when any/all videos meet the filter condition
  - video_watermark_filter:                   # filter samples according to the predicted watermark probabilities of the videos in them
      hf_watermark_model: amrul-hzz/watermark_detector  # Huggingface model name for watermark classification
      prob_threshold: 0.96510297              # the predicted watermark probability threshold for samples, ranging from 0 to 1
      frame_sampling_method: all_keyframes    # sampling method for extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".
      frame_num: 3                            # the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
      reduce_mode: avg                        # reduce mode over multiple sampled video frames to compute the final predicted watermark probabilities of videos, must be one of ['avg', 'max', 'min']
      any_or_all: any                         # keep this sample when any/all videos meet the filter condition
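
The thresholds above follow the 3-sigma rule mentioned in the comment: compute a stat's mean and standard deviation on a reference dataset (MSR-VTT here) and keep samples within mean ± 3σ. A hypothetical sketch of that derivation (the stat values below are placeholders, not part of this commit):

```python
# three_sigma.py: hypothetical sketch of deriving filter bounds with
# the 3-sigma rule; `stats` stands for per-sample stat values (e.g.
# language scores) computed on a reference set such as MSR-VTT.
import numpy as np


def three_sigma_bounds(stats):
    """Return (lower, upper) bounds covering mean +/- 3 * std."""
    mean, std = np.mean(stats), np.std(stats)
    return mean - 3 * std, mean + 3 * std


scores = np.random.beta(8, 2, size=10_000)  # placeholder stat values
lower, upper = three_sigma_bounds(scores)
min_score = max(0.0, lower)  # clip into the stat's valid range [0, 1]
print(f'min_score: {min_score:.8f}')
```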
11 changes: 5 additions & 6 deletions configs/demo/sandbox/vbench_eval_config.yaml
@@ -16,12 +16,11 @@ eval_name: <eval_name>
 load_ckpt_from_local: false
 
 # The dimensions considered in this eval.
-# All dimensions include: ['subject_consistency', 'background_consistency',
-# 'temporal_flickering', 'dynamic_degree', 'aesthetic_quality',
-# 'object_class', 'multiple_objects', 'human_action', 'color',
-# 'spatial_relationship', 'scene', 'temporal_style',
-# 'appearance_style', 'overall_consistency']
-# NOTE: the evaluation of motion_smoothness and imaging_quality has bug in the pipy vbench repository.
+# All dimensions include: ['subject_consistency', 'background_consistency', 'temporal_flickering',
+# 'motion_smoothness', 'dynamic_degree', 'aesthetic_quality', 'imaging_quality', 'object_class',
+# 'multiple_objects', 'human_action', 'color', 'spatial_relationship', 'scene', 'temporal_style',
+# 'appearance_style', 'overall_consistency']
+# NOTE: the current version of vbench on PyPI has a bug when evaluating the motion_smoothness and imaging_quality dimensions
 dimension_list:
   - subject_consistency
   - dynamic_degree
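
For context (not part of this diff), a dimension list like this is eventually handed to VBench. A minimal sketch assuming VBench's documented Python API, with all paths and names as placeholders:

```python
# vbench_eval_sketch.py: minimal sketch of evaluating the two dimensions
# kept in this config; the paths and names below are placeholders.
import torch
from vbench import VBench

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
vb = VBench(device, 'VBench_full_info.json', 'evaluation_results/')
vb.evaluate(
    videos_path='sampled_videos/',  # directory of generated videos
    name='vbench_demo_eval',        # tag used for the result files
    dimension_list=['subject_consistency', 'dynamic_degree'],
)
```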
45 changes: 40 additions & 5 deletions data_juicer/config/config.py
@@ -1,11 +1,13 @@
+import copy
 import os
 import shutil
 import time
-from argparse import ArgumentError
+from argparse import ArgumentError, Namespace
 from typing import Dict, List, Tuple, Union
 
 from jsonargparse import (ActionConfigFile, ArgumentParser, dict_to_namespace,
                           namespace_to_dict)
+from jsonargparse.typehints import ActionTypeHint
 from jsonargparse.typing import ClosedUnitInterval, NonNegativeInt, PositiveInt
 from loguru import logger

@@ -370,8 +372,8 @@ def init_setup_from_cfg(cfg):
     2. update cache directory
     3. update checkpoint and `temp_dir` of tempfile
-    :param cfg: a original cfg
-    :param cfg: a updated cfg
+    :param cfg: an original cfg
+    :param cfg: an updated cfg
     """
 
     cfg.export_path = os.path.abspath(cfg.export_path)
@@ -552,16 +554,16 @@ def update_op_process(cfg, parser):
     # e.g.
     # `python demo.py --config demo.yaml
     # --language_id_score_filter.lang en`
+    temp_cfg = cfg
     for i, op_in_process in enumerate(cfg.process):
         op_in_process_name = list(op_in_process.keys())[0]
 
-        temp_cfg = cfg
         if op_in_process_name not in option_in_commands:
 
             # update op params to temp cfg if set
             if op_in_process[op_in_process_name]:
                 temp_cfg = parser.merge_config(
-                    dict_to_namespace(op_in_process), cfg)
+                    dict_to_namespace(op_in_process), temp_cfg)
         else:
 
             # args in the command line override the ones in `cfg.process`
@@ -584,9 +586,42 @@
             None if internal_op_para is None else
             namespace_to_dict(internal_op_para)
         }
+
+    # check the op params via type hint
+    temp_parser = copy.deepcopy(parser)
+    recognized_args = set([
+        action.dest for action in parser._actions
+        if hasattr(action, 'dest') and isinstance(action, ActionTypeHint)
+    ])
+
+    temp_args = namespace_to_arg_list(temp_cfg,
+                                      includes=recognized_args,
+                                      excludes=['config'])
+    temp_args = ['--config', temp_cfg.config[0].absolute] + temp_args
+    temp_parser.parse_args(temp_args)
     return cfg
+
+
+def namespace_to_arg_list(namespace, prefix='', includes=None, excludes=None):
+    arg_list = []
+
+    for key, value in vars(namespace).items():
+
+        if issubclass(type(value), Namespace):
+            nested_args = namespace_to_arg_list(value, f'{prefix}{key}.')
+            arg_list.extend(nested_args)
+        elif value is not None:
+            concat_key = f'{prefix}{key}'
+            if includes is not None and concat_key not in includes:
+                continue
+            if excludes is not None and concat_key in excludes:
+                continue
+            arg_list.append(f'--{concat_key}')
+            arg_list.append(f'{value}')
+
+    return arg_list
 
 
 def config_backup(cfg):
     cfg_path = cfg.config[0].absolute
     work_dir = cfg.work_dir
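
To illustrate the new helper (an illustrative snippet, not part of the diff): `namespace_to_arg_list` flattens a possibly nested namespace into a CLI-style argument list, which is what lets `temp_parser.parse_args` re-validate every recognized op parameter against its type hint:

```python
# illustrative only: how namespace_to_arg_list flattens nested namespaces
from argparse import Namespace

cfg = Namespace(lang='en',
                video_motion_score_filter=Namespace(min_score=0.25,
                                                    sampling_fps=2))
print(namespace_to_arg_list(cfg))
# ['--lang', 'en',
#  '--video_motion_score_filter.min_score', '0.25',
#  '--video_motion_score_filter.sampling_fps', '2']
```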