fix according to yilun's comments
yxdyc committed Nov 8, 2023
1 parent c96b7cf commit 6293d8d
Showing 4 changed files with 22 additions and 77 deletions.
31 changes: 13 additions & 18 deletions data_juicer/config/config.py
@@ -436,12 +436,7 @@ def merge_config(ori_cfg, new_cfg: Dict):
     try:
         ori_specified_op_names = set()
         ori_specified_op_idx = {}  # {op_name: op_order}
-        # format of ori_cfg.process
-        # ori_cfg.process[i] = {
-        # op_in_process_name:
-        # None if internal_op_para is None else
-        # namespace_to_dict(internal_op_para)
-        # }
+
         for op_order, op_in_process in enumerate(ori_cfg.process):
             op_name = list(op_in_process.keys())[0]
             ori_specified_op_names.add(op_name)
@@ -450,13 +445,13 @@ def merge_config(ori_cfg, new_cfg: Dict):
         for new_k, new_v in new_cfg.items():
             # merge parameters other than `cfg.process` and DJ-OPs
             if new_k in ori_cfg and new_k != 'process' and '.' not in new_k:
-                print(
-                    '=' * 15, f'\nBefore merging, the cfg item is: '
-                    f'{new_k}: {ori_cfg[new_k]}')
+                logger.info('=' * 15)
+                logger.info(f'Before merging, the cfg item is: '
+                            f'{new_k}: {ori_cfg[new_k]}')
                 ori_cfg[new_k] = new_v
-                print(
-                    f'After merging, the cfg item is: '
-                    f'{new_k}: {new_v}\n', '=' * 15, '\n')
+                logger.info(f'After merging, the cfg item is: '
+                            f'{new_k}: {new_v}')
+                logger.info('=' * 15)
             else:
                 # merge parameters of DJ-OPs into cfg.process
                 # for nested style, e.g., `remove_table_text_mapper.min_col: 2`
@@ -466,13 +461,13 @@ def merge_config(ori_cfg, new_cfg: Dict):
                     op_name, para_name = key_as_groups[0], key_as_groups[1]
                     op_order = ori_specified_op_idx[op_name]
                     ori_cfg_val = ori_cfg.process[op_order][op_name][para_name]
-                    print(
-                        '=' * 15, f'\nBefore merging, the cfg item is: '
-                        f'{new_k}: {ori_cfg_val}')
+                    logger.info('=' * 15)
+                    logger.info(f'Before merging, the cfg item is: '
+                                f'{new_k}: {ori_cfg_val}')
                     ori_cfg.process[op_order][op_name][para_name] = new_v
-                    print(
-                        f'After merging, the cfg item is: '
-                        f'{new_k}: {new_v}\n', '=' * 15, '\n')
+                    logger.info(f'After merging, the cfg item is: '
+                                f'{new_k}: {new_v}')
+                    logger.info('=' * 15)

         ori_cfg = init_setup_from_cfg(ori_cfg)

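As a reading aid for the `merge_config` hunks above, here is a minimal, self-contained sketch of the same merge idea: top-level keys are overwritten directly, while dotted keys such as `remove_table_text_mapper.min_col` are routed into the matching entry of `process`. This is not the Data-Juicer implementation. The function name `merge_overrides`, the plain-dict config, and the standard-library `logging` logger are assumptions made for illustration (the diff does not show where the project's `logger` comes from), and the sketch assumes every op's parameters are a dict rather than `None`.

```python
import logging

# Assumption: stdlib logging stands in for the project's own `logger`.
logger = logging.getLogger("merge_config_sketch")


def merge_overrides(cfg: dict, overrides: dict) -> dict:
    """Merge flat and dotted `op_name.param` overrides into a config dict."""
    # Map each op name to its position in cfg["process"].
    op_index = {next(iter(op)): i for i, op in enumerate(cfg["process"])}

    for key, value in overrides.items():
        if key in cfg and key != "process" and "." not in key:
            # Plain top-level item: overwrite in place.
            logger.info("Before merging, the cfg item is: %s: %s", key, cfg[key])
            cfg[key] = value
            logger.info("After merging, the cfg item is: %s: %s", key, value)
            continue

        # Nested style, e.g. `remove_table_text_mapper.min_col: 2`.
        op_name, _, para_name = key.partition(".")
        if para_name and op_name in op_index:
            op_args = cfg["process"][op_index[op_name]][op_name]
            logger.info("Before merging, the cfg item is: %s: %s",
                        key, op_args[para_name])
            op_args[para_name] = value
            logger.info("After merging, the cfg item is: %s: %s", key, value)
    return cfg


# Hypothetical config and overrides, for illustration only.
cfg = {
    "np": 4,
    "process": [{"remove_table_text_mapper": {"min_col": 3, "max_col": 20}}],
}
merged = merge_overrides(cfg, {"np": 8, "remove_table_text_mapper.min_col": 2})
```

Routing the messages through a logger rather than `print`, as the commit does, lets callers control verbosity and destination without touching the merge code.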
55 changes: 3 additions & 52 deletions tools/hpo/README.md
@@ -6,15 +6,15 @@ We incorporate an automated HPO tool, WandB [Sweep](https://docs.wandb.ai/guides
 With this tool, users can investigate correlations and importance scores of
 specific hyper-parameters of data recipes from the HPO view.

-*Note*: this is an experimental feature. Auto-HPO for data recipes still has
+**Note**: this is an experimental feature. Auto-HPO for data recipes still has
 a large room to explore. Feel free to provide more suggestions, discussion,
 and contribution via new PRs!


 ## Prerequisite
 You need to install data-juicer first.
 Besides, the tool leverages WandB, install it via `pip install wandb`.
-Before using this tool, you need to run `
+Before using this tool, you need to run
 ```wandb login``` and enter your WandB
 API key.
 If you have your own instance of WandB (e.g., [locally-hosted machine](https://docs.wandb.ai/guides/hosting/)), run the following script:
@@ -47,54 +47,5 @@ After running it, you will get the result similar to: ![img](https://img.alicdn.
 You can implement your own HPO objective in `get_hpo_objective` function, e.g., linking the data
 recipes to
 - model_loss (by replacing the quality scorer into a training procedure),
-- downstream_task (by eplacing the quality scorer into a training and an# Hyper-parameter Optimization for Data Recipe
-
-## Auto-HPO
-
-We incorporate an automated HPO tool, WandB [Sweep](https://docs.wandb.ai/guides/sweeps), into Data-Juicer to streamline the finding of good data processing hyper-parameters.
-With this tool, users can investigate correlations and importance scores of
-specific hyper-parameters of data recipes from the HPO view.
-
-*Note*: this is an experimental feature. Auto-HPO for data recipes still has
-a large room to explore. Feel free to provide more suggestions, discussions, and contributions via new PRs!
-
-
-## Prerequisite
-You need to install data-juicer first.
-Besides, the tool leverages WandB, install it via `pip install wandb`.
-Before using this tool, you need to run `
-```wandb login``` and enter your WandB
-API key.
-If you have your own instance of WandB (e.g., [locally-hosted machine](https://docs.wandb.ai/guides/hosting/)), run the following script:
-
-```shell
-wandb login --host <URL of your wandb instance>
-# enter your api key
-```
-
-
-
-## Usage and Customization
-
-Given a data recipe, characterized by a specified configuration file
-`<data-process-cfg-file-path>`, you can use `execute_hpo.py` to search the
-hyper-parameter space defined by `<hpo-cfg-file-path>`.
-```shell
-# cd tools/hpo
-python execute_hpo.py --config <data-process-cfg-file-path> --hpo_config <hpo-cfg-file-path>
-
-# e.g.,
-python execute_hpo.py --config configs/process.yaml --hpo_config configs/quality_score_hpo.yaml
-```
-
-We provide an illustrative objective "quality_score" in `hpo/objects.py`,
-which uses quality scorer to measure the processed data, and links the average scores to hyper-parameters of data recipes.
-After running it, you will get the result similar to: ![img](https://img.alicdn.com/imgextra/i2/O1CN017fT4Al1bVldeuCmiI_!!6000000003471-2-tps-2506-1710.png)
-
-
-You can implement your own HPO objective in `get_hpo_objective` function, e.g., linking the data
-recipes to
-- model_loss (by replacing the quality scorer with a training procedure),
-- downstream_task (by replacing the quality scorer with training and
-evaluation procedures), or
+- downstream_task (by replacing the quality scorer with training and evaluation procedures), or
 - some synergy measures that combine metrics you are interested in, such that the trade-offs from different views can be explored.
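
The last bullet in the README hunk above (a synergy measure that combines several metrics) could look roughly like the sketch below. Everything named here is hypothetical: `get_objective_sketch`, `mean_length`, `distinct_ratio`, the equal weights, and the list-of-dicts sample format are invented for illustration, and the real signature of `get_hpo_objective` in `hpo/objects.py` is not shown in this diff.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]
Objective = Callable[[List[Sample]], float]


def mean_length(samples: List[Sample]) -> float:
    """Hypothetical metric: average text length in characters."""
    return sum(len(s["text"]) for s in samples) / len(samples)


def distinct_ratio(samples: List[Sample]) -> float:
    """Hypothetical metric: fraction of distinct texts in the dataset."""
    texts = [s["text"] for s in samples]
    return len(set(texts)) / len(texts)


def make_synergy(metrics: Dict[str, Objective],
                 weights: Dict[str, float]) -> Objective:
    """Combine several metrics into one scalar via a weighted sum."""
    def synergy(samples: List[Sample]) -> float:
        return sum(weights[name] * metric(samples)
                   for name, metric in metrics.items())
    return synergy


def get_objective_sketch(name: str) -> Objective:
    """Hypothetical name-based lookup of an objective function."""
    registry: Dict[str, Objective] = {
        "mean_length": mean_length,
        "distinct_ratio": distinct_ratio,
        "synergy": make_synergy(
            {"mean_length": mean_length, "distinct_ratio": distinct_ratio},
            {"mean_length": 0.5, "distinct_ratio": 0.5},
        ),
    }
    return registry[name]


# Toy processed dataset, for illustration only.
samples = [{"text": "aa"}, {"text": "aa"}, {"text": "bbbb"}]
score = get_objective_sketch("synergy")(samples)
```

A weighted sum is only one way to combine metrics. It makes the trade-off explicit through the weights, but it also assumes the metrics are on comparable scales, which would need normalization in practice.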
4 changes: 2 additions & 2 deletions tools/hpo/README_ZH.md
@@ -6,7 +6,7 @@
 Data-Juicer 中,以简化改良数据处理超参数的过程。
 使用此工具,用户可以研究探索 *数据配方的特定超参数**指定目标度量(如数据质量分、模型loss等)* 之间的 相关性和重要性得分

-*注意*:这是一个实验性功能。 用于数据配方的 Auto-HPO 仍然有
+**注意**:这是一个实验性功能。 用于数据配方的 Auto-HPO 仍然有
 一个极大的探索空间,暂无标准做法。 欢迎大家提出更多的建议、讨论、
 并通过新的 PR 做出贡献!

@@ -39,7 +39,7 @@ python execute_hpo.py --config configs/process.yaml --hpo_config configs/quality
 ```

 我们在`hpo/objects.py`中提供了一个示意性的搜索目标 `quality_score`
-它使用质量评分器来度量处理后的数据,并将平均质分数链接到数据配方的超参数
+它使用质量评分器来度量处理后的数据,并将平均质量分数链接到数据配方的超参数
 运行后,你会得到类似如下的结果:![img](https://img.alicdn.com/imgextra/i2/O1CN017fT4Al1bVldeuCmiI_!!6000000003471-2-tps-2506-1710.png)


9 changes: 4 additions & 5 deletions tools/hpo/configs/quality_score_hpo.yaml
@@ -22,8 +22,7 @@ parameters:
     min: 256
     max: 8192

-
-#early_terminate:
-# type: hyperband
-# max_iter: 27
-# s: 2
+early_terminate:
+  type: hyperband
+  max_iter: 27
+  s: 2
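
To make the newly enabled `early_terminate` block concrete: assuming the bracket rule described in the W&B Sweeps documentation, when `max_iter` and `s` are given, Hyperband checks runs at `max_iter / eta**k` iterations for `k = 1..s`. The helper below is a hypothetical illustration of that arithmetic only. It is not a W&B API, `eta=3` is an assumed default, and the integer division is a simplification.

```python
from typing import List


def hyperband_brackets(max_iter: int, s: int, eta: int = 3) -> List[int]:
    """Iteration counts at which runs are checked: max_iter / eta**k, k = 1..s."""
    return [max_iter // eta**k for k in range(1, s + 1)]


# Values taken from the `early_terminate` block in the hunk above.
brackets = hyperband_brackets(max_iter=27, s=2)
print(brackets)  # [9, 3]
```

Under these assumptions, a run that is underperforming at iteration 3 or 9 may be stopped before reaching the 27-iteration budget.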
