fix according to yilun's comments
yxdyc committed Nov 8, 2023
1 parent 6293d8d commit 2b9763c
Showing 5 changed files with 16 additions and 4 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -13,4 +13,5 @@ dist
# others
.DS_Store
.idea/
wandb/
__pycache__
5 changes: 5 additions & 0 deletions tools/hpo/README.md
@@ -39,6 +39,11 @@ python execute_hpo.py --config <data-process-cfg-file-path> --hpo_config <hpo-cf
python execute_hpo.py --config configs/process.yaml --hpo_config configs/quality_score_hpo.yaml
```

For the data recipe configuration, i.e., `<data-process-cfg-file-path>`,
see our [guidance](https://github.com/alibaba/data-juicer#build-up-config-files) for more details.
For the HPO configuration, i.e., `<hpo-cfg-file-path>`, refer to the
W&B Sweep [guidance](https://docs.wandb.ai/guides/sweeps/define-sweep-configuration).
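
As a rough illustration of what such an HPO configuration contains, the sketch below writes one out as a Python dict and runs a few sanity checks on it. The `method`, `metric`, and `parameters` keys follow the W&B sweep configuration format, and `sweep_name` and `sweep_max_count` appear in `configs/quality_score_hpo.yaml`. The hyper-parameter name `example_filter.min_len`, its range, and the `check_sweep_configuration` helper are assumptions made for this example and are not part of the repository.

```python
# Hypothetical HPO configuration, written as a Python dict for illustration.
sweep_configuration = {
    "sweep_name": "hpo_for_data-juicer",
    "sweep_max_count": 1000,  # maximal number of trials; None for unlimited
    "method": "bayes",  # W&B search strategy: "grid", "random", or "bayes"
    "metric": {"name": "quality_score", "goal": "maximize"},
    "parameters": {
        # Hypothetical hyper-parameter of a data recipe and its search range.
        "example_filter.min_len": {"min": 10, "max": 1000},
    },
}


def check_sweep_configuration(cfg):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if cfg.get("method") not in ("grid", "random", "bayes"):
        problems.append("method must be grid, random, or bayes")
    metric = cfg.get("metric") or {}
    # Bayesian search needs a metric to optimize.
    if cfg.get("method") == "bayes" and "name" not in metric:
        problems.append("bayes search needs metric.name")
    if metric.get("goal", "minimize") not in ("minimize", "maximize"):
        problems.append("metric.goal must be minimize or maximize")
    if not cfg.get("parameters"):
        problems.append("parameters must list at least one hyper-parameter")
    max_count = cfg.get("sweep_max_count")
    if max_count is not None and (not isinstance(max_count, int) or max_count < 1):
        problems.append("sweep_max_count must be a positive integer or None")
    return problems


print(check_sweep_configuration(sweep_configuration))  # prints []
```

Running a check like this before launching a sweep surfaces typos in the configuration early, rather than after trials have started.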

We provide an illustrative objective `quality_score` in `hpo/objects.py`,
which uses a quality scorer to measure the processed data and links the average score to the hyper-parameters of the data recipe.
After running it, you will get a result similar to: ![img](https://img.alicdn.com/imgextra/i2/O1CN017fT4Al1bVldeuCmiI_!!6000000003471-2-tps-2506-1710.png)
8 changes: 7 additions & 1 deletion tools/hpo/README_ZH.md
@@ -4,7 +4,7 @@

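We integrate the automated HPO (hyper-parameters optimization) tool WandB [Sweep](https://docs.wandb.ai/guides/sweeps) into
Data-Juicer to simplify the process of refining data-processing hyper-parameters.
-With this tool, users can explore the correlation and importance scores between *specific hyper-parameters of data recipes* and *specified target metrics (e.g., data quality score, model loss)*
+With this tool, users can explore the correlation and importance scores between *specific hyper-parameters of data recipes* and *specified target metrics (e.g., data quality score, model loss)*

**Note**: This is an experimental feature. Auto-HPO for data recipes still has
a very large space to explore, and there is no standard practice yet. We welcome more suggestions, discussions,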
@@ -38,6 +38,12 @@ python execute_hpo.py --config <data-process-cfg-file-path> --hpo_config <hpo-cf
python execute_hpo.py --config configs/process.yaml --hpo_config configs/quality_score_hpo.yaml
```

For the data recipe configuration, i.e., `<data-process-cfg-file-path>`,
please see our [guide](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md#%E6%9E%84%E5%BB%BA%E9%85%8D%E7%BD%AE%E6%96%87%E4%BB%B6)
for more details.
For the HPO configuration, i.e., `<hpo-cfg-file-path>`, please refer to the [guide](https://docs.wandb.ai/guides/sweeps/define-sweep-configuration) provided by Sweep.


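We provide an illustrative search objective `quality_score` in `hpo/objects.py`,
which uses a quality scorer to measure the processed data and links the average quality score to the hyper-parameters of the data recipe.
After running it, you will get a result similar to the following: ![img](https://img.alicdn.com/imgextra/i2/O1CN017fT4Al1bVldeuCmiI_!!6000000003471-2-tps-2506-1710.png)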
2 changes: 1 addition & 1 deletion tools/hpo/configs/quality_score_hpo.yaml
@@ -1,6 +1,6 @@

sweep_name: hpo_for_data-juicer
-# sweep_count: 10
+sweep_max_count: 1000 # the maximal number of trials; `None` for unlimited

# hpo configuration from original sweep, see more options and details in
# https://docs.wandb.ai/guides/sweeps/define-sweep-configuration
4 changes: 2 additions & 2 deletions tools/hpo/execute_hpo.py
@@ -42,5 +42,5 @@ def search():

 wandb.agent(sweep_id,
             function=search,
-            count=sweep_configuration['sweep_count']
-            if 'sweep_count' in sweep_configuration else None)
+            count=sweep_configuration['sweep_max_count']
+            if 'sweep_max_count' in sweep_configuration else None)
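
The lookup in this hunk can be sketched in isolation as follows. This is a minimal example, assuming `sweep_configuration` is a plain dict loaded from the HPO config file; the helper name `resolve_sweep_count` is hypothetical and does not exist in `execute_hpo.py`. It uses `dict.get`, which behaves the same as the conditional expression in the diff.

```python
def resolve_sweep_count(sweep_configuration):
    """Return the trial cap for the sweep agent, or None for unlimited.

    Same result as the expression in the diff:
    cfg['sweep_max_count'] if 'sweep_max_count' in cfg else None
    """
    return sweep_configuration.get("sweep_max_count")


# A config that caps the sweep at 1000 trials.
capped = {"sweep_name": "hpo_for_data-juicer", "sweep_max_count": 1000}
assert resolve_sweep_count(capped) == 1000

# A config that still uses the old key name is no longer capped.
legacy = {"sweep_name": "hpo_for_data-juicer", "sweep_count": 10}
assert resolve_sweep_count(legacy) is None
```

One consequence of the rename: a config file that still sets `sweep_count` is not rejected. The key is simply ignored, so `count` becomes `None` and the agent runs without a trial limit.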
