Commit

Merge branch 'main' of github.com:alibaba/data-juicer
BeachWang committed Mar 11, 2024
2 parents 5b5c636 + da6440a commit 8c1daf7
Showing 3 changed files with 16 additions and 8 deletions.
14 changes: 7 additions & 7 deletions docs/DJ_SORA.md
@@ -2,7 +2,7 @@ English | [中文页面](DJ_SORA_ZH.md)

---

-Data is the key to the unprecedented development of large multi-modal models such as SORA. How to obtain and process data efficiently and scientifically faces new challenges! DJ-SORA aims to create a series of large-scale, high-quality open source multi-modal data sets to assist the open source community in data understanding and model training.
+Data is the key to the unprecedented development of large multi-modal models such as SORA. How to obtain and process data efficiently and scientifically faces new challenges! DJ-SORA aims to create a series of large-scale, high-quality open-source multi-modal data sets to assist the open-source community in data understanding and model training.

DJ-SORA is based on Data-Juicer (including hundreds of dedicated video, image, audio, text and other multi-modal data processing [operators](Operators_ZH.md) and tools) to form a series of systematic and reusable Multimodal "data recipes" for analyzing, cleaning, and generating large-scale, high-quality multimodal data.

@@ -18,12 +18,12 @@ This project is being actively updated and maintained. We eagerly invite you to

# Roadmap
## Overview
-* [Support high-performance loading and processing of video data](#Support high-performance loading and processing of video data)
-* [Basic Operators (video spatio-temporal dimension)](#Basic operator video spatio-temporal dimension)
-* [Advanced Operators (fine-grained modal matching and data generation)](#Advanced operators fine-grained modal matching and data generation)
-* [Advanced Operators (Video Content)](#Advanced Operator Video Content)
-* [DJ-SORA Data Recipes and Datasets](#DJ-SORA Data Recipes and Datasets)
-* [DJ-SORA Data Validation and Model Training](#DJ-SORA Data Validation and Model Training)
+* [Support high-performance loading and processing of video data](#support-high-performance-loading-and-processing-of-video-data)
+* [Basic Operators (video spatio-temporal dimension)](#basic-operators-video-spatio-temporal-dimension)
+* [Advanced Operators (fine-grained modal matching and data generation)](#advanced-operators-fine-grained-modal-matching-and-data-generation)
+* [Advanced Operators (Video Content)](#advanced-operators-video-content)
+* [DJ-SORA Data Recipes and Datasets](#dj-sora-data-recipes-and-datasets)
+* [DJ-SORA Data Validation and Model Training](#dj-sora-data-validation-and-model-training)
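
The corrected anchors in this hunk follow the usual Markdown heading-slug pattern: lowercase, punctuation dropped, spaces turned into hyphens. The sketch below approximates that rule so the new link targets can be checked against their headings. The function name `heading_anchor` is hypothetical, and the rule is an approximation of how GitHub renders anchors, not its actual implementation.

```python
import re


def heading_anchor(heading: str) -> str:
    """Approximate a GitHub-style heading anchor (assumed rule, not GitHub's code)."""
    text = heading.strip().lower()
    # Drop everything except word characters, hyphens, and spaces.
    text = re.sub(r"[^\w\- ]", "", text)
    # Each remaining space becomes a hyphen.
    return text.replace(" ", "-")


for title in (
    "Basic Operators (video spatio-temporal dimension)",
    "DJ-SORA Data Recipes and Datasets",
):
    print(f"#{heading_anchor(title)}")
```

Under this rule the parentheses vanish and the existing hyphen in "spatio-temporal" is kept, which is consistent with the new anchors in the hunk. The old anchors contained spaces and capitals, so they could not match a slug of this form.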


## Support high-performance loading and processing of video data
2 changes: 1 addition & 1 deletion scripts/dlc/partition_data_dlc.py
@@ -28,7 +28,7 @@ def partition_data(json_file_path: str, hostnames: List[str]):
nodes_video_size[min_node] += video_sizes[video]

for hostname in hostnames:
-host_file_path = f"{json_file_path.rsplit('.', 1)[0]}_{hostname}.json"
+host_file_path = f"{json_file_path.rsplit('.', 1)[0]}_{hostname}.jsonl"
with open(host_file_path, 'w') as f:
for entry in nodes_data[hostname]:
f.write(json.dumps(entry) + '\n')
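
The loop in this hunk writes one JSON object per line, which is the JSON Lines layout, so the `.jsonl` suffix describes the output more accurately than `.json`. The following is a minimal, self-contained sketch of that naming and writing step only. The helper name `write_host_partitions`, the hostnames, and the sample entries are hypothetical, and the size-balancing logic that fills `nodes_data` in the real script is not reproduced here.

```python
import json
import os
import tempfile
from typing import Dict, List


def write_host_partitions(json_file_path: str,
                          nodes_data: Dict[str, List[dict]]) -> List[str]:
    """Write one JSON Lines file per hostname next to json_file_path."""
    # Strip only the final extension, as rsplit('.', 1)[0] does in the script.
    stem = json_file_path.rsplit('.', 1)[0]
    written = []
    for hostname, entries in nodes_data.items():
        host_file_path = f"{stem}_{hostname}.jsonl"
        with open(host_file_path, 'w') as f:
            for entry in entries:
                # One JSON object per line: the JSON Lines layout.
                f.write(json.dumps(entry) + '\n')
        written.append(host_file_path)
    return written


# Hypothetical input: two hosts with made-up video entries.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "videos.jsonl")
    paths = write_host_partitions(source, {
        "host-a": [{"videos": ["a.mp4"]}],
        "host-b": [{"videos": ["b.mp4"]}, {"videos": ["c.mp4"]}],
    })
    print([os.path.basename(p) for p in paths])
```

One property worth keeping in mind: `rsplit('.', 1)` splits at the last dot anywhere in the string, so an input path without an extension but with a dot in a directory name would be truncated at that dot.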
8 changes: 8 additions & 0 deletions scripts/dlc/run_on_dlc.sh
@@ -36,5 +36,13 @@ else
sed -i$SED_I_SUFFIX "s|\(dataset_path: '\)\(.*\)'\(.*\)|\1\2_$hostname'\3|" "$new_config_file"
fi

+if grep -q "export_path: .*\.json" "$new_config_file"; then
+# .json data_path
+sed -i$SED_I_SUFFIX "s|\(export_path: \)\(.*\)\(/[^/]*\)\(.json\)|\1\2\3_$hostname\4|" "$new_config_file"
+else
+# dir export_path
+sed -i$SED_I_SUFFIX "s|\(export_path: '\)\(.*\)'\(.*\)|\1\2_$hostname'\3|" "$new_config_file"
+fi
+
# run to process data
python tools/process_data.py --config "$new_config_file"
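
To make the effect of the two `export_path` substitutions easier to follow, here is a rough Python rendering of the same rewrite applied to a single config line. It is an illustration under assumptions, not part of the commit: the function name `add_hostname_to_export_path` and the sample paths are hypothetical, the branch is chosen per line rather than per file as the `grep` check does, and the dot in `.json` is escaped here whereas the `sed` pattern leaves it unescaped.

```python
import re


def add_hostname_to_export_path(line: str, hostname: str) -> str:
    """Insert _<hostname> into an export_path config line (illustrative only)."""
    if re.search(r"export_path: .*\.json", line):
        # File case: put the suffix between the file stem and ".json";
        # whatever follows ".json" (e.g. "l'" in ".jsonl'") is left untouched.
        return re.sub(
            r"(export_path: )(.*)(/[^/]*)(\.json)",
            lambda m: f"{m.group(1)}{m.group(2)}{m.group(3)}_{hostname}{m.group(4)}",
            line,
            count=1,
        )
    # Directory case: put the suffix just before the closing quote.
    return re.sub(
        r"(export_path: ')(.*)'(.*)",
        lambda m: f"{m.group(1)}{m.group(2)}_{hostname}'{m.group(3)}",
        line,
        count=1,
    )


print(add_hostname_to_export_path("export_path: './outputs/demo/processed.jsonl'", "host1"))
print(add_hostname_to_export_path("export_path: './outputs/demo'", "host1"))
```

With these inputs, each host gets its own export target (`processed_host1.jsonl` or `demo_host1`), so nodes running the same config do not write to the same path. Note that the file-case pattern requires a `/` before the file name, so a bare file name with no directory part is left unchanged.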
