Download data to todi in video2dataset format. #7

Closed
kdu4108 opened this issue Jul 3, 2024 · 2 comments · Fixed by swiss-ai/video2dataset#5


kdu4108 commented Jul 3, 2024

Download the HowTo100M and HD-VILA datasets in video2dataset format using the scripts and methods described in https://github.com/swiss-ai/video2dataset/tree/main/swiss_ai.

The data will be stored as tar shards with the following layout:

 ├── 00000.tar
 │     ├── 00000.mp4
 │     ├── 00000.txt
 │     ├── 00000.json
 │     ├── 00001.mp4
 │     ├── 00001.txt
 │     ├── 00001.json
 │     ├── ...
 │     ├── 10000.mp4
 │     ├── 10000.txt
 │     └── 10000.json
 ├── 00001.tar
 │     ├── 10001.mp4
 │     ├── 10001.txt
 │     ├── 10001.json
 │     └── ...
 └── ...

Child issue of #3.

kdu4108 changed the title from "Download data to todi and setup" to "Download data to todi in video2dataset format." on Jul 4, 2024
kdu4108 self-assigned this on Jul 5, 2024

kdu4108 commented Jul 10, 2024

Update: we have built a Docker image for containerized video2dataset (v2d) runs; it is available at https://github.com/swiss-ai/containers.

We ran small test jobs to tune the num_tasks and num_cpus_per_task settings and found that num_tasks=2 with num_cpus_per_task=128 worked best among the configurations we tried; using more tasks with fewer CPUs each performed worse.

To download the full datasets, follow these steps:

  1. Allocate and enter a compute node
    a. Run salloc -t240 to request a 240-minute allocation, and note the jobid for that allocation.
  2. From the login node, start a container from the Docker image described above and run the job inside that allocation, e.g. for howto100m:
srun --overlap --jobid <JOBID> --environment=/store/swissai/a08/containers/v2d/v2d.toml --container-workdir=$PWD video2dataset --url_list="/store/swissai/a08/data/raw/howto100m/v2d/howto100m_v2d.csv" --config="/store/swissai/a08/kdu/video2dataset/swiss_ai/configs/download_todi.yaml" --output_folder="/store/swissai/a08/data/raw/howto100m/v2d" --input_format="csv" --output_format="webdataset" --url_col="video_link" --encode_formats="{'video': 'mp4', 'audio':'m4a'}"

The exact download commands for each dataset can be found in the respective READMEs under https://github.com/swiss-ai/video2dataset/tree/kdu/todi/swiss_ai/datasets.
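
For scripting, the srun invocation from step 2 can be assembled as an argument list, which avoids shell quoting problems with the --encode_formats value. This is a minimal sketch: the helper name `build_srun_command` is hypothetical, the flags and paths are copied from the howto100m command above, and it uses the current working directory in place of `$PWD`.

```python
import os
import shlex

# Paths copied from the howto100m command in this issue.
ENVIRONMENT = "/store/swissai/a08/containers/v2d/v2d.toml"
CONFIG = "/store/swissai/a08/kdu/video2dataset/swiss_ai/configs/download_todi.yaml"


def build_srun_command(job_id, url_list, output_folder, url_col, workdir=None):
    """Return the srun + video2dataset argv for an existing salloc allocation.

    Hypothetical helper: it only rearranges the flags shown in the issue.
    """
    workdir = workdir or os.getcwd()  # stands in for $PWD in the shell version
    return [
        "srun", "--overlap", "--jobid", str(job_id),
        f"--environment={ENVIRONMENT}",
        f"--container-workdir={workdir}",
        "video2dataset",
        f"--url_list={url_list}",
        f"--config={CONFIG}",
        f"--output_folder={output_folder}",
        "--input_format=csv",
        "--output_format=webdataset",
        f"--url_col={url_col}",
        "--encode_formats={'video': 'mp4', 'audio':'m4a'}",
    ]


if __name__ == "__main__":
    cmd = build_srun_command(
        job_id="<JOBID>",  # replace with the jobid noted after salloc
        url_list="/store/swissai/a08/data/raw/howto100m/v2d/howto100m_v2d.csv",
        output_folder="/store/swissai/a08/data/raw/howto100m/v2d",
        url_col="video_link",
    )
    # Print a shell-quoted version for inspection or copy-paste.
    print(shlex.join(cmd))
```

From the login node, the resulting list could be passed to `subprocess.run(cmd, check=True)` once `<JOBID>` is replaced with the real allocation id.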


kdu4108 commented Jul 10, 2024

Most of the code for this lives in the video2dataset repo, so progress and code will be tracked in PR swiss-ai/video2dataset#5.
