
[PARENT ISSUE] Data preprocessing and pseudolabeling #3

Open · kdu4108 opened this issue Jun 21, 2024 · 7 comments

kdu4108 (Collaborator) commented Jun 21, 2024

We want to get video RGB, video RGB tokens, video bounding boxes, video transcriptions, and video descriptions downloaded in a format that matches what 4M expects. Maybe that's something like:

root/video_rgb/shard-00000.tar
root/video_rgb/shard-00001.tar
root/video_rgb/shard-00002.tar

root/video_tok_rgb/shard-00000.tar
root/video_tok_rgb/shard-00001.tar
root/video_tok_rgb/shard-00002.tar

root/video_det/shard-00000.tar
root/video_det/shard-00001.tar
root/video_det/shard-00002.tar

root/video_transcript/shard-00000.tar
root/video_transcript/shard-00001.tar
root/video_transcript/shard-00002.tar

root/video_description/shard-00000.tar
root/video_description/shard-00001.tar
root/video_description/shard-00002.tar

except I'm not sure, because maybe the text modalities should just be JSON Lines or something? This is very much just a suggestion. The first task is to decide what makes the most sense; the second is to implement it. Keep an eye on #1 as well, because that PR loads the data and will need to be updated to match the decisions made here (e.g., right now it assumes text is saved as JSONL, which I picked somewhat arbitrarily and is definitely up for change).

  • To get video_rgb, we just need to download using video2dataset, probably with some file-saving shenanigans to make it fit our naming/path formats/requirements.
  • To get video_tok_rgb, we need to run (for now) a pretrained tokenizer on the video_rgb files and save the outputs with the right file type and names/paths/etc.
  • To get video_det, we need to run the YOLO pseudolabeler on the video_rgb files and save the results appropriately (maybe as JSONL?); see the rough sketch after this list.
  • To get video_description, we need to run ???something??? on the video_rgb files and save the results appropriately (maybe as JSONL?).
  • To get video_transcript, we need to run Whisper on the video_rgb files and save the transcripts appropriately (maybe as JSONL?). (We could also start with the default YouTube captions as an easier first step so we don't bring Whisper into the mix yet.)
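
A rough sketch of what the video_det pseudolabeling could look like per video, assuming the ultralytics YOLO package and OpenCV for frame decoding (the model choice, helper name, and exact record fields are illustrative, not decided):

# Rough sketch: pseudolabel one video_rgb mp4 into a per-frame JSONL of bounding boxes.
# Assumes `pip install ultralytics opencv-python`; the schema and model choice are placeholders.
import json
import cv2
from ultralytics import YOLO

def pseudolabel_video(mp4_path: str, jsonl_path: str, model_name: str = "yolov8n.pt") -> None:
    model = YOLO(model_name)  # pretrained detector (placeholder choice)
    cap = cv2.VideoCapture(mp4_path)
    with open(jsonl_path, "w") as f:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = model(frame, verbose=False)[0]  # one result object per input image
            h, w = frame.shape[:2]
            instances = [
                {
                    "boxes": box.tolist(),  # normalized [x1, y1, x2, y2]
                    "score": float(conf),
                    "class_id": int(cls),
                    "class_name": model.names[int(cls)],
                }
                for box, conf, cls in zip(
                    result.boxes.xyxyn.cpu(), result.boxes.conf.cpu(), result.boxes.cls.cpu()
                )
            ]
            record = {
                "num_instances": len(instances),
                "image_height": h,
                "image_width": w,
                "instances": instances,
            }
            f.write(json.dumps(record) + "\n")  # one line per frame
    cap.release()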

Thank you @yahya for taking the lead on implementing these steps. @garjania, if you could provide feedback/suggestions on the right format for saving these things and how this corresponds with video2dataset, that'd be super helpful! I think one concrete unknown to resolve before making the decision is how video2dataset stores files: do we bend more to follow video2dataset, or do we use it as an intermediary to extract the captions, etc., and reshape them into this format for 4M? Also pinging @vesteinn in case you're familiar with v2d.

garjania commented

Regarding the save format, we can save them in any suitable format. Then, to convert them to tar files compatible with webdataset, I can probably provide you with a script that converts any data format into the tar format; it basically compresses clusters of sample points into tar shards.
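
For reference, a minimal sketch of what such a packing step could look like using webdataset's ShardWriter (directory names and shard size are placeholders; this is not the actual script mentioned above):

# Minimal sketch: pack per-sample files into webdataset-style tar shards.
# Directory layout and maxcount are illustrative.
import os
import webdataset as wds

def pack_jsonl_dir(src_dir: str, out_pattern: str, maxcount: int = 1000) -> None:
    """Pack every *.jsonl file in src_dir into shards like shard-00000.tar, shard-00001.tar, ..."""
    os.makedirs(os.path.dirname(out_pattern), exist_ok=True)
    with wds.ShardWriter(out_pattern, maxcount=maxcount) as sink:
        for name in sorted(os.listdir(src_dir)):
            if not name.endswith(".jsonl"):
                continue
            key = os.path.splitext(name)[0]  # e.g. "00000"
            with open(os.path.join(src_dir, name), "rb") as f:
                sink.write({"__key__": key, "jsonl": f.read()})

# e.g. pack_jsonl_dir("tmp/video_det", "root/video_det/shard-%05d.tar")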

kdu4108 (Collaborator, Author) commented Jul 3, 2024

This is the video2dataset (v2d) format:

 ├── 00000.tar
 |     ├── 00000.mp4
 |     ├── 00000.txt
 |     ├── 00000.json
 |     ├── 00001.mp4
 |     ├── 00001.txt
 |     ├── 00001.json
 |     ├── ...
 |     ├── 10000.mp4
 |     ├── 10000.txt
 |     └── 10000.json
 ├── 00001.tar
 |     ├── 10001.mp4
 |     ├── 10001.txt
 |     ├── 10001.json
 |     └── ...
 └── ...

Leveraging this, we want to pseudolabel/preprocess it into our format for each modality:

root/video_rgb/shard-00000.tar
 |     ├── 00000.mp4 # this corresponds to one video.
 |     ├── 00001.mp4
 |     └── ...

root/video_tok_rgb/shard-00000.tar
 |     ├── 00000.npy # this corresponds to one video. shape: something like (num_frames, H, W, C)
 |     ├── 00001.npy
 |     └── ...

root/video_det/shard-00000.tar
 |     ├── 00000.jsonl # this corresponds to one video. each line within it corresponds to one frame.
 |     ├── 00001.jsonl
 |     └── ...

root/video_transcript/shard-00000.tar
 |     ├── 00000.jsonl # this corresponds to one video. each line within it corresponds to one subsequence of frames.
 |     ├── 00001.jsonl
 |     └── ...

root/video_description/shard-00000.tar
 |     ├── 00000.jsonl # this corresponds to one video. each line within it corresponds to one subsequence of frames.
 |     ├── 00001.jsonl
 |     └── ...

Some more notes on each of the modality representations (a quick shard-inspection sketch follows this list):

  • video_rgb: each mp4 represents a different video. possibly there'll be other video formats too.
  • video_tok_rgb: each npy here represents the tokenizations of all frames in the corresponding mp4. the shape of *.npy here will be something like (num_frames, H, W, C).
  • video_det: each jsonl represents the bounding boxes for a video. The ith line within the jsonl is the bounding boxes for the ith frame of that video.
  • video_transcript: each jsonl represents the transcripts for a video. The ith line within the jsonl is the transcript for the ith subsequence of frames of that video. Note that the transcripts need not cover the frames consecutively, e.g., there can be a gap from frames 5-10 with no transcript.
  • video_description: like with transcripts, each jsonl represents the descriptions for a video. The ith line within the jsonl is the description for the ith subsequence of frames of that video. Note that, unlike transcripts, the descriptions are expected to cover the frames consecutively (no gaps).
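
To sanity-check a shard against these conventions, here is a quick inspection sketch using only the standard library plus numpy (the shard path and file names follow the illustrative layout above):

# Quick sanity check of one shard: list the first few members and peek at npy / jsonl entries.
import io
import json
import tarfile
import numpy as np

def inspect_shard(tar_path: str, max_members: int = 3) -> None:
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers()[:max_members]:
            data = tar.extractfile(member).read()
            if member.name.endswith(".npy"):
                arr = np.load(io.BytesIO(data))
                print(member.name, "array shape:", arr.shape)  # e.g. (num_frames, H, W, C)
            elif member.name.endswith(".jsonl"):
                first_record = json.loads(data.splitlines()[0])
                print(member.name, "first record keys:", sorted(first_record))
            else:
                print(member.name, f"{len(data)} bytes")

# e.g. inspect_shard("root/video_tok_rgb/shard-00000.tar")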

Some more details/examples of what the jsonl files should look like for the text-based modalities for a single video (shown here as annotated arrays for readability; on disk, each element would be one JSON object per line). A rough Whisper-based sketch for producing such records follows the examples.
video_det:

[
        # FRAME 0 Bounding boxes
        {
            "num_instances": 5,
            "image_height": 512,
            "image_width": 906,
            "instances": [
                {
                    "boxes": [
                        0.4229210317134857,
                        0.00020096010121051222,
                        0.5715101361274719,
                        0.13699540495872498
                    ],
                    "score": 0.9029952883720398,
                    "class_id": 74,
                    "class_name": "clock",
                    "segmentation": [
                        [
                            0.5055187637969095,
                            0.1337890625,
                            ...
                        ]
                    ]
                },
                {
                    "boxes": [
                        ...
                    ],
                    ...
                },
                    ...
            ]
        },
        # FRAME 1 Bounding boxes
        {
            "num_instances": 5,
            "image_height": 512,
            "image_width": 906,
            "instances": [
                ...,
            ],
            ...
        }
]

video_transcript:

[
            {
                "transcript": "here's a transcript",
                "start_frame_index": 0,
                "end_frame_index": 5,
            },
            {
                "transcript": "here's another transcript",
                "start_frame_index": 10,
                "end_frame_index": 13,
            } 
]

video_description:

[
            {
                "description": "here's a description",
                "start_frame_index": 0,
                "end_frame_index": 5,
            },
            {
                "description": "here's another description",
                "start_frame_index": 5,
                "end_frame_index": 12,
            } 
]
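
As one way to produce records in the video_transcript schema above, here is a rough sketch using openai-whisper, assuming the video's frame rate is known so Whisper's second-based timestamps can be mapped to frame indices (the model size, rounding, and helper name are all placeholders):

# Rough sketch: transcribe one video with Whisper and write video_transcript-style JSONL.
# Assumes `pip install openai-whisper`; fps handling and rounding are illustrative choices.
import json
import whisper

def transcribe_to_jsonl(mp4_path: str, jsonl_path: str, fps: float, model_size: str = "base") -> None:
    model = whisper.load_model(model_size)
    result = model.transcribe(mp4_path)
    with open(jsonl_path, "w") as f:
        for segment in result["segments"]:  # each segment has start/end in seconds and the text
            record = {
                "transcript": segment["text"].strip(),
                "start_frame_index": int(segment["start"] * fps),
                "end_frame_index": int(segment["end"] * fps),
            }
            f.write(json.dumps(record) + "\n")  # one subsequence of frames per line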

kdu4108 (Collaborator, Author) commented Jul 4, 2024

Let's break down the steps here a little bit to get from start to finish.

Step 1: Download data in v2d format (#7).
Step 2: Transform from v2d format into video_rgb format and save in video_rgb/ directory (#10). (A rough sketch of this repacking step follows below.)
Step 3: Transform from video_rgb format into video_tok_rgb format and save in video_tok_rgb/ directory (#9).
Step 4: Transform from video_rgb format into video_det format and save in video_det/ directory (#11).
Step 5: Transform from v2d format into video_transcript format and save in video_transcript/ directory (#12).
Step 6: Transform from v2d format into video_description format and save in video_description/ directory (#13).
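
For Step 2, a minimal sketch of the repacking, assuming the v2d shards look like the listing in the earlier comment (only the .mp4 members are kept; shard naming follows the proposed layout and is otherwise a placeholder):

# Minimal sketch of Step 2: copy only the .mp4 members of a v2d shard into a video_rgb shard.
# Input/output naming follows the layouts proposed above; everything else is illustrative.
import os
import tarfile

def v2d_to_video_rgb(v2d_tar: str, out_dir: str, shard_index: int) -> str:
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, f"shard-{shard_index:05d}.tar")
    with tarfile.open(v2d_tar) as src, tarfile.open(out_path, "w") as dst:
        for member in src.getmembers():
            if member.name.endswith(".mp4"):
                dst.addfile(member, src.extractfile(member))  # drop the .txt/.json sidecars
    return out_path

# e.g. v2d_to_video_rgb("v2d/00000.tar", "root/video_rgb", 0)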

[image: dependency graph of the data transformations]
This image summarizes the dependency graph of which data type is transformed into which (as well as what the file representations within those data types/modalities are).

kdu4108 (Collaborator, Author) commented Jul 5, 2024

Let's use /store/swissai/a08/data/4m-data as the root dir for storing the data.
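
Combined with the per-modality layout proposed above, that gives concrete paths like:

/store/swissai/a08/data/4m-data/video_rgb/shard-00000.tar
/store/swissai/a08/data/4m-data/video_tok_rgb/shard-00000.tar
/store/swissai/a08/data/4m-data/video_det/shard-00000.tar
/store/swissai/a08/data/4m-data/video_transcript/shard-00000.tar
/store/swissai/a08/data/4m-data/video_description/shard-00000.tar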

kdu4108 changed the title from "Data preprocessing and pseudolabeling" to "[PARENT ISSUE] Data preprocessing and pseudolabeling" on Jul 19, 2024

kdu4108 (Collaborator, Author) commented Aug 2, 2024

Updated data design
[image: updated data design diagram]

kdu4108 (Collaborator, Author) commented Aug 2, 2024

Also, 4M allows for specifying multiple datasets, so we don't need to actually combine them into one big pool! See ml-4m/cfgs/default/4m/data/cc12m+coyo+c4/main/mix_mod21_all2allmix_rgb2all_capT5bias_C4.yaml for an example.
