[PARENT ISSUE] Data preprocessing and pseudolabeling #3
Regarding the save format, we can save them in any suitable format. Then, to convert them to tar files to make them compatible with webdataset, I can probably provide you a script that converts any data format into the tar format. It basically compresses clusters of sample points into tar shards.
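For reference, a rough sketch of what such a conversion could look like using webdataset's `ShardWriter` (the paths, sample keys, and shard size here are just placeholders, not the actual script):

```python
# Sketch: pack per-sample files into webdataset-compatible tar shards.
# Paths, sample keys, and shard size are placeholders.
import json
import webdataset as wds

samples = [
    {"key": "video_00001", "mp4_path": "clips/video_00001.mp4", "meta": {"id": "00001"}},
    # ... more samples ...
]

with wds.ShardWriter("shards/shard-%05d.tar", maxcount=1000) as sink:
    for s in samples:
        with open(s["mp4_path"], "rb") as f:
            video_bytes = f.read()
        sink.write({
            "__key__": s["key"],             # groups the files belonging to one sample
            "mp4": video_bytes,              # raw video bytes
            "json": json.dumps(s["meta"]),   # arbitrary per-sample metadata
        })
```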
This is the video2dataset (v2d) format:
Leveraging this, we want to pseudolabel/preprocess that into our format for each modality:
Some more notes on each of the modality representations:
Some more details/examples on what the jsonl files should look like for the text-based modalities for a single video:
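As a rough illustration only (the field names below are assumptions, not a decided schema), one line per segment of a single video might look like:

```jsonl
{"video_id": "00001", "start": 0.0, "end": 8.0, "text": "a person opens a laptop on a desk"}
{"video_id": "00001", "start": 8.0, "end": 15.5, "text": "they start typing while talking to the camera"}
```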
Let's break down the steps here a little bit to get from start to finish. Step 1: Download data in v2d format (#7).
Let's use
This change requires:
Also, 4M allows for specifying multiple datasets, so we don't need to actually combine them into one big pool!
We want to get video RGB, video RGB tokens, video bounding boxes, video transcriptions, and video descriptions downloaded in a format that matches what 4M expects. Maybe that's something like the per-modality layout sketched below, except I'm not sure, because maybe the text should just be JSON Lines or something? This is very much just a suggestion. The first task is to decide what makes the most sense; the second is to implement it. Keep an eye also on #1, because that loads the data and that PR will need to be fixed to correspond to the decisions made here (e.g., right now it assumes text is saved as JSONL, which I picked somewhat arbitrarily and is definitely up for change).
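For concreteness, here is one possible shape of that layout (directory and shard names are placeholders, and whether the text modalities end up as tars of JSONL files or something else is exactly the open question):

```
dataset_root/
├── video_rgb/          shard-00000.tar  shard-00001.tar  ...
├── video_tok_rgb/      shard-00000.tar  ...
├── video_det/          shard-00000.tar  ...
├── video_transcript/   shard-00000.tar  ...
└── video_description/  shard-00000.tar  ...
```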
To get video_rgb, we just need to download using video2dataset, probably, with some file saving shenanigans to make it fit our naming/path formats/requirements.
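A minimal sketch of what that video2dataset call could look like (the input columns and config values here are assumptions and would need checking against the actual URL list and the v2d docs):

```python
# Sketch: download raw clips into webdataset-style tar shards with video2dataset.
# Column names and output paths are assumptions, not final choices.
from video2dataset import video2dataset

video2dataset(
    url_list="video_urls.csv",                 # list of video URLs (+ optional metadata)
    input_format="csv",
    url_col="url",
    caption_col="caption",
    output_folder="dataset_root/video_rgb",
    output_format="webdataset",                # tar shards, directly loadable with webdataset
)
```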
To get video_tok_rgb, we need to run a (for now) pretrained tokenizer on the video_rgb files and save the tokens appropriately, with the right file type and names/paths/etc.
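A sketch of what that could look like, assuming a frame-level VQ-style tokenizer with an `encode`-style interface (the tokenizer loading and its API are hypothetical here, and the save format is still open):

```python
# Sketch: run a pretrained tokenizer over video_rgb frames and save token ids per clip.
# load_pretrained_tokenizer() and the .encode() interface are hypothetical placeholders.
import numpy as np
import torch

tokenizer = load_pretrained_tokenizer()  # hypothetical helper, e.g. a VQ-VAE / ViT-VQGAN encoder

def tokenize_clip(frames: torch.Tensor) -> np.ndarray:
    """frames: (T, C, H, W) float tensor in [0, 1]."""
    with torch.no_grad():
        tokens = tokenizer.encode(frames)  # assumed to return (T, h*w) discrete token ids
    return tokens.cpu().numpy().astype(np.int32)

# e.g. np.save("dataset_root/video_tok_rgb/00001.npy", tokenize_clip(frames))
```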
To get video_det, we need to run the YOLO pseudolabeler on the video_rgb files and save the results appropriately (maybe as JSONL?).
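A sketch of that detection pseudolabeling, assuming the ultralytics YOLO package and JSONL output (the actual pseudolabeler and output schema may differ):

```python
# Sketch: pseudolabel bounding boxes on sampled frames with a YOLO model
# and write one JSONL line per frame. Model choice and schema are assumptions.
import json
from ultralytics import YOLO

model = YOLO("yolov8x.pt")

def detect_frames(frame_paths, out_jsonl):
    with open(out_jsonl, "w") as f:
        for i, path in enumerate(frame_paths):
            result = model(path, verbose=False)[0]
            boxes = [
                {
                    "label": model.names[int(cls)],
                    "xyxy": [float(v) for v in xyxy],
                    "conf": float(conf),
                }
                for xyxy, cls, conf in zip(
                    result.boxes.xyxy, result.boxes.cls, result.boxes.conf
                )
            ]
            f.write(json.dumps({"frame_index": i, "boxes": boxes}) + "\n")
```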
To get video_description, we need to run ???something??? on the video_rgb files and save appropriately (maybe as JSONL?)
To get the video transcriptions, we need to run Whisper on the video_rgb files and save the transcripts appropriately (maybe as JSONL?). (We can also start with the default YouTube captions as an easier first step, so we don't bring Whisper into the mix yet.)
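A sketch of the Whisper route, writing one JSONL line per transcript segment (model size and field names are placeholders):

```python
# Sketch: transcribe audio with openai-whisper and dump the segments as JSONL.
import json
import whisper

model = whisper.load_model("base")  # model size is a placeholder

def transcribe_to_jsonl(video_path, out_jsonl, video_id):
    result = model.transcribe(video_path)
    with open(out_jsonl, "w") as f:
        for seg in result["segments"]:
            f.write(json.dumps({
                "video_id": video_id,
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"].strip(),
            }) + "\n")
```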
Thank you @yahya for taking the lead on implementing these steps. @garjania, if you could provide feedback/suggestions on the right format for saving these things and on how this corresponds with video2dataset, that'd be super helpful! I think one concrete unknown to pursue in making the decision is to first look at how video2dataset stores files and decide whether we bend more to follow video2dataset, or whether we use it as an intermediary to extract the captions, etc., and form them into this format for 4M. Also @vesteinn, in case you're familiar with v2d?