Could you explain the details of data selection for the large-scale dataset? #1

SCZwangxiao · 2022-11-23T15:51:11Z

Excellent work! I have a question about data selection.

In the dataset section, you adopted data preprocessing and filtering to speed up training.

What is the preprocessing and filtering strategy? Since pre-trained models generally follow data scaling laws, I think it would make a big difference to the results.

zengyan-97 · 2022-11-24T02:46:33Z

Thanks for your reminder. This part will be added to the updated paper.

In fact, we didn’t do any preprocessing. We only did filtering to speed up pre-training.

For LAION, we used English data only. Following BLIP, we removed an image if the shorter edge is smaller than 224 pixels. We also removed an image if (height/width) or (width/height) is larger than 3.
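As a rough illustration of the two size rules above (not the actual filtering script), a minimal sketch could look like the following. The function name `keep_image` and its parameter names are hypothetical, and the English-only language filter is not covered because it depends on how language metadata is stored.

```python
def keep_image(width: int, height: int,
               min_short_edge: int = 224,
               max_aspect_ratio: float = 3.0) -> bool:
    """Return True if an image passes the size rules described above.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Reject malformed size metadata before dividing.
    if width <= 0 or height <= 0:
        return False
    short_edge = min(width, height)
    long_edge = max(width, height)
    # Rule 1: remove the image if the shorter edge is smaller than 224 pixels.
    if short_edge < min_short_edge:
        return False
    # Rule 2: remove the image if height/width or width/height is larger than 3.
    # long_edge / short_edge covers both orientations at once.
    if long_edge / short_edge > max_aspect_ratio:
        return False
    return True
```

For example, `keep_image(640, 480)` keeps the image, while `keep_image(1000, 300)` drops it because the aspect ratio exceeds 3.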

For video clip-text pairs, we removed a pair if the text has fewer than 2 words. Following previous work (I don’t remember which one…I need to check it later), we used the CLIP score to filter data. We sampled one frame from each video clip and calculated the CLIP score between that frame and the text. We removed a video clip-text pair if the score is less than 0.25.
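A minimal sketch of the video clip-text rules, under some loud assumptions: the CLIP score is taken to be the cosine similarity between the CLIP embedding of the sampled frame and the CLIP embedding of the text, both embeddings are assumed to be precomputed elsewhere, and words are counted by whitespace splitting. The names `cosine_similarity` and `keep_video_text_pair` are hypothetical, not from the actual code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_video_text_pair(text: str,
                         frame_embedding: Sequence[float],
                         text_embedding: Sequence[float],
                         min_words: int = 2,
                         min_clip_score: float = 0.25) -> bool:
    """Return True if a video clip-text pair passes both rules described above.

    Assumption: the embeddings come from a CLIP model applied to one sampled
    frame and to the text; computing them is outside this sketch.
    """
    # Rule 1: remove the pair if the text has fewer than 2 words.
    if len(text.split()) < min_words:
        return False
    # Rule 2: remove the pair if the (assumed cosine) CLIP score is below 0.25.
    return cosine_similarity(frame_embedding, text_embedding) >= min_clip_score
```

Checking the word count first avoids computing a score for pairs that would be dropped anyway.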

SCZwangxiao closed this as completed Nov 24, 2022
