Could you explain the details of data selection for large scale dataset? #1

Closed
SCZwangxiao opened this issue Nov 23, 2022 · 1 comment

Comments

@SCZwangxiao

Excellent work! I have a question about data selection.

In the dataset section, you mention that you applied data preprocessing and filtering to speed up training.

What is the preprocessing and filtering strategy? Since pre-trained models generally follow data scaling laws, I think it would make a great difference to the results.

@zengyan-97
Owner

zengyan-97 commented Nov 24, 2022

Thanks for your reminder. This part will be added to the updated paper.

In fact, we didn’t do any preprocessing. We only did filtering to speed up pre-training.

For LAION, we used English data only. Following BLIP, we removed an image if its shorter edge is smaller than 224 pixels. We also removed an image if its aspect ratio, (height/width) or (width/height), is larger than 3.
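
The two image-size rules above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for the paper: the function name `keep_image` and its parameters are hypothetical, the thresholds are the ones quoted above, and the English-only step is left out because the thread does not say how language was determined.

```python
def keep_image(width: int, height: int,
               min_short_edge: int = 224,
               max_aspect_ratio: float = 3.0) -> bool:
    """Return True if an image passes both size filters described above.

    `keep_image` and its parameter names are hypothetical; only the
    thresholds (224 pixels, ratio 3) come from the thread.
    """
    # Assumption: images with invalid dimensions are rejected outright.
    if width <= 0 or height <= 0:
        return False

    shorter, longer = min(width, height), max(width, height)

    # Rule 1: remove the image if the shorter edge is smaller than 224 pixels.
    if shorter < min_short_edge:
        return False

    # Rule 2: remove the image if height/width or width/height is larger
    # than 3. Taking longer/shorter covers both directions at once.
    if longer / shorter > max_aspect_ratio:
        return False

    return True
```

Since both comparisons are strict, an image of exactly 224 × 672 pixels (ratio exactly 3) is kept, while 224 × 673 is removed.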

For video clip-text pairs, we removed a pair if the text has fewer than 2 words. Following previous work (I don’t remember which one…I need to check it later), we used the CLIP score to filter data. We sampled one frame from each video clip and calculated the CLIP score between that frame and the text. We removed a video clip-text pair if the score is less than 0.25.
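
A rough sketch of the video clip-text filter, again under assumptions that the thread does not confirm: words are counted by whitespace splitting, and the CLIP score is taken to be the cosine similarity between the frame embedding and the text embedding. The embeddings themselves would come from a CLIP model, which is not shown here. The names `cosine_similarity` and `keep_video_text_pair` are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors.

    Assumption: the "CLIP score" is this similarity between the sampled
    frame's image embedding and the text embedding.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector scores 0
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


def keep_video_text_pair(text: str, clip_score: float,
                         min_words: int = 2,
                         min_clip_score: float = 0.25) -> bool:
    """Return True if a video clip-text pair passes both filters above."""
    # Rule 1: remove the pair if the text has fewer than 2 words.
    # Assumption: words are whitespace-separated tokens.
    if len(text.split()) < min_words:
        return False

    # Rule 2: remove the pair if the CLIP score between the sampled frame
    # and the text is less than 0.25.
    if clip_score < min_clip_score:
        return False

    return True
```

In use, `clip_score` would be computed once per pair from the single sampled frame, for example `cosine_similarity(frame_embedding, text_embedding)`, and then passed to `keep_video_text_pair`.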
