Help us build a billion-scale image-caption dataset by filtering Common Crawl with OpenAI CLIP
## Setup
- `git clone https://github.com/christophschuhmann/crawlingathome-worker`, to download headless-crawlingathome.
- `cd crawlingathome-worker`, to enter the directory.
- `python3 -m venv venv && . venv/bin/activate`, to create a virtual environment (not needed if the machine is dedicated solely to this purpose).
- `bash setup.sh`, to install dependencies.
- `python3 crawlingathome.py`, to start Crawling!
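
At its core, the worker scores each candidate image-caption pair with CLIP and keeps only pairs whose similarity clears a cutoff. Below is a minimal sketch of that idea, assuming the `openai/CLIP` package (`pip install git+https://github.com/openai/CLIP.git`); the `ViT-B/32` variant, the `keep_pair` helper, and the `0.3` threshold are illustrative guesses, not values taken from this repo.

```python
# Hedged sketch of the CLIP filtering step (model, helper name, and
# threshold are assumptions, not the project's confirmed choices).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def keep_pair(image_path: str, caption: str, threshold: float = 0.3) -> bool:
    """Return True if the image and its candidate caption are similar enough."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption], truncate=True).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
    # Normalize so the dot product equals cosine similarity
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item() > threshold
```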
## Droplet Setup
- Use the `cloud-config.yaml` script to initialize the droplet. Remember to change line 9 to your SSH public key.
- SSH in as user `crawl` and check on the script by running `tail -f crawl.log`.
## TODO
- Save image embeddings (see the embedding sketch below)
- Convert images to TFRecords (see the TFRecord sketch below)
- Upload to Google Drive (see the upload sketch below)
- Prevent corrupt images from being processed (see the validity-check sketch below)
- Shard the chunk (a worker currently has to read the entire WAT file, which is bad for low-RAM servers; see the streaming sketch below)
- Crawling@Home integration
- Verify output (see the verification sketch below)
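
Hedged sketches for several of these items follow. For saving image embeddings, one way it could look is to batch-encode images with CLIP and persist the normalized features as a NumPy array; the function name, batch size, and `.npy` layout are assumptions for illustration.

```python
# Assumed layout: one float32 array of shape (N, 512) per shard.
import clip
import numpy as np
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_and_save(image_paths, out_path, batch_size=256):
    """Encode images in batches and save normalized CLIP features as one array."""
    chunks = []
    with torch.no_grad():
        for i in range(0, len(image_paths), batch_size):
            batch = torch.stack(
                [preprocess(Image.open(p)) for p in image_paths[i:i + batch_size]]
            ).to(device)
            feats = model.encode_image(batch)
            feats /= feats.norm(dim=-1, keepdim=True)  # unit-normalize for cosine use
            chunks.append(feats.cpu())
    np.save(out_path, torch.cat(chunks).numpy().astype("float32"))
```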
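For the TFRecord conversion, a sketch of how filtered pairs might be packed into a shard; the feature names (`image`, `caption`) and the helper names are assumptions, not the repo's actual schema.

```python
import tensorflow as tf

def pair_example(image_bytes: bytes, caption: str) -> tf.train.Example:
    """Wrap one image/caption pair as a tf.train.Example (assumed schema)."""
    feature = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "caption": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[caption.encode("utf-8")])
        ),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

def write_shard(pairs, out_path: str) -> None:
    """pairs: iterable of (image_path, caption) tuples; writes one shard."""
    with tf.io.TFRecordWriter(out_path) as writer:
        for path, caption in pairs:
            with open(path, "rb") as f:
                writer.write(pair_example(f.read(), caption).SerializeToString())
```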
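For the Google Drive upload, one possible sketch using the official `google-api-python-client` with a service account; the credential file name and folder ID are placeholders, and the repo may authenticate differently.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

def upload_to_drive(local_path: str, folder_id: str,
                    creds_file: str = "service-account.json") -> str:
    """Resumable upload of one shard into a Drive folder; returns the file id.
    creds_file and folder_id are placeholders, not values from this repo."""
    creds = service_account.Credentials.from_service_account_file(
        creds_file, scopes=["https://www.googleapis.com/auth/drive.file"]
    )
    drive = build("drive", "v3", credentials=creds)
    media = MediaFileUpload(local_path, resumable=True)
    request = drive.files().create(
        body={"name": local_path.split("/")[-1], "parents": [folder_id]},
        media_body=media,
        fields="id",
    )
    return request.execute()["id"]
```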
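For skipping corrupt images, a common Pillow-based check; `is_valid_image` is an illustrative helper, not necessarily the repo's implementation.

```python
from PIL import Image, UnidentifiedImageError

def is_valid_image(path: str) -> bool:
    """Cheaply reject files Pillow cannot parse, before the expensive CLIP pass."""
    try:
        with Image.open(path) as img:
            img.verify()  # integrity check without fully decoding pixels
        # verify() leaves the handle unusable, so reopen for a real decode test
        with Image.open(path) as img:
            img.load()
        return True
    except (UnidentifiedImageError, OSError):
        return False
```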
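On the sharding concern: WAT files are WARC-format containers of JSON metadata records, so they can be streamed record-by-record instead of being read wholly into memory. A sketch with the `warcio` library (an assumption; the repo may parse WATs differently):

```python
import json
from warcio.archiveiterator import ArchiveIterator

def iter_wat_records(wat_path: str):
    """Stream WAT metadata records one at a time, yielding parsed JSON payloads,
    so memory use stays flat regardless of file size."""
    with open(wat_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "metadata":
                yield json.loads(record.content_stream().read())
```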
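Finally, for output verification, one simple check is to re-read each shard and confirm every record parses and the count matches what was written; `verify_shard` is an illustrative helper.

```python
import tensorflow as tf

def verify_shard(path: str, expected: int) -> bool:
    """Parse every record in a TFRecord shard and compare the count."""
    n = 0
    for raw in tf.data.TFRecordDataset(path):
        example = tf.train.Example()
        example.ParseFromString(raw.numpy())  # raises if the record is malformed
        n += 1
    return n == expected
```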