Help us build a billion-scale image-caption dataset by filtering Common Crawl with OpenAI CLIP
## Setup
- `git clone https://github.com/christophschuhmann/crawlingathome-worker`, to download headless-crawlingathome.
- `cd crawlingathome-worker`, to enter the directory.
- `python3 -m venv venv && . venv/bin/activate`, to create a virtual environment (not needed if the machine is dedicated solely to this purpose).
- `bash setup.sh`, to install dependencies.
- `python3 crawlingathome.py`, to start Crawling!
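
At its core, the worker scores each candidate image-caption pair with CLIP and keeps only pairs whose similarity clears a cutoff. Below is a minimal sketch of that idea, assuming the `openai/CLIP` package (`pip install git+https://github.com/openai/CLIP.git`); the `ViT-B/32` variant, the `keep_pair` helper, and the `0.3` threshold are illustrative guesses, not values taken from this repo.

```python
# Hedged sketch of the CLIP filtering step (model, helper name, and
# threshold are assumptions, not the project's confirmed choices).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def keep_pair(image_path: str, caption: str, threshold: float = 0.3) -> bool:
    """Return True if the image and its candidate caption are similar enough."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption], truncate=True).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
    # Normalize so the dot product equals cosine similarity
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item() > threshold
```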
## Droplet Setup
- Use the `cloud-config.yaml` script to initialize the droplet. Remember to change line 9 to your SSH public key.
- SSH in as user `crawl` and check on the script by running `tail -f crawl.log`.
## TODO
- Save image embeddings (see the embedding sketch below)
- Convert images to TFRecords (see the TFRecord sketch below)
- Upload to Google Drive (see the upload sketch below)
- Prevent corrupt images from being processed (see the validity-check sketch below)
- Shard the chunk (a worker currently has to read the entire WAT file, which is bad for low-RAM servers; see the streaming sketch below)
- Crawling@Home integration
- Verify output (see the verification sketch below)
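
Hedged sketches for several of these items follow. For saving image embeddings, one way it could look is to batch-encode images with CLIP and persist the normalized features as a NumPy array; the function name, batch size, and `.npy` layout are assumptions for illustration.

```python
# Assumed layout: one float32 array of shape (N, 512) per shard.
import clip
import numpy as np
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_and_save(image_paths, out_path, batch_size=256):
    """Encode images in batches and save normalized CLIP features as one array."""
    chunks = []
    with torch.no_grad():
        for i in range(0, len(image_paths), batch_size):
            batch = torch.stack(
                [preprocess(Image.open(p)) for p in image_paths[i:i + batch_size]]
            ).to(device)
            feats = model.encode_image(batch)
            feats /= feats.norm(dim=-1, keepdim=True)  # unit-normalize for cosine use
            chunks.append(feats.cpu())
    np.save(out_path, torch.cat(chunks).numpy().astype("float32"))
```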
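For the TFRecord conversion, a sketch of how filtered pairs might be packed into a shard; the feature names (`image`, `caption`) and the helper names are assumptions, not the repo's actual schema.

```python
import tensorflow as tf

def pair_example(image_bytes: bytes, caption: str) -> tf.train.Example:
    """Wrap one image/caption pair as a tf.train.Example (assumed schema)."""
    feature = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "caption": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[caption.encode("utf-8")])
        ),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

def write_shard(pairs, out_path: str) -> None:
    """pairs: iterable of (image_path, caption) tuples; writes one shard."""
    with tf.io.TFRecordWriter(out_path) as writer:
        for path, caption in pairs:
            with open(path, "rb") as f:
                writer.write(pair_example(f.read(), caption).SerializeToString())
```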
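For the Google Drive upload, one possible sketch using the official `google-api-python-client` with a service account; the credential file name and folder ID are placeholders, and the repo may authenticate differently.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

def upload_to_drive(local_path: str, folder_id: str,
                    creds_file: str = "service-account.json") -> str:
    """Resumable upload of one shard into a Drive folder; returns the file id.
    creds_file and folder_id are placeholders, not values from this repo."""
    creds = service_account.Credentials.from_service_account_file(
        creds_file, scopes=["https://www.googleapis.com/auth/drive.file"]
    )
    drive = build("drive", "v3", credentials=creds)
    media = MediaFileUpload(local_path, resumable=True)
    request = drive.files().create(
        body={"name": local_path.split("/")[-1], "parents": [folder_id]},
        media_body=media,
        fields="id",
    )
    return request.execute()["id"]
```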
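For skipping corrupt images, a common Pillow-based check; `is_valid_image` is an illustrative helper, not necessarily the repo's implementation.

```python
from PIL import Image, UnidentifiedImageError

def is_valid_image(path: str) -> bool:
    """Cheaply reject files Pillow cannot parse, before the expensive CLIP pass."""
    try:
        with Image.open(path) as img:
            img.verify()  # integrity check without fully decoding pixels
        # verify() leaves the handle unusable, so reopen for a real decode test
        with Image.open(path) as img:
            img.load()
        return True
    except (UnidentifiedImageError, OSError):
        return False
```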
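On the sharding concern: WAT files are WARC-format containers of JSON metadata records, so they can be streamed record-by-record instead of being read wholly into memory. A sketch with the `warcio` library (an assumption; the repo may parse WATs differently):

```python
import json
from warcio.archiveiterator import ArchiveIterator

def iter_wat_records(wat_path: str):
    """Stream WAT metadata records one at a time, yielding parsed JSON payloads,
    so memory use stays flat regardless of file size."""
    with open(wat_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "metadata":
                yield json.loads(record.content_stream().read())
```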
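Finally, for output verification, one simple check is to re-read each shard and confirm every record parses and the count matches what was written; `verify_shard` is an illustrative helper.

```python
import tensorflow as tf

def verify_shard(path: str, expected: int) -> bool:
    """Parse every record in a TFRecord shard and compare the count."""
    n = 0
    for raw in tf.data.TFRecordDataset(path):
        example = tf.train.Example()
        example.ParseFromString(raw.numpy())  # raises if the record is malformed
        n += 1
    return n == expected
```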