christophschuhmann/crawlingathome-worker

 
 


Crawling@Home

Help us build a billion-scale image-caption dataset by filtering Common Crawl with OpenAI CLIP
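The filtering idea is simple: CLIP embeds each image and its candidate caption into the same vector space, and only pairs whose embeddings are similar enough are kept. A minimal sketch of that thresholding step (the function names and the 0.3 cutoff are illustrative, not taken from this repo; in the real worker the vectors would come from CLIP's image and text encoders):

```python
import math

# CLIP maps images and captions into a shared embedding space; a pair is
# kept when the cosine similarity of its two embeddings clears a threshold.
# The embeddings here are plain lists so the logic is self-contained.

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def keep_pair(image_emb, text_emb, threshold=0.3):
    """Return True if the image-caption pair passes the similarity filter.

    The 0.3 threshold is illustrative, not the project's actual cutoff.
    """
    return cosine_similarity(image_emb, text_emb) >= threshold
```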

Setup

  1. `git clone https://github.com/christophschuhmann/crawlingathome-worker` to download headless-crawlingathome.
  2. `cd crawlingathome-worker` to enter the directory.
  3. `python3 -m venv venv && . venv/bin/activate` to create and activate a virtual environment (not needed if the machine is dedicated to this purpose).
  4. `. setup.sh` to install dependencies.
  5. `python3 crawlingathome.py` to start crawling!

Droplet Setup

  1. Use the cloud-config.yaml script to initialize the droplet. Remember to change line 9 to your own SSH key.
  2. SSH in as user `crawl` and check on the script by running `tail -f crawl.log`.

TODO

  • Save image embeddings
  • Convert images to TFRecords
  • Upload to Google Drive
  • Prevent corrupt images from being processed
  • Shard the chunks (currently the entire WAT file must be read, which is bad for low-RAM servers)
  • Crawling@Home integration
  • Verify output
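The corrupt-image item above can be sketched with Pillow's `Image.verify()`, which raises an exception on files whose structure is broken. The helper name `is_valid_image` is hypothetical, not part of this repo, and `verify()` only checks file integrity, so the image must be reopened before actual use:

```python
import io

from PIL import Image  # Pillow

def is_valid_image(data: bytes) -> bool:
    """Return True if the bytes look like a readable image file.

    Image.verify() inspects the file structure without decoding the
    full pixel data; it raises on truncated or corrupt files.
    """
    try:
        img = Image.open(io.BytesIO(data))
        img.verify()
        return True
    except Exception:
        # Broken header, truncated body, or not an image at all.
        return False
```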

