Please note that this is a work in progress. If you just want to use the script, use this.
- The user logs in with their credentials, uploads a file containing a list of domains in the format described below, and schedules a crawl.
- Once crawling is done, the user can download a zip file containing all of the ads.txt content in CSV format.
- On the backend, every scheduled crawl is registered as a job in a RabbitMQ queue.
- Multiple workers are spawned to process crawl jobs in parallel (a minimal worker sketch follows below).
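For context, a worker in this setup might look roughly like the minimal sketch below, which uses the `pika` client. The queue name `crawl_jobs` and the one-domain-per-message format are assumptions for illustration, not the app's actual contract; spawning several such processes is what provides the parallelism described above.

```python
# Minimal worker sketch, assuming a RabbitMQ queue named "crawl_jobs"
# whose messages each carry one domain name. Both assumptions are
# illustrative; the real job format may differ.
import pika

def handle_job(channel, method, properties, body):
    domain = body.decode("utf-8").strip()
    print(f"Crawling ads.txt for {domain}")
    # ... run the spider for this domain here ...
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="crawl_jobs", durable=True)
channel.basic_qos(prefetch_count=1)  # hand each worker one job at a time
channel.basic_consume(queue="crawl_jobs", on_message_callback=handle_job)
channel.start_consuming()
```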
NOTE: The list of domains must have one domain per line, for example:
domain1.com
domain2.in
www.domain3.net
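To make the expected format concrete, the sketch below reads such a file, skipping blank lines. The `load_domains` helper and its loose validation regex are illustrative and not part of the app.

```python
# Sketch of reading the uploaded domain list: one domain per line,
# blank lines ignored. The validation regex is an assumption, not
# the app's actual check.
import re

DOMAIN_RE = re.compile(r"^(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}$")

def load_domains(path):
    domains = []
    with open(path) as f:
        for line in f:
            domain = line.strip()
            if not domain:
                continue  # allow blank lines
            if not DOMAIN_RE.match(domain):
                raise ValueError(f"Invalid domain: {domain!r}")
            domains.append(domain)
    return domains

print(load_domains("domains.txt"))
```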
adstxt/ --- Helper scripts, spiders, and other Scrapy files.
adstxtui/ --- All UI-related files.
archives/ --- Old archived code, kept for reference.
crawl.sh --- Shell script to run an individual spider (see the sketch after this listing).
docs/ --- Reference documents.
requirements.txt --- List of Python libraries required by this app.
LICENSE --- License file.
pencilproject/ --- Rudimentary wireframes made with Pencil Project.
setup_app.sh --- Shell script to set up the entire web application.
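As a rough illustration of what a single crawl produces, the sketch below fetches a domain's ads.txt with `requests` and writes its comma-separated entries to a CSV file. The real spiders are Scrapy-based and live in adstxt/, so treat this as a simplified stand-in; the HTTPS-only URL and the `<domain>.csv` filename are assumptions.

```python
# Simplified stand-in for one domain's crawl: fetch ads.txt and write
# its entries as CSV rows. The real spiders in adstxt/ use Scrapy;
# the URL scheme and output filename here are assumptions.
import csv
import requests

def crawl_ads_txt(domain):
    response = requests.get(f"https://{domain}/ads.txt", timeout=10)
    response.raise_for_status()
    with open(f"{domain}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for raw in response.text.splitlines():
            line = raw.split("#", 1)[0].strip()  # drop comments
            if line:
                # ads.txt records are already comma-separated fields
                writer.writerow([field.strip() for field in line.split(",")])

crawl_ads_txt("domain1.com")
```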