industrydive/scrapy-scripts
============================

Repository for scrapy spiders used for scraping web pages.
Getting started
============================

Create and activate a Python virtual environment, then install the requirements::

    $ pip install -r requirements.txt

Running a Tradeshow Scrape
============================

The generic spider at scraper/spiders/generic_tradeshow_spider.py should handle most of the scraping requests we get. If you need to scrape a site that follows the same format as http://s23.a2zinc.net/clients/lrp/hrtechnologyconference2017/Public/exhibitors.aspx?Index=All or http://events.pennwell.com/DTECH2018/Public/exhibitors.aspx?_ga=2.91461086.575732828.1507662078-248451487.1507662078 then you should be able to use it as-is by running the wrapper script::

    $ ./scrape.sh <your-start-url>

This script runs the "tradeshow" spider and writes CSV output to a file named tradeshow-scrape.csv. If tradeshow-scrape.csv already exists, it will be overwritten on each run.

Otherwise, you may need to create a custom spider like the one in nrf2018_custom_spider.py.

Scrapy Basics
============================

Callback functions
----------------------------

Each Rule should have a callback function created for it - this is what gets executed on the HTML of the resulting page when scrapy follows a link. (See the spider sketch under "Example sketches" below.)

Items
----------------------------

For each type of item you need to capture information about, add an Item to scraper/items.py. This is where you define the fields you want to store data in for each item and how to populate those fields. Your Rule callback functions should instantiate and populate these items. (See the Item sketch under "Example sketches" below.)

Running a scraper
============================

::

    $ cd scraper
    $ scrapy crawl nameofyourspider \
        --output=location-of-output-file --output-format=[csv,jl,etc]

For example::

    $ scrapy crawl hrtech2017 --output-format=csv --output=hrtech2017.csv

For more details and tutorials, see the scrapy documentation: https://doc.scrapy.org/
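
Example sketches
============================

A minimal sketch of an Item like the ones described under "Items" above. The class and field names here are illustrative assumptions, not the actual contents of scraper/items.py::

    # Illustrative sketch of scraper/items.py; the real class and field
    # names in this repo may differ.
    import scrapy


    class ExhibitorItem(scrapy.Item):
        # Define one Field per piece of data you want to capture; each
        # populated field becomes a column in the CSV output.
        name = scrapy.Field()
        booth = scrapy.Field()
        website = scrapy.Field()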
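
A minimal sketch of the Rule/callback pattern described under "Callback functions" above, using the Item sketched above. The spider name, start URL, link-extractor pattern, and CSS selectors are all assumptions for illustration, not the repo's actual code. Note that ``.get()`` requires Scrapy >= 1.8; on older versions use ``.extract_first()``::

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    from scraper.items import ExhibitorItem  # hypothetical import path


    class ExampleTradeshowSpider(CrawlSpider):
        name = 'exampletradeshow'
        start_urls = ['http://example.com/Public/exhibitors.aspx?Index=All']

        rules = (
            # When scrapy follows a link matching this pattern, the callback
            # named below is executed on the HTML of the resulting page.
            Rule(
                LinkExtractor(allow=r'eBooth\.aspx'),  # assumed detail-page pattern
                callback='parse_exhibitor',
            ),
        )

        def parse_exhibitor(self, response):
            # Instantiate and populate an Item from the detail page;
            # the selectors here are placeholders.
            item = ExhibitorItem()
            item['name'] = response.css('h1::text').get()
            item['website'] = response.css('a.website::attr(href)').get()
            yield item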
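
If you prefer starting a crawl from Python rather than the scrapy CLI, something like the following sketch should work when run from inside the scraper/ project directory. This script does not exist in this repo, and the FEEDS setting requires Scrapy >= 2.1 (older versions use FEED_URI and FEED_FORMAT instead)::

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    # Equivalent to --output=hrtech2017.csv --output-format=csv on the CLI.
    settings.set('FEEDS', {'hrtech2017.csv': {'format': 'csv'}})

    process = CrawlerProcess(settings)
    process.crawl('hrtech2017')  # spider name, resolved by the project's spider loader
    process.start()              # blocks until the crawl finishes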