Unit 3: Running Spiders in the Cloud

This unit describes how to deploy Scrapy spiders to Scrapy Cloud and how to make the most of the platform.

Topics

  • Introduction to Scrapy Cloud
  • Deploying spiders to Scrapy Cloud
  • Controlling spiders via command line
  • UI walkthrough

Check out the slides for this unit

Sample Spiders

  1. A simple project to demonstrate deploy: p1_first_deploy
  2. A project to deploy with dependencies: p2_dependencies
  3. A project to deploy with Python Scripts: p3_scripts

Hands-on

1. Deploy the books crawler

Deploy the crawler for books.toscrape.com built in unit 2 to Scrapy Cloud.

a. Run the spider without touching any settings
b. Run the spider again, but now with DOWNLOAD_DELAY = 1 set via the web UI

Check out the project once you're done.
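The deployment itself can be done with the shub command-line tool. A minimal sketch of the workflow (the project ID 12345 is a placeholder for your own Scrapy Cloud project ID):

```shell
# Install Zyte's deploy tool and authenticate with your API key
pip install shub
shub login

# From the project's root directory, deploy to Scrapy Cloud
# (12345 is a placeholder project ID)
shub deploy 12345

# Schedule a run of the books spider on Scrapy Cloud
shub schedule 12345/books
```

For step (b), DOWNLOAD_DELAY can be set per project or per spider in the Settings page of the web UI, so no redeploy is needed between the two runs.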

2. Reddit Ranker

Create a crawler to fetch the 100 hottest submissions from reddit.com/r/programming (to run on Scrapy Cloud).

After that, create a CLI app to fetch the scraped data from Scrapy Cloud and list the top 10 submissions from the latest crawl, based on the score below:

new_score = S * C * K

    S → current score on reddit
    C → number of comments
    K → original poster's comments karma
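As a sketch of the ranking step only: assuming each scraped item is a dict with `score`, `num_comments`, and `op_comment_karma` fields (hypothetical names, use whatever your spider emits), the CLI app could rank the latest crawl's items like this. Fetching the items from Scrapy Cloud (e.g. with the python-scrapinghub client) is left out; made-up sample data stands in for it.

```python
def new_score(item):
    """Composite score: reddit score * number of comments * OP's comment karma."""
    return item["score"] * item["num_comments"] * item["op_comment_karma"]

def top_submissions(items, n=10):
    """Return the n highest-ranked submissions by the composite score."""
    return sorted(items, key=new_score, reverse=True)[:n]

# Made-up sample data standing in for items fetched from Scrapy Cloud:
items = [
    {"title": "A", "score": 100, "num_comments": 10, "op_comment_karma": 5},
    {"title": "B", "score": 50, "num_comments": 40, "op_comment_karma": 2},
    {"title": "C", "score": 10, "num_comments": 5, "op_comment_karma": 1},
]
for item in top_submissions(items, n=2):
    print(item["title"], new_score(item))
```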

Check out the project once you're done.

References