Unit 2: Navigating websites with Scrapy

This unit builds upon the previous one and covers how to crawl websites with Scrapy. Crawling a website essentially means following the links found on its pages, so that the spider visits every page it needs.
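
To make this concrete, here is a minimal sketch of a link-following spider for quotes.toscrape.com; the selectors reflect that site's markup at the time of writing and may need adjusting:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # Scrape the quotes on the current page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "Next" pagination link, if there is one;
        # response.follow resolves relative URLs for us
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

The spider keeps yielding new requests until no "Next" link is found, which is the essence of crawling.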

Topics

  • Link crawling
  • Crawling settings (see the sketch after this list)
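
As a rough illustration of the crawling settings topic, the sketch below overrides a few crawl-related settings for a single spider via `custom_settings`; the values shown are arbitrary examples, not recommendations:

```python
import scrapy


class PoliteSpider(scrapy.Spider):
    name = "polite"
    start_urls = ["http://quotes.toscrape.com"]

    # Per-spider overrides of the project-wide settings
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,                # wait ~1 second between requests
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,  # limit parallelism per domain
        "DEPTH_LIMIT": 3,                     # don't follow links deeper than 3 hops
        "ROBOTSTXT_OBEY": True,               # respect the site's robots.txt
    }

    def parse(self, response):
        yield {"url": response.url}
```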

Check out the slides for this unit

Sample Spiders

  1. Spider that follows pagination links to scrape quotes.toscrape.com: spider_1_quotes_pagination.py
  2. Spider that extracts authors' data from detail pages in quotes.toscrape.com (this pattern is sketched below): spider_2_authors_details.py
  3. Spider that extracts the quotes alongside the authors' information from quotes.toscrape.com: spider_3_quotes_authors.py
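
As a rough approximation of the detail-page pattern used by the second spider (not the repository's actual code), a spider can follow the "(about)" link next to each author and parse the resulting page in a separate callback; Scrapy's duplicate filter takes care of authors who appear more than once:

```python
import scrapy


class AuthorsSpider(scrapy.Spider):
    name = "authors"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # Follow each author's "(about)" link to their details page;
        # duplicate requests are dropped automatically by Scrapy
        for href in response.css(".author + a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_author)
        # Keep paginating through the quote listing
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_author(self, response):
        yield {
            "name": response.css("h3.author-title::text").get(),
            "birth_date": response.css(".author-born-date::text").get(),
        }
```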

Hands-on

1. Books Crawler

Build a spider to extract the title, price (as a float) and stock availability of all 1,000 books listed on books.toscrape.com.

Check out the spider once you're done.
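
If you get stuck, here is one possible starting point; the selectors match books.toscrape.com's markup at the time of writing, but verify them against the live site:

```python
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://books.toscrape.com"]

    def parse(self, response):
        # Follow every book on the listing page to its details page
        for href in response.css("article.product_pod h3 a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_book)
        # Then move on to the next listing page, if any
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response):
        # Availability text looks like "In stock (22 available)"
        stock = response.css("p.availability::text").re_first(r"\d+")
        yield {
            "title": response.css("div.product_main h1::text").get(),
            # Price arrives as e.g. "£51.77"; strip the symbol and cast
            "price": float(
                response.css("div.product_main p.price_color::text").get().replace("£", "")
            ),
            "stock": int(stock) if stock else 0,
        }
```

You could run it with `scrapy runspider books.py -o books.json` (filename assumed) and check that the output contains 1,000 items.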

2. Blog.scrapinghub.com Crawler

Build a spider that extracts the following data from blog.scrapinghub.com posts: post title, URL, author name, date and tags. Your spider should skip posts that have no associated tags.

Check out the spider once you're done.
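
A rough skeleton for this exercise is sketched below. Be warned that every CSS class in it is an assumption for illustration: blog.scrapinghub.com's markup has changed over time, so inspect the live pages and adjust the selectors accordingly:

```python
import scrapy


class BlogSpider(scrapy.Spider):
    name = "blog"
    start_urls = ["https://blog.scrapinghub.com"]

    def parse(self, response):
        # NOTE: all class names below are hypothetical placeholders;
        # replace them with the ones found in the real markup
        for post in response.css("div.post-item"):
            tags = post.css("a.tag::text").getall()
            if not tags:
                continue  # the exercise asks us to skip untagged posts
            yield {
                "title": post.css("h2 a::text").get(),
                "url": post.css("h2 a::attr(href)").get(),
                "author": post.css("span.author::text").get(),
                "date": post.css("span.date::text").get(),
                "tags": tags,
            }
        next_page = response.css("a.next-posts-link::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```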
