py-configurable-scraper

This Python script scrapes any website you add a scraping configuration for and stores the results in a MongoDB database (saving to MongoDB will be removed).

Setup

The following dependencies must be installed to run this:
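The dependency list itself didn't survive extraction, so the exact packages are not confirmed here. Judging by the BeautifulSoup (`bs`) reference in the example map's comment and the MongoDB storage mentioned above, a plausible `requirements.txt` might look like:

```
beautifulsoup4
pymongo
```

Treat these as guesses rather than the repo's actual pinned dependencies.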

Usage

For each website you want to scrape, you have to specify a URL in the sites-to-scrape dict.
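The dict itself isn't shown in this README, so as a minimal sketch (the variable name `sites_to_scrape` is an assumption), an entry might look like:

```python
# Hypothetical shape of the sites-to-scrape dict: one entry per site,
# keyed by the same name used in the site's scraping map.
sites_to_scrape = {
    'webtickets': 'http://www.webtickets.co.za/',
}
```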

Then you give the scraper a map to step through, telling it which data you want extracted. For example, if we want to scrape 'http://www.webtickets.co.za/' for all the shows they're selling tickets for, our map will look like this:

```python
'webtickets': {
    'main_content': {
        'div': {
            'properties': {'id': 'event_list'}
        }
    },
    'event_card': {
        'a': {}
    },
    'paging': {
        'type': 'post',
        'vars': {
            'currentpage': 'page'
        }
    },
    'card_map': {
        'name': {
            'div': {
                'properties': {'class': 'mainContentBlockTextHeading'}
            }
        },
        'image': {
            'img': {
                'attr': 'src',
                'prepend': 'https://www.webtickets.co.za/'
            }
        },
        'link': {
            'attr': 'href',
            'prepend': 'https://www.webtickets.co.za'
        },
        'date': {
            'div': {
                'properties': {'class': 'mainContentBlockText'},
                'child': {
                    'div': {
                        # tell bs to use the second find of this yield;
                        # if there's an index AND a child, we first resolve
                        # the child, THEN apply the index
                        'index': 1
                    }
                }
            }
        }
    }
}
```
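The repo's actual traversal code isn't reproduced in this README. As a sketch of the `child`/`index` semantics described in the map's comment (resolve `child` first, then apply `index`), the `resolve` helper below is hypothetical and uses plain dicts in place of BeautifulSoup tags; handling for the `attr`/`prepend` specs is omitted:

```python
def resolve(node, spec):
    """Apply one tag spec to a dict-based node tree.

    The spec's single key is a tag name; 'properties' filters on the
    node's attributes, 'child' recurses into matched nodes, and 'index'
    picks one result AFTER the child step, per the comment in the map.
    """
    tag, rules = next(iter(spec.items()))
    props = rules.get('properties', {})
    found = [c for c in node.get('children', [])
             if c['tag'] == tag
             and all(c.get(k) == v for k, v in props.items())]
    if 'child' in rules:
        # Resolve the child spec inside each match, then flatten.
        nested = [resolve(f, rules['child']) for f in found]
        found = [f for sub in nested for f in sub]
    if 'index' in rules:
        found = [found[rules['index']]]
    return found

# Toy document mirroring the 'date' entry above: the second inner div
# holds the date text, so index 1 selects it.
doc = {'children': [{'tag': 'div', 'class': 'mainContentBlockText',
                     'children': [{'tag': 'div', 'text': 'Venue'},
                                  {'tag': 'div', 'text': '2016-05-01'}]}]}
spec = {'div': {'properties': {'class': 'mainContentBlockText'},
                'child': {'div': {'index': 1}}}}
print(resolve(doc, spec)[0]['text'])  # → 2016-05-01
```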

About

My attempt at building a configurable scraper... still busy
