Skip to content
This repository has been archived by the owner on Jan 16, 2020. It is now read-only.

enricogenauck/object-scraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Object Scraper

Description

Object scraper is a thin wrapper for hpricot to enable receipt-like extraction of ruby objects from various web sites.

Install

Gem

gem install object-scraper --source http://gemcutter.org

Rails

config.gem 'object-scraper', :source => 'http://gemcutter.org'

Example

class Entry < Object
  attr_accessor :text, :date
end

uri     = "http://twitter.com/twitter"
pattern = ".status"

Scraper.define(:twitter, :class => :entry, :source => uri, :node => pattern) do |s|
  s.text { |node| node.at(".entry-content").inner_html }
  s.date { |node| DateTime.parse(node.at(".timestamp")[:data][/\'.*\'/].delete("'")) }
end

@objects = Scraper.parse(:twitter)

If you define multiple scrapers, you can collect all their objects with one simple method

@objects = Scraper.parse_all

Advanced Example

It is possible to use other existing HTML parsers instead of hpricot. Just overwrite the according proc object.

require 'nokogiri'
Scraper.scrape_source_with = Proc.new { |source| Nokogiri::HTML(source) }

Scraper.define(:twitter, :class => :entry, :source => uri, :node => pattern) do |s|
  # initialize your objects here accordingly
end

Rails

All scraper definitions sitting in RAILS_ROOT/scrapers will be taken into account automatically when you use object-scraper as a gem in your rails project.

Author

About

Recipe like object extraction from HTML sources

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages