Normalize url while scraping, add one time database fix helper
robbi5 committed Sep 15, 2014
1 parent 39f1d79 commit 32b4b31
Showing 4 changed files with 17 additions and 1 deletion.
3 changes: 3 additions & 0 deletions Gemfile
@@ -45,3 +45,6 @@ gem 'wombat', '~> 2.2.1'

 # slugs
 gem 'friendly_id', '~> 5.0.0'
+
+# fix urls while scraping
+gem 'addressable', '~> 2.3.6', require: "addressable/uri"
2 changes: 2 additions & 0 deletions Gemfile.lock
@@ -27,6 +27,7 @@ GEM
       minitest (~> 5.1)
       thread_safe (~> 0.1)
       tzinfo (~> 1.1)
+    addressable (2.3.6)
     arel (5.0.1.20140414130214)
     builder (3.2.2)
     coffee-rails (4.0.1)
@@ -149,6 +150,7 @@ PLATFORMS
   ruby

 DEPENDENCIES
+  addressable (~> 2.3.6)
   coffee-rails (~> 4.0.0)
   friendly_id (~> 5.0.0)
   jbuilder (~> 2.0)
9 changes: 9 additions & 0 deletions app/models/paper.rb
@@ -30,4 +30,13 @@ def normalize_friendly_id(value)
   def full_reference
     legislative_term.to_s + '/' + reference.to_s
   end
+
+
+  # helper method to fix non-standard urls in the database
+  # apply it with: Paper.find_each(&:normalize_url)
+  def normalize_url
+    normalized_url = Addressable::URI.parse(self.url).normalize.to_s
+    write_attribute(:url, normalized_url)
+    save!
+  end
 end
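
Note on the helper above: as far as I can tell, `Addressable::URI.parse(nil)` returns nil, so a row without a url would raise on the chained `.normalize` and abort a `find_each` run partway. Below is a minimal sketch of a guarded variant; it is not part of this commit. Assumptions: `PaperStub` is a hypothetical plain Struct standing in for the ActiveRecord model, and Ruby's standard-library `URI` is swapped in for Addressable so the snippet runs without gems. The stdlib parser is stricter and rejects some inputs that Addressable accepts, so the rescue branch stands in for those cases.

```ruby
require "uri"

# Hypothetical stand-in for the Paper model (a Struct, not ActiveRecord).
PaperStub = Struct.new(:url) do
  # Returns true when the stored URL was rewritten, false when nothing changed.
  def normalize_url
    # Guard: rows without a URL are skipped instead of raising.
    return false if url.nil? || url.strip.empty?

    normalized = URI.parse(url.strip).normalize.to_s
    return false if normalized == url

    self.url = normalized
    true
  rescue URI::InvalidURIError
    # Assumption: leave unparseable values untouched so a batch run continues.
    false
  end
end

papers = [
  PaperStub.new("HTTP://Example.COM/Docs"), # scheme and host get lowercased
  PaperStub.new(nil),                       # skipped by the guard
  PaperStub.new("http://example.com/a b"),  # rejected by the stdlib parser
]

# Mirrors Paper.find_each(&:normalize_url), counting how many rows changed.
changed = papers.count(&:normalize_url)
```

In the real model the same guard could sit in front of the `Addressable::URI.parse` call, with `save!` run only when the value actually changed.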
4 changes: 3 additions & 1 deletion app/scrapers/bayern_landtag_scraper.rb
@@ -44,7 +44,9 @@ def scrape
   Date.parse(text.match(/([\d\.]+)$/)[1]) unless text.nil?
 end
 url 'xpath=.//a[not(contains(@href, "LASTFOLDER"))]/@href' do |href|
-  BASE_URL + href unless href.nil?
+  unless href.nil?
+    Addressable::URI.parse(BASE_URL + href).normalize.to_s
+  end
 end
 #text 'xpath=(following-sibling::tr[2]/td[contains(@class, "pad_bot0")])[1]'
 title 'xpath=following-sibling::tr[2]/td[3]' do |text|
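
The scraper change above keeps the nil guard and normalizes the concatenated URL before it is stored. A minimal sketch of that step in isolation, under stated assumptions: `absolute_url` and `EXAMPLE_BASE_URL` are hypothetical names (the real `BASE_URL` value is not shown in this diff), and the standard-library `URI` stands in for Addressable so the snippet runs without gems.

```ruby
require "uri"

# Hypothetical placeholder; the scraper's actual BASE_URL is not shown in the diff.
EXAMPLE_BASE_URL = "HTTP://WWW.Example.ORG"

# Builds an absolute URL from a scraped href and normalizes it.
# Returns nil when the row had no link, matching the `unless href.nil?` guard.
def absolute_url(href, base: EXAMPLE_BASE_URL)
  return nil if href.nil?

  # Same order as the commit: concatenate first, then parse and normalize.
  URI.parse(base + href).normalize.to_s
end

with_link    = absolute_url("/Dokumente/Drs_1.pdf")
without_link = absolute_url(nil)
```

Normalizing at scrape time means newly stored URLs already have the canonical form, so the one-time `Paper.find_each(&:normalize_url)` fix should only be needed for rows written before this commit.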