blog_crawler is a set of Ruby scripts that turn your favourite blog or podcast websites into RSS feeds.
See examples of RSS feeds (feeds.txt, slice-[0-9].xml) and podcast URLs (mp3-urls.txt) generated by blog_crawler here.
Let's use the Coding Horror blog as an example.
Coding Horror page and post objects:
# blogs/coding_horror/coding_horror_page.rb
class CodingHorrorPage < Page
  def initialize(page_url, page_html)
    super(page_url, page_html)
    next_page_url_node = page_html.css(".left .read-next-title a").first
    previous_page_url_node = page_html.css(".right .read-next-title a").first
    @next_page_url = "https://blog.codinghorror.com#{ next_page_url_node.attributes["href"].value }" unless next_page_url_node.nil?
    @previous_page_url = "https://blog.codinghorror.com#{ previous_page_url_node.attributes["href"].value }" unless previous_page_url_node.nil?
    @post_urls = [@page_url]
  end
end
# blogs/coding_horror/coding_horror_posts.rb
class CodingHorrorPost < Post
  def initialize(post_url, post_html)
    super(post_url, post_html)
    @title = post_html.css(".post-title").text
    @published_date = post_html.at("meta[property='article:published_time']")['content']
    @content = post_html.css(".post-content").children
    @author = "Jeff Atwood"
  end
end
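Both classes call `super` and rely on base classes that are not shown here. As a rough sketch only, assuming the base classes simply store the URL and parsed HTML and expose the fields the subclasses set, they might look something like this. `SketchPage`, `SketchPost`, `SinglePostPage`, and `to_h` are hypothetical names for illustration, not blog_crawler's actual code:

```ruby
require "json"

# Hypothetical stand-in for blog_crawler's Page base class (assumption).
class SketchPage
  attr_reader :page_url, :next_page_url, :previous_page_url, :post_urls

  def initialize(page_url, page_html)
    @page_url = page_url
    @page_html = page_html
    @next_page_url = nil
    @previous_page_url = nil
    @post_urls = []
  end

  # Serializable form, e.g. for one entry in pages.json (assumed layout).
  def to_h
    {
      "page_url" => @page_url,
      "next_page_url" => @next_page_url,
      "previous_page_url" => @previous_page_url,
      "post_urls" => @post_urls
    }
  end
end

# Hypothetical stand-in for the Post base class (assumption).
class SketchPost
  attr_reader :post_url, :title, :published_date, :content, :author

  def initialize(post_url, post_html)
    @post_url = post_url
    @post_html = post_html
  end
end

# Mirrors CodingHorrorPage's idea that each page holds exactly one post.
class SinglePostPage < SketchPage
  def initialize(page_url, page_html)
    super(page_url, page_html)
    @post_urls = [@page_url]
  end
end

page = SinglePostPage.new("https://example.com/a-post/", "<html></html>")
puts JSON.generate(page.to_h)
```

The subclass only needs to fill in the instance variables after calling `super`, which is the same pattern the Coding Horror classes follow.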
Config file:
// blogs/coding_horror/config.json
{
  "title": "Coding Horror",
  "description": "programming and human factors",
  "homepage": "https://blog.codinghorror.com",
  "direction": "previous",
  "remote_base_url": "https://raw.githubusercontent.com/goooooouwa/rss-feeds/master/coding_horror",
  "initial_page": "https://blog.codinghorror.com/building-a-pc-part-ix-downsizing/"
}
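The `direction` field presumably tells the crawler which link to follow from `initial_page`. Here is a minimal sketch under that assumption, using an in-memory hash in place of real HTTP requests. The `crawl` helper, the `FAKE_SITE` data, and the example.com URLs are hypothetical:

```ruby
require "json"

# A trimmed-down config with only the two fields this sketch uses.
config = JSON.parse(<<~JSON)
  {
    "direction": "previous",
    "initial_page": "https://example.com/post-3/"
  }
JSON

# Fake site: each URL maps to its neighbouring links (illustration only).
FAKE_SITE = {
  "https://example.com/post-3/" => { "previous" => "https://example.com/post-2/", "next" => nil },
  "https://example.com/post-2/" => { "previous" => "https://example.com/post-1/", "next" => "https://example.com/post-3/" },
  "https://example.com/post-1/" => { "previous" => nil, "next" => "https://example.com/post-2/" }
}.freeze

# Follow the link named by `direction` until there is none left.
# The visited check guards against link cycles.
def crawl(site, start_url, direction)
  visited = []
  url = start_url
  while url && !visited.include?(url)
    visited << url
    url = site.fetch(url)[direction]
  end
  visited
end

pages = crawl(FAKE_SITE, config["initial_page"], config["direction"])
puts pages
```

With `"direction": "previous"` and the newest post as `initial_page`, the walk runs from newest to oldest. Setting it to `"next"` with the oldest post as the starting point would walk the other way.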
echo "[]" > ./out/pages.json
ruby ./bin/run.rb page coding_horror # will save pages as json in pages.json
echo "[]" > ./out/posts.json
ruby ./bin/run.rb post coding_horror # will save posts found in pages.json as json in posts.json
ruby ./bin/run.rb render coding_horror # will generate RSS feeds `slice-[0-9].xml` and save them in config["out_dir"] (each slice holds a fixed number of posts)
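The render step splits the posts into numbered slices. One way to picture that, assuming a fixed slice size and zero-based `slice-N.xml` file names, is the sketch below. The `SLICE_SIZE` value and the `slice_filenames` helper are hypothetical, and the real script may number or size its slices differently:

```ruby
# Assumed number of posts per slice (illustration only).
SLICE_SIZE = 3

# Group posts into chunks and map each chunk to a slice file name.
def slice_filenames(posts, slice_size)
  posts.each_slice(slice_size).each_with_index.map do |slice, index|
    ["slice-#{index}.xml", slice]
  end.to_h
end

posts = (1..7).map { |n| "Post #{n}" }
slices = slice_filenames(posts, SLICE_SIZE)

slices.each do |filename, slice_posts|
  puts "#{filename}: #{slice_posts.join(', ')}"
end
```

Splitting a long archive this way keeps each feed file small, at the cost of subscribing to several feeds for one blog.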
You can use blog2kindle to turn the RSS feeds into ebooks that can be read on ebook readers such as Kindle or Apple Books. See how it works here.