
Save your favourite blog or podcast websites as RSS feeds

blog_crawler is a set of ruby scripts that can turn your favourite blog or podcast websites into RSS feeds.

See examples of RSS feeds (feeds.txt, slice-[0-9].xml) & podcast URLs (mp3-urls.txt) generated by blog_crawler here.

Blogs & Podcasts with built-in support

How to crawl websites and generate RSS feeds with blog_crawler

Let's use the blog Coding Horror as an example.

1. Create custom page and post classes, along with a config.json, for the website

Coding Horror page and post classes:

# blogs/coding_horror/coding_horror_page.rb
class CodingHorrorPage < Page
  def initialize(page_url, page_html)
    super(page_url, page_html)
    # Coding Horror exposes next/previous post links in its "read next" navigation
    next_page_url_node = page_html.css(".left .read-next-title a").first
    previous_page_url_node = page_html.css(".right .read-next-title a").first
    @next_page_url = "https://blog.codinghorror.com#{ next_page_url_node.attributes["href"].value }" unless next_page_url_node.nil?
    @previous_page_url = "https://blog.codinghorror.com#{ previous_page_url_node.attributes["href"].value }" unless previous_page_url_node.nil?
    # on this blog each page is itself a single post
    @post_urls = [@page_url]
  end
end

# blogs/coding_horror/coding_horror_posts.rb
class CodingHorrorPost < Post
  def initialize(post_url, post_html)
    super(post_url, post_html)
    # extract the fields needed for an RSS item from the post's HTML
    @title = post_html.css(".post-title").text
    @published_date = post_html.at("meta[property='article:published_time']")['content']
    @content = post_html.css(".post-content").children
    @author = "Jeff Atwood"
  end
end
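
The subclasses above call super and set fields such as @next_page_url and @post_urls, so the base Page and Post classes (part of blog_crawler itself) presumably just hold the URL and the parsed HTML. A minimal sketch of what they might look like, purely for orientation:

# Sketch only -- the real base classes live in blog_crawler and may differ.
class Page
  attr_reader :page_url, :next_page_url, :previous_page_url, :post_urls

  def initialize(page_url, page_html)
    @page_url = page_url    # URL of the crawled page
    @page_html = page_html  # parsed (Nokogiri) HTML of the page
    @post_urls = []         # subclasses fill this with links to posts
  end
end

class Post
  attr_reader :post_url, :title, :published_date, :content, :author

  def initialize(post_url, post_html)
    @post_url = post_url
    @post_html = post_html  # parsed (Nokogiri) HTML of the post
  end
end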

Config file:

// blogs/coding_horror/config.json
{
  "title": "Coding Horror",
  "description": "programming and human factors",
  "homepage": "https://blog.codinghorror.com",
  "direction": "previous",
  "remote_base_url": "https://raw.githubusercontent.com/goooooouwa/rss-feeds/master/coding_horror",
  "initial_page": "https://blog.codinghorror.com/building-a-pc-part-ix-downsizing/"
}

2. Fetch all pages you want to crawl from the website

echo "[]" > ./out/pages.json
ruby ./bin/run.rb page coding_horror  # saves crawled pages as JSON in ./out/pages.json
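
Under the hood, the crawler presumably starts at initial_page and keeps following next_page_url or previous_page_url, depending on the direction setting in config.json, until no further page is found. A hedged sketch of such a loop using open-uri and Nokogiri (fetch_pages and the JSON layout are illustrative, not blog_crawler's actual API):

# Illustrative sketch only -- not blog_crawler's actual implementation.
require "json"
require "open-uri"
require "nokogiri"

def fetch_pages(config)
  pages = []
  url = config["initial_page"]
  while url
    html = Nokogiri::HTML(URI.open(url))
    page = CodingHorrorPage.new(url, html)
    pages << { "page_url" => page.page_url, "post_urls" => page.post_urls }
    # follow the configured crawl direction ("previous" walks back in time)
    url = config["direction"] == "previous" ? page.previous_page_url : page.next_page_url
  end
  File.write("./out/pages.json", JSON.pretty_generate(pages))
end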

3. Fetch all posts within the pages you want to crawl

echo "[]" > ./out/posts.json
ruby ./bin/run.rb post coding_horror  # saves posts found in pages.json as JSON in ./out/posts.json
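
This step presumably walks every post URL collected in pages.json, downloads the post, and serialises the fields extracted by the Post subclass. A hedged sketch, assuming the JSON layout used above:

# Illustrative sketch only -- not blog_crawler's actual implementation.
require "json"
require "open-uri"
require "nokogiri"

def fetch_posts
  pages = JSON.parse(File.read("./out/pages.json"))
  posts = pages.flat_map { |page| page["post_urls"] }.map do |post_url|
    html = Nokogiri::HTML(URI.open(post_url))
    post = CodingHorrorPost.new(post_url, html)
    {
      "title"          => post.title,
      "published_date" => post.published_date,
      "author"         => post.author,
      "content"        => post.content.to_html,
      "url"            => post_url
    }
  end
  File.write("./out/posts.json", JSON.pretty_generate(posts))
end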

4. Generate RSS feeds from the crawled posts

ruby ./bin/run.rb render coding_horror   # generates and saves RSS feeds `slice-[0-9].xml` in config["out_dir"] (each slice contains a set number of posts)
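
Feed generation itself can be done with Ruby's built-in rss library. A hedged sketch of how posts.json might be sliced into slice-0.xml, slice-1.xml, and so on (render_feeds and the slice size of 50 are assumptions, not blog_crawler's actual code):

# Illustrative sketch only -- not blog_crawler's actual implementation.
require "json"
require "rss"
require "time"

def render_feeds(config, slice_size = 50)
  posts = JSON.parse(File.read("./out/posts.json"))
  posts.each_slice(slice_size).with_index do |slice, i|
    feed = RSS::Maker.make("2.0") do |maker|
      maker.channel.title = config["title"]
      maker.channel.description = config["description"]
      maker.channel.link = config["homepage"]
      slice.each do |post|
        maker.items.new_item do |item|
          item.title = post["title"]
          item.link = post["url"]
          item.description = post["content"]
          item.pubDate = Time.parse(post["published_date"])
        end
      end
    end
    File.write("./out/slice-#{i}.xml", feed.to_s)
  end
end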

What to do next?

You can use blog2kindle to turn the RSS feeds into ebooks that can be read in ebook readers such as Kindle or Apple Books. See how it works here.
