A library to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks.
- Getting Started
- Installation
- Usage
- Metadata
- How it works
- Customization
- Rules
- API
- Environment Variables
- Comparison
- License
metascraper is a library to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks.
It follows a few principles:
- Have a high accuracy for online articles by default.
- Make it simple to add new rules or override existing ones.
- Don't restrict rules to CSS selectors or text accessors.
$ npm install metascraper --save
Let's extract accurate information from the following article:
```js
const metascraper = require('metascraper')
const got = require('got')

const targetUrl = 'http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance'

;(async () => {
  const { body: html, url } = await got(targetUrl)
  const metadata = await metascraper({ html, url })
  console.log(metadata)
})()
```
Where the output will be something like:
```json
{
  "author": "Ellen Huet",
  "date": "2016-05-24T18:00:03.894Z",
  "description": "The HR startups go to war.",
  "image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v1/-1x-1.jpg",
  "publisher": "Bloomberg.com",
  "title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance",
  "url": "http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance"
}
```
Here is a list of the metadata that metascraper collects by default:
- `author` — eg. *Noah Kulwin*. A human-readable representation of the author's name.
- `date` — eg. *2016-05-27T00:00:00.000Z*. An ISO 8601 representation of the date the article was published.
- `description` — eg. *Venture capitalists are raising money at the fastest rate...*. The publisher's chosen description of the article.
- `video` — eg. *https://assets.entrepreneur.com/content/preview.mp4*. A video URL that best represents the article.
- `image` — eg. *https://assets.entrepreneur.com/content/3x2/1300/20160504155601-GettyImages-174457162.jpeg*. An image URL that best represents the article.
- `lang` — eg. *en*. An ISO 639-1 representation of the content language.
- `logo` — eg. *https://entrepreneur.com/favicon180x180.png*. An image URL that best represents the publisher's brand.
- `publisher` — eg. *Fast Company*. A human-readable representation of the publisher's name.
- `title` — eg. *Meet Wall Street's New A.I. Sheriffs*. The publisher's chosen title of the article.
- `url` — eg. *http://motherboard.vice.com/read/google-wins-trial-against-oracle-saves-9-billion*. The URL of the article.
?> The configuration file follows the same approach as projects like Babel or Prettier.
metascraper is built out of rules.
It was designed to be easy to adapt. You can compose your own transformation pipeline using existing rules or write your own.
Rules are collections of HTML selectors for a specific property. When you load the library, it implicitly loads the core rules.
Each rule set declares a list of selectors used to resolve a specific value.
Rules are applied in priority order: the first rule that successfully resolves the value stops the evaluation of the remaining rules for that property. Rules are intentionally sorted from most specific to most generic.
Rules act as fallbacks for one another:
- If the first rule fails, fall back to the second rule.
- If the second rule fails, try the third rule.
- etc.
metascraper continues until all rules are exhausted or the first rule that resolves the value is found.
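The priority-with-fallback behavior can be sketched in plain JavaScript. This is a simplified illustration, not metascraper's actual internals; the rule functions and inputs below are hypothetical:

```javascript
// A minimal sketch of prioritized rules with fallback: the first rule
// that resolves a non-empty value wins and the rest are skipped.
const resolveValue = (rules, input) => {
  for (const rule of rules) {
    const value = rule(input)
    if (value !== undefined && value !== null && value !== '') return value
  }
}

// Hypothetical rules for `title`, ordered from specific to generic:
const titleRules = [
  ({ meta }) => meta['og:title'], // Open Graph first
  ({ meta }) => meta['twitter:title'], // then Twitter Cards
  ({ title }) => title // finally, the plain <title> tag
]

// `og:title` is missing, so the second rule resolves the value:
const input = { meta: { 'twitter:title': 'Hello' }, title: 'Fallback' }
console.log(resolveValue(titleRules, input)) // → 'Hello'
```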
When you call metascraper in your code, a set of core rules are loaded by default.
Although these rules are sufficient for most cases, metascraper was designed to be easy to adapt by loading additional or custom rule sets.
There are two approaches for doing that.
The first consists of declaring a configuration file that contains the names of the rule sets — npm packages that metascraper will load automagically.
The configuration file can be declared via:
- A `.metascraperrc` file, written in YAML or JSON, with optional extensions: `.yaml`, `.yml`, `.json` and `.js`.
- A `metascraper.config.js` file that exports an object.
- A `metascraper` key in your `package.json` file.
The configuration file will be resolved starting from the current working directory, searching up the file tree until a config file is (or isn't) found.
The order in which rules are loaded is important: only the first rule that resolves the value will be applied.
Declare an array of rules, specifying each rule as the string name of the module to load.
```json
// .metascraperrc
{
  "rules": [
    "metascraper-author",
    "metascraper-date",
    "metascraper-description",
    "metascraper-image",
    "metascraper-lang",
    "metascraper-logo",
    "metascraper-publisher",
    "metascraper-title",
    "metascraper-url"
  ]
}
```
```yaml
# .metascraperrc
rules:
  - metascraper-author
  - metascraper-date
  - metascraper-description
  - metascraper-image
  - metascraper-lang
  - metascraper-logo
  - metascraper-publisher
  - metascraper-title
  - metascraper-url
```
Additionally, you can pass specific configuration per module using an object declaration:
```json
// .metascraperrc
{
  "rules": [
    "metascraper-author",
    "metascraper-date",
    "metascraper-description",
    "metascraper-image",
    "metascraper-lang",
    "metascraper-logo",
    {
      "metascraper-clearbit-logo": {
        "format": "jpg"
      }
    },
    "metascraper-publisher",
    "metascraper-title",
    "metascraper-url"
  ]
}
```
```yaml
# .metascraperrc
rules:
  - metascraper-author
  - metascraper-date
  - metascraper-description
  - metascraper-image
  - metascraper-lang
  - metascraper-clearbit-logo:
      format: jpg
  - metascraper-publisher
  - metascraper-title
  - metascraper-url
```
If you need more control, you can load the rule sets directly by calling the metascraper constructor's `.load` method:
```js
const metascraper = require('metascraper').load([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit-logo')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])
```
Again, the order in which rules are loaded is important: only the first rule that resolves the value will be applied.
Use the first parameter to pass custom options if you need them:
```js
const metascraper = require('metascraper').load([
  require('metascraper-clearbit-logo')({
    size: 256,
    format: 'jpg'
  })
])
```
This way, you are not limited to loading npm modules as rule sets. For example, you can load a custom file of rules:
```js
const metascraper = require('metascraper').load([
  require('./my-custom-rules-file')()
])
```
?> Can't find the rule set you want? Open an issue to request it.
These rule sets are shipped with metascraper and loaded by default.
- `metascraper-author`
- `metascraper-date`
- `metascraper-description`
- `metascraper-video`
- `metascraper-image`
- `metascraper-logo`
- `metascraper-publisher`
- `metascraper-title`
- `metascraper-url`
These rule sets are not shipped with metascraper by default and need to be specified using a configuration file.
- `metascraper-amazon`
- `metascraper-clearbit-logo`
- `metascraper-logo-favicon`
- `metascraper-soundcloud`
- `metascraper-youtube`
A rule set is the simplest way to extend metascraper's functionality.
A rule set can add support for one or more properties.
The following schema represents the contract that a rule set needs to follow:
```js
'use strict'

// `opts` can be loaded using the `.metascraperrc` config file
module.exports = opts => {
  // You can define as many props as you want.
  // Props are organized as object keys.
  return ({
    logo: [
      // You can set up more than one rule per prop (priority is important!).
      // Each rule receives as parameters:
      // - `htmlDom`: the cheerio HTML instance.
      // - `url`: the input URL used to extract the content.
      // - `meta`: the current state of the information detected.
      ({ htmlDom: $, meta, url: baseUrl }) => $('meta[property="og:logo"]').attr('content'),
      ({ htmlDom: $, meta, url: baseUrl }) => $('meta[itemprop="logo"]').attr('content')
    ]
  })
}
```
We recommend checking the core rules packages as examples.
html

Required
Type: String

The HTML markup to extract the metadata from.

url

Required
Type: String

The URL associated with the HTML markup.
It is used to resolve relative links that can be present in the HTML markup.
It can also be used as a fallback field by some rules.
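To illustrate why the URL matters: relative links found in the markup are resolved against it, following standard WHATWG URL semantics (the values below are hypothetical):

```javascript
// Resolving relative links found in the HTML against the page URL:
const base = 'http://www.bloomberg.com/news/articles/2016-05-24/some-article'

// A root-relative path resolves against the origin:
console.log(new URL('/favicon.ico', base).toString())
// → 'http://www.bloomberg.com/favicon.ico'

// A plain relative path resolves against the current directory:
console.log(new URL('cover.jpg', base).toString())
// → 'http://www.bloomberg.com/news/articles/2016-05-24/cover.jpg'
```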
Create a new metascraper instance, explicitly declaring the rule sets to be used.
Type: Array
The collection of rule sets to be loaded.
Type: String
Default: process.cwd()
This variable is used to determine where to start searching for a configuration file.
To give you an idea of how accurate metascraper is, here is a comparison of similar libraries:
| | metascraper | html-metadata | node-metainspector | open-graph-scraper | unfluff |
|---|---|---|---|---|---|
| Correct | 95.54% | 74.56% | 61.16% | 66.52% | 70.90% |
| Incorrect | 1.79% | 1.79% | 0.89% | 6.70% | 10.27% |
| Missed | 2.68% | 23.67% | 37.95% | 26.34% | 8.95% |
A big part of the reason for metascraper's higher accuracy is that it relies on a series of fallbacks for each piece of metadata, instead of just looking for the most commonly-used, spec-compliant pieces of metadata, like Open Graph.
metascraper's default settings are targeted specifically at parsing online articles, which is why it can be more highly tuned for that purpose than the other libraries.
If you're interested in the breakdown by individual pieces of metadata, check out the full comparison summary, or dive into the raw result data for each library.
metascraper © Ian Storm Taylor, Released under the MIT License.
Maintained by Kiko Beats with help from contributors.