# pipeline
The news-please pipeline offers several modules for processing, filtering and storing the results of the crawlers. This section explains the different pipeline modules and their configuration.
- **Module path:** `newscrawler.pipeline.pipelines.ArticleMasterExtractor`
- **Functionality:** The ArticleMasterExtractor bundles several tools into one pipeline module in order to extract metadata from raw articles. Based on the HTML response of the processed pipeline item, it extracts (see the sketch after this list):
  - author
  - date the article was published
  - article title
  - article description
  - article text
  - top image
  - language used
- **Configuration:** While the module works fine with the default settings, it is possible to reconfigure the tools used in the extraction process. These changes can be made in the `ArticleMasterExtractor` section of the config file. More detailed information about the module and the incorporated extractors can be found here.
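To illustrate the idea behind bundling several tools, here is a minimal, hypothetical sketch of an extract-and-merge step. It is not the actual news-please implementation; the field names and merge strategy (first non-empty value wins) are assumptions.

```python
# Hypothetical sketch of the extract-and-merge idea: run several extraction
# tools over the raw HTML and keep the first non-empty value per field.
from typing import Callable, Dict, List, Optional

Extractor = Callable[[str], Dict[str, Optional[str]]]

FIELDS = ["author", "publish_date", "title", "description",
          "text", "top_image", "language"]

def master_extract(html: str, extractors: List[Extractor]) -> Dict[str, Optional[str]]:
    """Merge the output of several extractors, preferring earlier ones."""
    merged: Dict[str, Optional[str]] = {field: None for field in FIELDS}
    for extract in extractors:
        result = extract(html)
        for field in FIELDS:
            if merged[field] is None and result.get(field):
                merged[field] = result[field]
    return merged
```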
- **Module path:** `newscrawler.pipeline.pipelines.DateFilter`
- **Functionality:** This module filters the extracted articles based on their publishing date. It drops all articles published before a given start date and/or after a given end date. It also implements a strict mode that drops all articles without an extracted publishing date.
- **Requirements:** Because it relies on metadata (the publishing date), the module only works if placed after a suitable extractor in the pipeline.
- **Configuration:** The configuration is done in the DateFilter section of `newscrawler.cfg`:

```ini
[DateFilter]
start_date = '1999-01-01 00:00:00'
end_date = '2999-12-31 00:00:00'
strict_mode = False
```
Dates can be either `None` or a date string in the format `'yyyy-mm-dd hh:mm:ss'`. A sketch of the filter logic is shown below.
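The following is a minimal sketch of how such a filter can be realized as a Scrapy item pipeline. It is an illustration, not the actual news-please code; the item field name `publish_date` and the constructor arguments are assumptions.

```python
from datetime import datetime

from scrapy.exceptions import DropItem

DATE_FORMAT = '%Y-%m-%d %H:%M:%S'

class DateFilter:
    """Drop items whose publish date falls outside [start_date, end_date]."""

    def __init__(self, start_date='1999-01-01 00:00:00',
                 end_date='2999-12-31 00:00:00', strict_mode=False):
        self.start = datetime.strptime(start_date, DATE_FORMAT) if start_date else None
        self.end = datetime.strptime(end_date, DATE_FORMAT) if end_date else None
        self.strict = strict_mode

    def process_item(self, item, spider):
        published = item.get('publish_date')
        if published is None:
            if self.strict:
                raise DropItem('no publish date extracted')
            return item  # without strict mode, undated articles pass through
        published = datetime.strptime(published, DATE_FORMAT)
        if (self.start and published < self.start) or \
                (self.end and published > self.end):
            raise DropItem('publish date outside the configured range')
        return item
```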
- **Module path:** `newscrawler.pipeline.pipelines.HMTLCodeHandling`
- **Functionality:** This module checks the server responses and drops the processed site if the request was not accepted. As of 2016-06-22 this module is not active, but it serves as an example pipeline module.
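As an illustration of such an example module, a status-code check might look like the following sketch. The field name `http_status` is an assumption, not part of the actual implementation.

```python
from scrapy.exceptions import DropItem

class HtmlCodeHandling:
    """Drop items whose HTTP response signals a failed request (sketch)."""

    def process_item(self, item, spider):
        status = item.get('http_status', 200)  # assumed field name
        if 200 <= status < 300:  # 2xx means the request was accepted
            return item
        raise DropItem(f'request not accepted (HTTP {status})')
```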
- **Module path:** `newscrawler.pipeline.pipelines.LocalStorage`
- **Functionality:** This module stores the raw HTML responses of the crawled articles on the local file system.
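A minimal sketch of the idea, assuming a root directory and that the item carries the raw HTML; the field names and directory layout are illustrative only, not the module's actual behaviour.

```python
import os
from urllib.parse import urlparse

def store_html_locally(item, root='/tmp/newscrawler'):
    """Write an item's raw HTML to <root>/<domain>/<name>.html (sketch)."""
    parsed = urlparse(item['url'])
    directory = os.path.join(root, parsed.netloc)
    os.makedirs(directory, exist_ok=True)
    name = os.path.basename(parsed.path) or 'index'
    path = os.path.join(directory, name + '.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(item['html'])  # assumed field holding the raw response body
    item['localpath'] = path   # cf. the 'localpath' field in the ES mapping
    return item
```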
- **Module path:** `newscrawler.pipeline.pipelines.ElasticsearchStorage`
- **Functionality:** This module stores the extracted data in a given Elasticsearch database. It manages two separate indices, one for current articles and one to archive previous versions of updated articles. Both indices use the following default mapping to store the articles and extracted metadata:

```python
mapping = {
    'url': {'type': 'string', 'index': 'not_analyzed'},
    'sourceDomain': {'type': 'string', 'index': 'not_analyzed'},
    'pageTitle': {'type': 'string'},
    'rss_title': {'type': 'string'},
    'localpath': {'type': 'string', 'index': 'not_analyzed'},
    'ancestor': {'type': 'string'},
    'descendant': {'type': 'string'},
    'version': {'type': 'long'},
    'downloadDate': {'type': 'date', 'format': 'yyyy-MM-dd HH:mm:ss'},
    'modifiedDate': {'type': 'date', 'format': 'yyyy-MM-dd HH:mm:ss'},
    'publish_date': {'type': 'date', 'format': 'yyyy-MM-dd HH:mm:ss'},
    'title': {'type': 'string'},
    'description': {'type': 'string'},
    'text': {'type': 'string'},
    'author': {'type': 'string'},
    'image': {'type': 'string', 'index': 'not_analyzed'},
    'language': {'type': 'string', 'index': 'not_analyzed'},
}
```
- **Configuration:** To use this module you have to enter the address, the port and, if needed, your user credentials in the Elasticsearch section of `newscrawler.cfg`. There you can also change the names of the indices and the mapping used to store the article data. A sketch of the two-index update scheme is shown below.
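For illustration, the current/archive scheme could be realized roughly as follows with the elasticsearch-py client (7.x-style calls). The index names, document id scheme, and client setup are assumptions, not the module's actual internals.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

def store_article(article: dict, doc_id: str):
    """Archive any existing version, then write the new one (sketch)."""
    if es.exists(index='news_current', id=doc_id):
        old = es.get(index='news_current', id=doc_id)['_source']
        es.index(index='news_archive', body=old)  # keep the previous version
    es.index(index='news_current', id=doc_id, body=article)
```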
- **Module path:** `newscrawler.pipeline.pipelines.MySQLStorage`
- **Functionality:** This module stores the extracted data in a given MySQL or MariaDB database. It manages two separate tables, one for current articles and one to archive previous versions of updated articles.
- **Configuration:** To use this module you have to enter the address, the port and, if needed, your user credentials in the MySQL section of `newscrawler.cfg`. There is also a setup script, `init-db.sql`, for conveniently creating the required tables. The versioning idea is sketched below.
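A rough sketch of the current/archive versioning with PyMySQL; the table and column names are placeholders, not the schema defined in `init-db.sql`.

```python
import pymysql

conn = pymysql.connect(host='localhost', user='user',
                       password='password', database='news')

def store_article(article: dict):
    """Move any existing row to the archive table, then insert the new one."""
    with conn.cursor() as cur:
        cur.execute('INSERT INTO archive_version '
                    'SELECT * FROM current_version WHERE url = %s',
                    (article['url'],))
        cur.execute('DELETE FROM current_version WHERE url = %s',
                    (article['url'],))
        cur.execute('INSERT INTO current_version (url, title, text) '
                    'VALUES (%s, %s, %s)',
                    (article['url'], article['title'], article['text']))
    conn.commit()
```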
- **Module path:** `newscrawler.pipeline.pipelines.RSSCrawlCompare`
- **Functionality:** Similar to the MySQL storage module, this module works with MySQL or MariaDB databases. Unlike the MySQL module, however, it only processes articles returned by the RSS crawler. For every passed article, the module looks for an older version in the database and updates its fields if a certain amount of time has passed since the last update/download. This module does not save new articles and is only meant to keep the database up to date.
- **Configuration:** To use this module you have to enter the address, the port and, if needed, your user credentials in the MySQL section of `newscrawler.cfg`. To set up the required tables, simply execute the provided setup script `init-db.sql`. You can also adjust the interval at which articles are updated via the `hours_to_pass_for_redownload_by_rss_crawler` parameter in the Crawler section, as sketched below.
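The interval check itself reduces to a simple time comparison; a hedged sketch, where the function name and the example value are illustrative:

```python
from datetime import datetime, timedelta

# value of hours_to_pass_for_redownload_by_rss_crawler (assumed example)
HOURS_TO_PASS = 12

def needs_update(last_download: datetime) -> bool:
    """True if the stored version is older than the configured interval."""
    return datetime.now() - last_download > timedelta(hours=HOURS_TO_PASS)
```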