A smart distributed crawler that infers navigation models of structured websites.
The crawling process consists of two phases:
- Given a list of entrypoints (homepage URLs), the navigation model of each website is automatically inferred by exploring a limited, yet representative, sample of their HTML pages. The generated models divide each website into classes of similarly structured pages, called Page Classes, and describe the properties of the links between different Page Classes (a simplified sketch of such a model follows this list).
- The generated models are then used to perform an extensive crawl of the websites, so that each URL can be associated with the Page Class it belongs to: the output consists of webpages clustered according to their HTML structure.
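To make the idea of a navigation model concrete, here is a minimal Java sketch of how a Page Class and its outgoing links could be represented. The class and field names are hypothetical and do not reflect the project's actual data structures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified view of an inferred navigation model: each Page Class
// groups structurally similar pages and records which other Page Classes its
// links lead to.
class PageClass {
    final String name;                                      // e.g. "ProductList", "Product"
    final List<String> samplePageUrls = new ArrayList<>();  // pages sampled in Phase 1
    final List<Link> outgoingLinks = new ArrayList<>();     // edges of the navigation model

    PageClass(String name) {
        this.name = name;
    }

    // A navigation edge: the XPath locating a group of links in the source pages
    // and the Page Class those links point to.
    static final class Link {
        final String xpath;
        final PageClass destination;

        Link(String xpath, PageClass destination) {
            this.xpath = xpath;
            this.destination = destination;
        }
    }
}
```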
Both the navigation models and the clustered pages can be useful in different use cases:
- The pages crawled in Phase 1 to infer the model can serve as samples for generating a wrapper that extracts structured data from the Page Classes of interest.
- Inferred models can be used in Phase 2 to explore only the portion of a website that contains data of interest, rather than all of its URLs. This is done by following only the links that take the crawler to pages of specific Page Classes. For example, if you want to crawl an e-commerce website but are interested only in the pages showing a product, you can follow just the paths leading to the "Product" Page Class.
- Since at the end of Phase 2 structurally similar pages are clustered together, it is possible to improve the compression of large archives of HTML pages by extracting the common template code of pages within the same class.
- Inferred models can be re-generated over time: by comparing different versions of the model of the same website, it is possible to identify the Page Classes that are most likely to be updated frequently. This is useful in incremental crawling scenarios, where a website is periodically re-crawled while avoiding the download of pages whose content has not changed since the last visit.
- Users can also skip Phase 1 and write partial navigation models by hand using XPath rules, which specify both how to navigate the input websites (navigation rules that match links in a page) and how to extract data of interest (extraction rules that match data in a page). An example of a navigation model with navigation and extraction rules can be found HERE; a minimal illustration of the two kinds of rules follows this list.
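As an illustration of the two kinds of rules, the following self-contained Java sketch evaluates a navigation rule (which links to follow) and an extraction rule (which data to pull out) against a toy product page, using the JDK's built-in XPath engine. The page markup and the XPath expressions are invented for illustration only and are not part of the project's model format; real pages would typically need an HTML parser rather than the strict XML parser used in this toy example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: a hypothetical navigation rule (link selector) and a
// hypothetical extraction rule (data selector) applied to a toy product page.
public class XPathRulesSketch {
    public static void main(String[] args) throws Exception {
        String html = "<html><body>"
                + "<a class='product' href='/p/42'>Widget</a>"
                + "<span id='price'>9.99</span>"
                + "</body></html>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(html)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Navigation rule: which links the crawler should follow (here, product links).
        NodeList links = (NodeList) xpath.evaluate(
                "//a[@class='product']/@href", doc, XPathConstants.NODESET);
        for (int i = 0; i < links.getLength(); i++) {
            System.out.println("follow: " + links.item(i).getNodeValue());
        }

        // Extraction rule: which data to extract from a "Product" page (here, the price).
        String price = xpath.evaluate("//span[@id='price']/text()", doc);
        System.out.println("price: " + price);
    }
}
```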
This project has been developed with the actor-model framework Akka. Here is a summary of the second-phase crawler architecture, showing the main components and the messages they exchange.
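For readers unfamiliar with Akka, the sketch below shows the kind of message passing the architecture is built on. The actor and message names are hypothetical and do not correspond to the project's actual components.

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

// Hypothetical actor and message names, for illustration only.
public class CrawlerActorsSketch {

    // Message asking a worker to fetch a URL and classify it.
    static final class FetchUrl {
        final String url;
        FetchUrl(String url) { this.url = url; }
    }

    // Reply carrying the Page Class the URL was assigned to.
    static final class PageFetched {
        final String url;
        final String pageClass;
        PageFetched(String url, String pageClass) {
            this.url = url;
            this.pageClass = pageClass;
        }
    }

    // A worker actor that would download the page and match it against the model.
    static class CrawlWorker extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(FetchUrl.class, msg -> {
                        // ... download msg.url and classify it against the navigation model ...
                        // Reply to the sender (the dead-letter mailbox in this toy setup).
                        getSender().tell(new PageFetched(msg.url, "Product"), getSelf());
                    })
                    .build();
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("crawler");
        ActorRef worker = system.actorOf(Props.create(CrawlWorker.class), "worker");
        worker.tell(new FetchUrl("https://example.com/"), ActorRef.noSender());
    }
}
```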
You'll need JRE 1.8 or higher, Maven and Neo4j.
- Clone the project
- Import it as a Maven project in Eclipse or a similar IDE
- Run a Neo4j instance and provide your configuration in ogm.properties (a minimal example follows this list)
- Edit master.conf with your configuration (the file itself explains how to configure the crawler).
- Start the slave nodes, if any, by running RepositoryMain.java on each node with a progressive node number as argument.
- Run MasterMain.java on the master node.
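For the Neo4j step above, a minimal ogm.properties could look like the following. This is a sketch assuming Neo4j OGM with the Bolt driver; adapt the URI and credentials to your own instance and OGM version.

```properties
# Sketch only: assumes Neo4j OGM with the Bolt driver.
# Adapt the URI and credentials to your own Neo4j instance.
URI=bolt://localhost:7687
username=neo4j
password=your-password
```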
- Inferred models are stored in targets.
- Every downloaded HTML page is stored in the "html" directory in the root of the project.
- For each website, a CSV file of "URL|PAGE CLASS|LOCAL FILE PATH" triples is stored in the repository.
- Fix frontier limit
- Fix regex for homepage URL
- Add an option to download images
- Check and document master.conf configurations
- Check and document tests more thoroughly
Huge thanks to Fabio Cibecchini for starting this project!