This repository contains code associated with ZenRows' blog.
The framework was generated with the StormCrawler Maven archetype and customized into a full-fledged web crawler that stores the crawled data in a CSV file.
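For illustration, here is a minimal sketch (not the repository's actual implementation) of how a Storm bolt could append crawled records to a CSV file. The class name, output path, and tuple field names ("url", "content") are assumptions made for the example.

// CsvWriterBolt.java -- illustrative sketch only
package com.tutorial;

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class CsvWriterBolt extends BaseRichBolt {

    private OutputCollector collector;
    private transient PrintWriter writer;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
            // Open in append mode so records survive restarts; the file name is an assumption.
            writer = new PrintWriter(new FileWriter("crawl-output.csv", true));
        } catch (IOException e) {
            throw new RuntimeException("Could not open CSV output file", e);
        }
    }

    @Override
    public void execute(Tuple tuple) {
        String url = tuple.getStringByField("url");
        String content = tuple.getStringByField("content");
        // Quote and escape fields so commas or quotes in the content do not break the CSV layout.
        writer.printf("\"%s\",\"%s\"%n", url.replace("\"", "\"\""), content.replace("\"", "\"\""));
        writer.flush();
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: nothing is emitted downstream.
    }
}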
You need Apache Storm installed. The official instructions on setting up a Storm cluster should help. Alternatively, the stormcrawler-docker project provides resources for running Apache Storm on Docker.
Generate the project skeleton with the StormCrawler Maven archetype using the following command:
mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-archetype -DarchetypeVersion=3.1.0
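The archetype runs interactively and asks for the project coordinates. Based on the class and jar names used later in this document, values along these lines would match (they are an assumption, adjust as needed):

groupId: com.tutorial
artifactId: stormcrawler-tutorial
version: 1.0
package: com.tutorial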
Generate an uberjar with:
mvn clean package
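Maven writes the uberjar to the target directory; with the coordinates above it would be:

<FULL_PROJECT_PATH>/target/stormcrawler-tutorial-1.0.jar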
You can now submit the topology using the storm command:
python storm.py local <FULL_PROJECT_PATH>/target/stormcrawler-tutorial-1.0.jar --local-ttl 60 com.tutorial.CrawlTopology -- -conf <FULL_PROJECT_PATH>/crawler-conf.yaml
This runs the topology in local mode for 60 seconds. To start the topology in distributed mode, where it will run indefinitely, use the 'storm jar' command instead.
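For reference, a distributed submission would look roughly like this (the exact arguments depend on your cluster setup; this is a sketch, not a command taken from the repository):

storm jar <FULL_PROJECT_PATH>/target/stormcrawler-tutorial-1.0.jar com.tutorial.CrawlTopology -conf <FULL_PROJECT_PATH>/crawler-conf.yaml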