ZenRows/stormcrawler-tutorial


This repository contains code associated with ZenRows' blog.

The project skeleton was generated with the StormCrawler Maven archetype and customized into a full-fledged web crawler that stores the crawled data in CSV format.

Prerequisites

You need to install Apache Storm. The instructions on setting up a Storm cluster should help. Alternatively, the stormcrawler-docker project contains resources for running Apache Storm on Docker.

Build the Project

Generate the project skeleton with Maven using the following command:

mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-archetype -DarchetypeVersion=3.1.0

Compilation

Generate an uberjar with:

mvn clean package

Running the crawl

You can now submit the topology using the storm command:

python storm.py local <FULL_PROJECT_PATH>/target/stormcrawler-tutorial-1.0.jar --local-ttl 60 com.tutorial.CrawlTopology -- -conf <FULL_PROJECT_PATH>/crawler-conf.yaml

This runs the topology in local mode for 60 seconds and then stops it. To start the topology in distributed mode, where it runs indefinitely, use 'storm jar' instead of 'storm local'.
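
For distributed mode, the submit command might look like the sketch below. It is an illustration rather than a command taken from this repository: the PROJECT_PATH value is a hypothetical placeholder, the jar name and topology class are carried over from the local-mode command above, and it assumes the storm launcher is on your PATH and points at a running cluster. The script only assembles and prints the command so you can review it before running it.

```shell
#!/bin/sh
# Hypothetical placeholder: replace with the full path to your project.
PROJECT_PATH="/path/to/stormcrawler-tutorial"

# Jar name and topology class are taken from the local-mode command;
# adjust them if your build produces a different artifact.
JAR="$PROJECT_PATH/target/stormcrawler-tutorial-1.0.jar"
CONF="$PROJECT_PATH/crawler-conf.yaml"

# 'storm jar' submits the topology to the cluster instead of running it locally.
# Everything after the class name is passed to the topology's main method.
SUBMIT_CMD="storm jar $JAR com.tutorial.CrawlTopology -conf $CONF"

# Print the command for review; paste it into a terminal to submit the topology.
echo "$SUBMIT_CMD"
```

Unlike local mode, there is no --local-ttl option here, so the topology keeps running until it is stopped explicitly, for example with 'storm kill' followed by the topology name.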
