Parallel crawler for structured data in Rust
The Docker image is just 14 MB thanks to static compilation, a multi-stage build, and a scratch base image.
# A quick test build of the Rust executable
# The CLI executable displays its progress
# '-l' sets the maximum number of webpages to crawl (default is 1024)
# '-r' sets the maximum number of retries for failed pages (default is 3)
cargo run --release -- -i file.txt -o out.txt -t 8 -l 300 -r 5
# A help page with CLI parameters' descriptions
cargo run -- --help
# Build and run a Docker image
# This build is relatively slow and produces a static executable
# Progress is not displayed correctly inside the Docker container
./build.sh
docker run --rm lastgenius/rust-crawler -i file.txt -o out.txt -t 8
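For orientation, the flags used above (-i, -o, -t, -l, -r) amount to a small set of options. Here is a minimal sketch of how they could be declared, assuming the clap crate with its derive feature; the crate and field names actually used in this project may differ:

```rust
use clap::Parser;

/// Hypothetical CLI definition mirroring the flags used above.
#[derive(Parser, Debug)]
struct Args {
    /// Input file with webpage links
    #[arg(short = 'i', long)]
    input: String,

    /// Output file for the crawled structured data
    #[arg(short = 'o', long)]
    output: String,

    /// Number of worker threads in the pool
    #[arg(short = 't', long)]
    threads: usize,

    /// Maximum number of webpages to crawl
    #[arg(short = 'l', long, default_value_t = 1024)]
    limit: usize,

    /// Maximum number of retries for failed pages
    #[arg(short = 'r', long, default_value_t = 3)]
    retries: u32,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}
```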
A few files that this monster has spent its time on:
- A little snippet of shuffled links for you to have something to test the crawler with
- 2.5 GB Full Text Data with URLs
- 400 MB of Full Text Data + New Links from URLs
- 900 MB of unique shuffled links
You can play around with these files; it's better to use existing CLI tools to keep yourself sane:
- Sort lines, keep unique ones:
sort file.txt -o out.txt -u
- Shuffle lines
shuf file.txt -o out.txt
And so on...
This is a rough outline of how this crawler works. I will try to update it after any major changes, but it's always better to check out the module documentation itself if you need to understand everything on a deeper level.
You can build the documentation locally with cargo doc --open and view it by choosing the crate from the menu on the left.
Overall, the program works like this: a single main thread starts off, reads user input from the command line, reads the input file with webpage links, and then creates a thread pool of workers, establishing several MPSC (Multiple Producer, Single Consumer) channels between the main thread and the threads in the pool, as sketched after this list:
- URL channel - the main thread sends URLs to crawl to the thread pool through it. Since an MPSC channel has only a single consumer, its receiving end has to be protected by a mutex that the workers share (essentially a queue with a lock).
- New URL channel - threads in the pool send back any new URLs they discover during the crawl.
- Structured data channel - threads in the pool send back any structured data found on a page.
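A minimal sketch of this channel setup with std::sync::mpsc, leaving the actual crawling out (the real channels carry richer types than plain strings):

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

fn main() {
    // URL channel: main thread -> pool. The single receiver is wrapped in a
    // mutex so every worker can pull batches from it (a queue with a lock).
    let (url_tx, url_rx) = mpsc::channel::<Vec<String>>();
    let url_rx = Arc::new(Mutex::new(url_rx));

    // New URL channel and structured data channel: pool -> main thread.
    let (new_url_tx, new_url_rx) = mpsc::channel::<String>();
    let (data_tx, data_rx) = mpsc::channel::<String>();

    let workers: Vec<_> = (0..4)
        .map(|_| {
            let url_rx = Arc::clone(&url_rx);
            let new_url_tx = new_url_tx.clone();
            let data_tx = data_tx.clone();
            thread::spawn(move || loop {
                // Hold the lock only long enough to take one batch.
                let batch = {
                    let rx = url_rx.lock().unwrap();
                    rx.recv()
                };
                let Ok(batch) = batch else { break }; // channel closed
                for url in batch {
                    // A real worker would fetch and parse the page here.
                    new_url_tx.send(format!("{url}/found-link")).ok();
                    data_tx.send(format!("structured data from {url}")).ok();
                }
            })
        })
        .collect();

    url_tx.send(vec!["https://example.com".to_string()]).unwrap();
    drop(url_tx); // no more work: workers exit once the queue drains

    for handle in workers {
        handle.join().unwrap();
    }
    drop(new_url_tx);
    drop(data_tx);

    for url in new_url_rx {
        println!("discovered: {url}");
    }
    for item in data_rx {
        println!("{item}");
    }
}
```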
Each worker is just a single-threaded asynchronous tokio runtime: it takes a vector of URLs to crawl from the URL channel, asynchronously shoots off the requests, and the fetched data is then parsed and sent back to the main thread, roughly as in the sketch below.
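Assuming the HTTP client is reqwest and the futures crate is available for joining the requests (not necessarily the project's actual dependencies), a single worker might look roughly like this:

```rust
use std::sync::mpsc::{Receiver, Sender};
use std::sync::{Arc, Mutex};

// Hypothetical worker: each OS thread owns its own single-threaded tokio
// runtime and fetches a whole batch of URLs concurrently.
fn worker(url_rx: Arc<Mutex<Receiver<Vec<String>>>>, data_tx: Sender<String>) {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .expect("failed to build tokio runtime");

    loop {
        // Take the next batch, holding the lock as briefly as possible.
        let batch = {
            let rx = url_rx.lock().unwrap();
            rx.recv()
        };
        let Ok(batch) = batch else { break }; // channel closed: no more work

        rt.block_on(async {
            let client = reqwest::Client::new();
            // Shoot off all requests in the batch at once.
            let fetches = batch.into_iter().map(|url| {
                let client = client.clone();
                async move {
                    let response = client.get(url.as_str()).send().await;
                    (url, response)
                }
            });
            for (url, response) in futures::future::join_all(fetches).await {
                let Ok(resp) = response else { continue }; // retried by the main thread
                let Ok(body) = resp.text().await else { continue };
                // Parsing into structured data would happen here.
                data_tx.send(format!("{url}: {} bytes", body.len())).ok();
            }
        });
    }
}
```

Using one current-thread runtime per worker keeps each worker independent while still letting a whole batch of requests run concurrently inside it.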
If a URL errors out, the main thread repeats it a few times, waiting exponentially longer between attempts, before discarding the URL altogether.
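As an illustration of that retry policy (hypothetical names, not the project's actual implementation), the bookkeeping on the main thread could look like this:

```rust
use std::collections::HashMap;
use std::thread;
use std::time::Duration;

/// Hypothetical retry bookkeeping: a failed URL is re-queued up to
/// `max_retries` times, waiting exponentially longer before each attempt,
/// and is dropped for good once the limit is reached.
fn requeue_failed(
    failed: Vec<String>,
    attempts: &mut HashMap<String, u32>,
    max_retries: u32,
) -> Vec<String> {
    let mut retry = Vec::new();
    for url in failed {
        let tries = attempts.entry(url.clone()).or_insert(0);
        *tries += 1;
        if *tries <= max_retries {
            // 1s, 2s, 4s, ... between attempts. A real implementation would
            // schedule the retry instead of blocking the main thread.
            thread::sleep(Duration::from_secs(1u64 << (*tries - 1)));
            retry.push(url);
        }
    }
    retry
}
```

Here `max_retries` corresponds to the -r flag described above.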
Some useful resources on Rust in general, as well as on concurrency and web:
- A pretty useful tutorial
- Rust Cookbook's examples:
- Tokio tutorial
- std::thread documentation
- std::sync::mpsc documentation
- Async book
I'm using these libraries (Rust calls them crates):
And also these for pretty debug printing and argument parsing: