get37 is a Scala / ZIO based web scraper/spider built as part of a technical assignment at 13|37.
After the project is assembled into an "über-JAR" (see the build instructions), you can use it like this:
$ java -jar target/*/get37.jar https://tretton37.com
$ java -jar target/*/get37.jar --maxFibers 10 --preFetchDelay 70 --maxDepth 4 https://zio.dev
$ java -jar target/*/get37.jar --help # for more help
get37 currently supports three configuration flags that can be passed along when the tool is started.
- maxFibers, set to 10 by default, tells the ZIO runtime how many concurrent fibers can be used when sub-requests are being made.
- preFetchDelay, set to 10 milliseconds by default, adds a time delay before subsequent requests are made.
- maxDepth, set to 3 by default, serves as a hard limit on how deep the spider goes into the site's structure.
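To make the effect of maxDepth more concrete, here is a minimal sketch of a depth-limited, breadth-first crawl. It is not get37's actual implementation: the object name, the `crawl` function and the in-memory `site` map are hypothetical, and it assumes the start page counts as depth 0. Concurrency (maxFibers) and the delay (preFetchDelay) are left out on purpose.

```scala
import scala.annotation.tailrec

object DepthLimitedCrawl {
  // Hypothetical in-memory "site": page -> links found on that page.
  // A real spider would fetch and parse each page instead.
  val site: Map[String, List[String]] = Map(
    "/"      -> List("/a", "/b"),
    "/a"     -> List("/a/1", "/"),
    "/b"     -> List("/a"),
    "/a/1"   -> List("/a/1/x"),
    "/a/1/x" -> Nil
  )

  /** Breadth-first crawl; pages deeper than maxDepth are never visited.
    * Assumption: the start page is at depth 0.
    * Returns pages in the order they were first reached. */
  def crawl(start: String, maxDepth: Int, links: String => List[String]): List[String] = {
    @tailrec
    def loop(frontier: List[String], depth: Int, seen: Set[String], acc: List[String]): List[String] =
      if (frontier.isEmpty || depth > maxDepth) acc
      else {
        // Collect links of the current level, skipping pages already seen.
        val next = frontier.flatMap(links).distinct.filterNot(seen)
        loop(next, depth + 1, seen ++ next, acc ++ frontier)
      }

    loop(List(start), 0, Set(start), Nil)
  }

  def main(args: Array[String]): Unit = {
    val visited = crawl("/", maxDepth = 1, links = url => site.getOrElse(url, Nil))
    visited.foreach(println)
  }
}
```

With a larger maxDepth the same call reaches the deeper pages as well; the `seen` set keeps cyclic links (such as `/a` pointing back to `/`) from being visited twice.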
This project uses Nix Shell (shell.nix) to manage project dependencies. JDK and SBT are the only dependencies.
$ sbt "run https://tretton37.com"
To build the "über-JAR", this project uses the sbt-assembly and sbt-native-packager plugins.
$ sbt assembly
$ java -jar target/*/get37.jar
This project also comes with tests that can be invoked with SBT, and with a CircleCI setup.
$ sbt test
- zio - High-performance, type-safe, composable asynchronous and concurrent programming library and framework for Scala.
- zio-cli - Powerful command-line applications framework for ZIO.
- zio-http (ex-zhttp) - A Scala library for building HTTP apps. It is powered by ZIO and Netty and aims at being the de facto solution for writing highly scalable and performant web applications using idiomatic Scala.
- jsoup - A Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors. In this project it is only used for content/link extraction.
- os-lib - A simple, flexible, high-performance Scala interface to common OS filesystem and subprocess APIs.
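As an illustration of the link extraction that jsoup is used for, the sketch below pulls absolute URLs out of an HTML string. It assumes jsoup is on the classpath; the object name `LinkExtraction`, the `links` helper and the sample HTML are hypothetical and not taken from get37's source.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object LinkExtraction {
  /** Returns the absolute URLs of all <a href="..."> elements in the HTML.
    * Relative links are resolved against baseUri. */
  def links(html: String, baseUri: String): List[String] =
    Jsoup.parse(html, baseUri)
      .select("a[href]")      // CSS selector: anchors that carry an href
      .asScala
      .map(_.absUrl("href"))  // empty string if the URL cannot be resolved
      .filter(_.nonEmpty)
      .distinct
      .toList

  def main(args: Array[String]): Unit = {
    // Hypothetical sample page with one relative and one absolute link.
    val html = """<a href="/about">About</a> <a href="https://zio.dev">ZIO</a>"""
    links(html, "https://example.com/").foreach(println)
  }
}
```

Resolving links to absolute URLs at extraction time keeps the rest of a spider simple, since every discovered link can be fetched and de-duplicated without knowing which page it came from.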