Mercator was a Belgian geographer, but it is also DNS Belgium's crawler project.
Given a domain name, Mercator gathers public information from different sources: HTML, DNS records, SMTP servers, TLS certificates, etc.
The architecture is optimized to give the system the following properties:
- performance: we want to be able to crawl our zone (.be: ~1.7 million domain names) within a few days, at a cost comparable to that of existing crawlers with similar features
- scalability: we want to be able to deploy the crawler in at least two ways: 'listening mode', which reacts to domain events from the registration platform such as new registrations (~1 visit/minute, optimized for lead time), and 'batch mode', which crawls the entire zone in minimal time (~1 full crawl/month, which should reach ~1,000 visits/minute to crawl the zone within a day, optimized for throughput). We don't want to pay all the time for our peak usage.
- extensibility: introducing new components should not disrupt the operation of other components. We don't need this at runtime; we can afford to take the system offline for upgrades. A second aspect of extensibility is that we want to be able to share the code with other registries, which implies that we shouldn't have .be-specific code in this repository. We therefore need to be able to add behaviour to the system from multiple code bases.
- robustness: this is more a design concern than an architectural one. The internet is a wild place: be aware that you'll encounter situations out there you would not think possible. Individual components should be able to handle them. On an architectural level, we don't think there is a need for more than the usual robustness against missing or broken components of the crawler, as long as the system is extensible.
- observability: if things go wrong, both functionally and performance-wise, we want to be able to quickly locate the problematic points.
The current architectural model is service-oriented, networked and event-driven. Multiple modules act on their own interests, and the primary communication between modules happens over queues: a module publishes events, other modules react to those events in their own processing and, as a result, can announce new events of their own.
The end goal of a crawler is to collect data, so all crawl results are written into a shared datastore, which allows easier manual consumption of the data. Each module has its own schema. Over time, we should make clear distinctions between temporary, internal module state and stable, consumable data.
- content-crawler (Java / Spring Boot): Content-related crawling. Interacts over SQS with muppets to fetch and retrieve content. Stores links to the relevant S3 files in the DB.
- dispatcher (Java / Spring Boot): Spring Boot app that serves as the API entry point to the crawler system and copies messages to the input queues of all modules that need them.
- dns-crawler (Java / Spring Boot): DNS-related crawling. Fetches DNS records and post-processes them with GeoIP.
- feature-extraction (Java / Spring Boot): Extracts features from data collected by other modules. For now it only extracts features from the HTML.
- ground-truth (Java / Spring Boot): Only creates the tables used for labeling Mercator visits.
- mercator-api (Java / Spring Boot): Exposes an API via Spring Data REST, using the JPA entities of other modules.
- mercator-ui (NodeJS / TypeScript / React): Basic UI for Mercator.
- mercator-wappalyzer (NodeJS / TypeScript): Wrapper around Wappalyzer, used to detect technologies on websites.
- muppets (NodeJS / TypeScript): Takes screenshots via Puppeteer, interacts over SQS and stores results in S3.
- observability: Code to support the observability property of the system.
- smtp-crawler (Java / Spring Boot): Stores information about SMTP servers.
- tls-crawler (Java / Spring Boot): Gathers TLS certificates and extracts data.
- vat-crawler (Java / Spring Boot): Tries to find VAT numbers (Belgian only for now) on websites.
- mercator-cli (Java / Spring Boot): Command-line interface to interact with the SQS queues of the different modules.
All modules currently write into a PostgreSQL database. See the Data Catalog for more details.
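If you want to inspect the data manually while the Docker Compose stack is running, one option is a psql session inside the database container. A minimal sketch, assuming the Compose service is named db and uses the default postgres user (as suggested by the database viewer URL further below):

docker-compose exec db psql -U postgres -c '\dn'   # list the per-module schemas
docker-compose exec db psql -U postgres            # open an interactive session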
Be aware that running Mercator might trigger alarms on the crawled domains' servers. Some modules open a lot of parallel connections and might be seen as a scanning tool or even a DoS attack. We plan to implement rate limits and/or cached lookups where possible. Keep that in mind when changing configuration values such as timeouts, rate limits, the max value for Horizontal Pod Autoscaling (HPA), etc.
DNS Belgium is not responsible for any complaints you may receive as a result of using Mercator.
- Java 17
- Docker & Docker Compose
- Helm
Builds with Gradle, bundled as the Gradle wrapper. Uses Spring Boot and Spring Cloud AWS/SQS, Spring Messaging and Spring JMS.
We currently use LocalStack locally and AWS SQS in the cloud as the cloud stream implementation, and Postgres as the data store implementation.
In order to use the GeoIP capability of the crawler, you need to register at MaxMind and generate a license key. You can then enable GeoIP in dns-crawler and smtp-crawler and put your license key there (see the properties files).
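The exact property names are defined in each module's properties files; a quick way to locate them (assuming the standard src/main/resources layout of the Gradle modules):

grep -ri maxmind dns-crawler/src/main/resources smtp-crawler/src/main/resources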
Run start-all.sh to start the entire system using docker-compose.
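Once LocalStack is up, you can check that the input queues have been created (using the same local endpoint as the examples below):

aws --endpoint-url=http://localhost:4566 sqs list-queues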
Once all services are running, start a crawl by sending a message to the dispatcher input queue:
aws --endpoint-url=http://localhost:4566 sqs send-message --queue-url http://localhost:4566/queue/mercator-dispatcher-input --message-body '{"domainName": "abc.be"}'
or if you want a label, use
aws --endpoint-url=http://localhost:4566 sqs send-message --queue-url http://localhost:4566/queue/mercator-dispatcher-input --message-body '{"domainName": "abc.be", "labels": ["some-label"]}'
After some time, results should start appearing in the database. You can surf to http://localhost:5050/?pgsql=db&username=postgres&db=postgres to inspect the database using the bundled database viewer.
It's best to always go via the dispatcher, but if you really want to, you can send a message directly to the input queue of a single module. In that case you have to add a valid visit_id:
aws --endpoint-url=http://localhost:4566 sqs send-message --queue-url http://localhost:4566/queue/mercator-content-crawler-input --message-body '{"visitId":"f1c25d2f-3248-4013-a73a-307f91ee014b", "domainName": "abc.be"}'
Or
aws --endpoint-url=http://localhost:4566 sqs send-message --queue-url http://localhost:4566/queue/mercator-dns-crawler-input --message-body '{"visitId":"f1c25d2f-3248-4013-a73a-307f91ee014a", "domainName": "abc.be"}'
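The visitId values in these examples are UUIDs; if you need a fresh one for ad-hoc testing, you can generate one on the command line (a sketch, assuming uuidgen is available on your machine):

VISIT_ID=$(uuidgen | tr '[:upper:]' '[:lower:]')   # uuidgen emits uppercase on some systems
aws --endpoint-url=http://localhost:4566 sqs send-message --queue-url http://localhost:4566/queue/mercator-dns-crawler-input --message-body "{\"visitId\": \"$VISIT_ID\", \"domainName\": \"abc.be\"}"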
For local development, it makes more sense to run the services outside the Docker network, directly on your machine.
Run start-deps.sh to start only the dependencies.
Use the 'local' profile to start individual components.
The local (Spring) profile uses the following ports by default:
- 8082 : DNS crawler
- 8083 : SMTP crawler
- 8084 : content crawler
- 8085 : muppets
- 8086 : dispatcher
- 8087 : wappalyzer
- 8088 : feature-extraction
- 8089 : mercator-api
- 8090 : mercator-ui
- 8091 : vat-crawler
- 8092 : tls-crawler
Or use IntelliJ to run/debug one of the Application classes. Do not forget to specify the local profile in this case.
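As an illustration, a single module can also be started from the command line with the local profile; the module path here is an assumption based on the repository layout, so adjust it to the module you want to run:

SPRING_PROFILES_ACTIVE=local ./gradlew :dns-crawler:bootRun
# or pass the profile via the Spring Boot plugin:
./gradlew :dns-crawler:bootRun --args='--spring.profiles.active=local'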
- Prometheus : http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
Mercator is built to run on Kubernetes. This allows the modules of Mercator to be scaled easily and according to the load.
The containers for the Spring Boot apps are built using Jib. The other modules use the Docker CLI to build their containers.
Each module has a Gradle task dockerBuild that creates the Docker container.
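For example (the module path is an assumption; substitute the module you want to build):

./gradlew :dns-crawler:dockerBuild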
Each module has its Helm chart. We use Helm to deploy the app in Kubernetes.
It is possible to create a local Kubernetes cluster with Docker Desktop.
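A hedged sketch of deploying a single module to such a local cluster; the chart location and release name below are assumptions, so check the module's Helm chart for the real values:

helm upgrade --install dns-crawler ./dns-crawler/helm   # assumed chart path
kubectl get pods                                        # verify the pod comes up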
Infrastructure as code is stored in another git repo.
TODO
If you are interested in contributing to Mercator:
- Start by reading our contributing guide.
- Navigate our codebase and open issues.
We are thankful for all the contributions and feedback we receive.
This tool was initially developed by DNS Belgium.
This version of Mercator is released under the Apache License, Version 2.0 (see LICENSE.txt).