Mercator was a Belgian geographer, but it is also DNS Belgium's crawler project.
Given a domain name, Mercator gathers public information from different sources: HTML, DNS records, SMTP servers, TLS certificates, etc.
The architecture is optimized to give the system the following properties:
- performance: we want to be able to crawl our zone (.be: ~1.7 million domain names) within a few days, at a cost comparable to that of existing crawlers with similar features
- scalability: we want to be able to deploy the crawler in at least two ways: 'listening mode', which reacts to domain events from the registration platform such as new registrations (~1 visit/minute, optimized for lead time), and 'batch mode', which crawls the entire zone in minimal time (~1 full crawl/month, which should reach ~1,000 visits/minute to crawl the zone within a day, optimized for throughput). We don't want to pay all the time for our peak usage.
- extensibility: introducing new components should not disrupt the operation of other components. We don't need this at runtime; we can afford to take the system offline for upgrades. A second aspect of extensibility is that we want to be able to share the code with other registries, which implies that we shouldn't have .be-specific code in this repository. We therefore need to be able to add behaviour to the system from multiple code bases.
- robustness: this is more a design concern than an architectural one. The internet is a wild place: be aware that you'll encounter situations out there you would not think possible. Individual components should be able to handle them. On an architectural level, we don't think there is a need for more than the usual robustness against missing or broken components of the crawler, as long as the system is extensible.
- observability: if things go wrong, both functionally and performance-wise, we want to be able to quickly locate the problematic points.
The current architectural model is service-oriented, networked and event-driven. Multiple modules act on their own interests, and the primary communication between modules happens over queues: a module publishes events, other modules react to those events in their own processing and, as a result, can announce new events of their own.
The end goal of a crawler is to collect data, so all crawl results are written into a shared datastore, which allows easier manual consumption of the data. Each module has its own schema. Over time, we should make clear distinctions between temporary, internal module state and stable, consumable data.
- content-crawler (Java / Spring Boot): Content-related crawling. Interacts over SQS with muppets to fetch and retrieve content. Stores links to the relevant S3 files in the DB.
- dispatcher (Java / Spring Boot): Spring Boot app that serves as the API entry point to the crawler system and copies messages to the input queues of all modules that need them.
- dns-crawler (Java / Spring Boot): DNS-related crawling. Fetches DNS records and post-processes them with GeoIP.
- feature-extraction (Java / Spring Boot): Extracts features from data collected by other modules. For now it only extracts features from the HTML.
- ground-truth (Java / Spring Boot): Only creates the tables used for labeling Mercator visits.
- mercator-api (Java / Spring Boot): Exposes an API via Spring Data REST, using the JPA entities of other modules.
- mercator-ui (NodeJS / TypeScript / React): Basic UI for Mercator.
- mercator-wappalyzer (NodeJS / TypeScript): Wrapper around Wappalyzer, used to detect technologies on websites.
- muppets (NodeJS / TypeScript): Takes screenshots via Puppeteer, interacts over SQS and stores results in S3.
- observability: Code to support the observability property of the system.
- smtp-crawler (Java / Spring Boot): Stores information about SMTP servers.
- tls-crawler (Java / Spring Boot): Gathers TLS certificates and extracts data.
- vat-crawler (Java / Spring Boot): Tries to find VAT numbers (Belgian only for now) on websites.
- mercator-cli (Java / Spring Boot): Command-line interface to interact with the SQS queues of the different modules.
All modules currently write into a PostgreSQL database. See the Data Catalog for more details.
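If you want to inspect the data manually while the Docker Compose stack is running, one option is a psql session inside the database container. A minimal sketch, assuming the Compose service is named db and uses the default postgres user (as suggested by the database viewer URL further below):

docker-compose exec db psql -U postgres -c '\dn'   # list the per-module schemas
docker-compose exec db psql -U postgres            # open an interactive session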
Be aware that running Mercator might trigger alarms on the crawled domains' servers. Some modules open a lot of parallel connections and might be seen as a scanning tool or even a DoS attack. We plan to implement rate limits and/or cached lookups where possible. Keep that in mind when changing configuration values such as timeouts, rate limits, the max value for Horizontal Pod Autoscaling (HPA), etc.
DNS Belgium is not responsible for any complaints you may receive as a result of using Mercator.
- Java 17
- Docker & Docker Compose
- Helm
Builds with Gradle, bundled as the Gradle wrapper. Uses Spring Boot and Spring Cloud AWS/SQS, Spring Messaging and Spring JMS.
We currently use LocalStack locally and AWS SQS in the cloud as the cloud stream implementation, and Postgres as the data store implementation.
In order to use the GeoIP capability of the crawler, you need to register at MaxMind and generate a license key. You can then enable GeoIP in dns-crawler and smtp-crawler and put your license key there (see the properties files).
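The exact property names are defined in each module's properties files; a quick way to locate them (assuming the standard src/main/resources layout of the Gradle modules):

grep -ri maxmind dns-crawler/src/main/resources smtp-crawler/src/main/resources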
Run start-all.sh to start the entire system using docker-compose.
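Once LocalStack is up, you can check that the input queues have been created (using the same local endpoint as the examples below):

aws --endpoint-url=http://localhost:4566 sqs list-queues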
Once all services are running, start a crawl by sending a message to the dispatcher input queue:
aws --endpoint-url=http://localhost:4566 sqs send-message --queue-url http://localhost:4566/queue/mercator-dispatcher-input --message-body '{"domainName": "abc.be"}'
or if you want a label, use
aws --endpoint-url=http://localhost:4566 sqs send-message --queue-url http://localhost:4566/queue/mercator-dispatcher-input --message-body '{"domainName": "abc.be", "labels": ["some-label"]}'
After some time, results should start appearing in the database. You can surf to http://localhost:5050/?pgsql=db&username=postgres&db=postgres to inspect the database using the bundled database viewer.
It's best to always go via the dispatcher, but if you really want to, you can send a message directly to the input queue of a single module. In that case you have to add a valid visit_id:
aws --endpoint-url=http://localhost:4566 sqs send-message --queue-url http://localhost:4566/queue/mercator-content-crawler-input --message-body '{"visitId":"f1c25d2f-3248-4013-a73a-307f91ee014b", "domainName": "abc.be"}'
Or
aws --endpoint-url=http://localhost:4566 sqs send-message --queue-url http://localhost:4566/queue/mercator-dns-crawler-input --message-body '{"visitId":"f1c25d2f-3248-4013-a73a-307f91ee014a", "domainName": "abc.be"}'
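The visitId values in these examples are UUIDs; if you need a fresh one for ad-hoc testing, you can generate one on the command line (a sketch, assuming uuidgen is available on your machine):

VISIT_ID=$(uuidgen | tr '[:upper:]' '[:lower:]')   # uuidgen emits uppercase on some systems
aws --endpoint-url=http://localhost:4566 sqs send-message --queue-url http://localhost:4566/queue/mercator-dns-crawler-input --message-body "{\"visitId\": \"$VISIT_ID\", \"domainName\": \"abc.be\"}"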
For local development, it makes more sense to run the services outside the Docker network, directly on your machine.
Run start-deps.sh to start only the dependencies.
Use the 'local' profile to start individual components.
The local (Spring) profile uses the following ports by default:
- 8082 : DNS crawler
- 8083 : SMTP crawler
- 8084 : content crawler
- 8085 : muppets
- 8086 : dispatcher
- 8087 : wappalyzer
- 8088 : feature-extraction
- 8089 : mercator-api
- 8090 : mercator-ui
- 8091 : vat-crawler
- 8092 : tls-crawler
Or use IntelliJ to run/debug one of the Application classes. Do not forget to specify the local profile in this case.
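As an illustration, a single module can also be started from the command line with the local profile; the module path here is an assumption based on the repository layout, so adjust it to the module you want to run:

SPRING_PROFILES_ACTIVE=local ./gradlew :dns-crawler:bootRun
# or pass the profile via the Spring Boot plugin:
./gradlew :dns-crawler:bootRun --args='--spring.profiles.active=local'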
- Prometheus : http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
Mercator is built to run on Kubernetes. This allows the modules of Mercator to be scaled easily and according to the load.
The containers for the Spring Boot apps are built using Jib. The other modules use the Docker CLI to build their containers.
Each module has a Gradle task dockerBuild that creates the Docker container.
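For example (the module path is an assumption; substitute the module you want to build):

./gradlew :dns-crawler:dockerBuild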
Each module has its Helm chart. We use Helm to deploy the app in Kubernetes.
It is possible to create a local Kubernetes cluster with Docker Desktop.
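A hedged sketch of deploying a single module to such a local cluster; the chart location and release name below are assumptions, so check the module's Helm chart for the real values:

helm upgrade --install dns-crawler ./dns-crawler/helm   # assumed chart path
kubectl get pods                                        # verify the pod comes up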
Infrastructure as code is stored in another git repo.
TODO
If you are interested in contributing to Mercator:
- Start by reading our contributing guide.
- Navigate our codebase and open issues.
We are thankful for all the contributions and feedback we receive.
This tool was initially developed by DNS Belgium.
This version of Mercator is released under the Apache License, Version 2.0 (see LICENSE.txt).