Monitoring Deployment Scripts and Configuration Files (Nagios & Grafana) for Cosmos-Based Blockchains.

This repository contains configuration files for a Nagios 4 deployment that monitors one or more Cosmos/Tendermint-based validators:

  • Nagios server configuration files, which must be updated to match one's actual setup,

  • Remote server(s) setup and scripts (API, NRPE),

  • The remote hosts are monitored by querying a custom API (Python3/FastAPI) through Nagios' NRPE and getting some key parameters about the validator.

  • resize_volume.py is a script that automatically increases the disk space of Hetzner and DigitalOcean volumes, using their respective APIs.
    The relevant data (disk ids, API tokens, disk mount points) must be defined in another file named volume_data.py and both should be placed in /usr/local/nagios/libexec/

  • The script is triggered by Nagios, so the host and service names it uses must match the ones defined in Nagios (in our case, the host may be HETZNER-1, the service may be Check Disk Space 2, and so on).

  • The prometheus_install.sh script installs Prometheus and its systemd service, for example to feed a Grafana dashboard.
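As an illustration of the volume_data.py file mentioned above, here is a minimal sketch of how the disk ids, API tokens and mount points could be laid out. The variable names, the dictionary layout and the `find_volume` helper are assumptions made for this example, not the format the actual resize_volume.py expects, so check the script before reusing any of it.

```python
# volume_data.py (hypothetical layout; every value below is a placeholder)

# One API token per cloud provider.
API_TOKENS = {
    "hetzner": "REPLACE_WITH_HETZNER_TOKEN",
    "digitalocean": "REPLACE_WITH_DIGITALOCEAN_TOKEN",
}

# Volumes keyed by (Nagios host name, Nagios service description),
# so that the names match what Nagios passes to the script.
VOLUMES = {
    ("HETZNER-1", "Check Disk Space 2"): {
        "provider": "hetzner",
        "disk_id": "12345678",
        "mount_point": "/mnt/volume-1",
    },
}


def find_volume(host_name, service_name):
    """Return the volume entry for a Nagios host/service pair, or None."""
    return VOLUMES.get((host_name, service_name))
```

Keying the entries by host and service name means a mismatch with the Nagios configuration shows up as a missing entry rather than as a resize of the wrong disk.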

WHY?

  • We intend to provide a means for the community to monitor their validators, accessible even to those without much technical knowledge. The code is therefore deliberately kept as simple as possible. It may change, improve and grow more complex over time, but we will try to keep it easy to read and deploy.
  • The monitoring tool alerts you almost instantly whenever a problem arises, so you can resolve the issue quickly and, in turn, improve the overall stability of the network.
    Avoiding being slashed is also a nice perk.
  • Why Nagios? It is clearly not the most modern monitoring solution, but it is one of the easiest to understand and set up. It is also very light on resources: our initial deployment tests ran on a Raspberry Pi 3, which handled the task without any problem. Nagios can be installed on pretty much any Linux machine.

MONITORED METRICS

  • Disk space: WARNING at 6 GB left, CRITICAL at 3 GB (by default, alerts are sent on CRITICAL state only).
  • The Python script creates API endpoints that expose the following items:
    • Validator Status: whether the node is running (OK) or not (CRITICAL).
    • Block Delay: difference between the node's block timestamp and the official timestamp. WARNING if above 2 s (it is usually about 0.1 s at most).
    • Missed Block: WARNING if the node missed a block. Displays the missed block number, 'N/A' otherwise. This metric also checks that the block height is incrementing properly. The status becomes CRITICAL if the height does not increment for about 15 seconds.
    • Bonding Status: OK if the validator is bonded (= part of the active set), CRITICAL otherwise.

Other metrics and endpoints can easily be added to the script, and the corresponding services defined in Nagios.
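The thresholds listed above can be sketched as plain Python functions that return a Nagios plugin exit code (0 = OK, 1 = WARNING, 2 = CRITICAL) and a status message. This is only an illustration of the rules, not the code from cosmos_validators_monitoring.py: the function names, parameters and message formats are hypothetical.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_disk_space(free_gb, warn_gb=6.0, crit_gb=3.0):
    """WARNING at 6 GB left, CRITICAL at 3 GB left (defaults from this README)."""
    if free_gb <= crit_gb:
        return CRITICAL, f"DISK CRITICAL - {free_gb:.1f} GB left"
    if free_gb <= warn_gb:
        return WARNING, f"DISK WARNING - {free_gb:.1f} GB left"
    return OK, f"DISK OK - {free_gb:.1f} GB left"


def check_block_delay(delay_s, warn_s=2.0):
    """WARNING when the node's block timestamp lags by more than 2 s."""
    if delay_s > warn_s:
        return WARNING, f"BLOCK DELAY WARNING - {delay_s:.2f} s"
    return OK, f"BLOCK DELAY OK - {delay_s:.2f} s"


def check_height_progress(seconds_since_increment, crit_s=15.0):
    """CRITICAL when the block height has not incremented for about 15 s."""
    if seconds_since_increment >= crit_s:
        return CRITICAL, f"HEIGHT CRITICAL - stalled for {seconds_since_increment:.0f} s"
    return OK, "HEIGHT OK"
```

A plugin built on these helpers would print the message and exit with the returned code, which is how Nagios reads the state of a service.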

INSTRUCTIONS - CLIENT SIDE

  • Update the items in config.sh then just run automated_install.sh.
  • You can also run prometheus_install.sh if you intend to use Prometheus (with Grafana for example).
  • A Grafana template dashboard is provided in the GRAFANA folder.
  • TERRA & INJECTIVE specific: if you run an Oracle for Terra or a Peggo Orchestrator for Injective, these can be monitored as well. Update the addresses on lines 38 and 49 of cosmos_validators_monitoring.py.

INSTRUCTIONS - NAGIOS INSTALLATION AND CONFIGURATION

  • You need to install Nagios on a Linux host, along with the webserver (Apache by default, probably possible to use another if you're feeling adventurous).
    This tutorial is pretty straightforward: https://support.nagios.com/kb/article/nagios-core-installing-nagios-core-from-source-96.html
  • Once Nagios is installed, you can configure it using the templates provided in the etc and libexec directories. The default target location is /usr/local/nagios/.
  • IF YOU WISH TO RECEIVE ALERTS BY EMAIL: you also need to configure Postfix or sendmail on the Nagios server (note that you can also configure Nagios to send Discord/Telegram/SMS alerts).
    If you don't have a properly configured mail server with a certificate and everything, all the messages Nagios sends out may be flagged as spam.
    One (kind of lazy and ugly, yet quick and easy) option is to configure Postfix to send emails using a Gmail account as described here: https://www.linode.com/docs/guides/configure-postfix-to-send-mail-using-gmail-and-google-workspace-on-debian-or-ubuntu/
    This requires enabling the "Access for less secure apps" on this account, so better create a new account just for this.
  • IF YOU WISH TO RECEIVE DISCORD ALERTS:
    • In the libexec directory within the NAGIOS folder:
      • The discord_XXX_alerts.py files are exactly what their names suggest. Update them with your Discord webhook if you wish to receive such alerts. Otherwise, don't copy them, and delete the user 'discord' and every mention of it in Nagios' /objects/*.cfg files.
  • IMPORTANT: by default, Nagios alerts are sent to Discord first. If a service/host is still in CRITICAL state after 2 notifications, an escalation is triggered and alerts are sent by email.
    Do NOT overlook the email alerts. :)
  • You must update the provided files to match your local setup:
    -->htpasswd.users
    -->cgi.cfg
    -->All the files in /etc/objects/ except commands.cfg which should work as is.
    Verify that the config is correct with /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg then restart Nagios.
  • You must also ensure that the network configuration is correct (in particular, the NRPE port must be open on the remote hosts).
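To give an idea of what a Discord alert script involves, here is a minimal sketch that formats a Nagios notification and posts it to a Discord webhook as a JSON body with a `content` field. It is not the code of the provided discord_XXX_alerts.py files: the function names, the message format and the User-Agent string are assumptions, and the arguments would typically be filled from Nagios macros such as $NOTIFICATIONTYPE$, $HOSTNAME$, $SERVICEDESC$, $SERVICESTATE$ and $SERVICEOUTPUT$.

```python
import json
import urllib.request


def build_discord_payload(notification_type, host, service, state, output):
    """Format one Nagios notification as a Discord webhook message."""
    text = f"**{notification_type}** {host} / {service} is {state}: {output}"
    return {"content": text}


def send_discord_alert(webhook_url, payload):
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Explicit User-Agent (hypothetical value), in case the
            # default urllib one is rejected by the remote side.
            "User-Agent": "nagios-discord-alert/0.1",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the formatting separate from the network call makes it possible to check the message text without a real webhook.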

THE DISK RESIZE SCRIPT

  • An event_handler in Nagios automatically resizes cloud volumes when they are close to full. Since disk space is the main cost when it comes to cloud hosting, it's best not to buy storage that will remain unused for a long time.

  • The Python script uses the cloud provider's API to add 5 GB to a volume when it has only 3 GB left (this margin leaves enough time to resolve the issue manually if the resize fails).

  • It then connects to the server over SSH and runs the command that expands the filesystem.

  • This is most definitely a security concern, although the SSH account used is a very limited one that can only execute this command (using rbash and sudoers). We are working on a more secure solution.

  • Don't hesitate to ping us on Discord: Thomas | High Stakes#0885 or Joe | High Stakes#0880
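The resize flow described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual resize_volume.py: the function names are hypothetical, the endpoint is the volume resize action as we understand it from Hetzner's public Cloud API (check it against the provider's documentation), and `resize2fs` assumes an ext4 filesystem. The sketch only builds the request and the SSH command, it does not send anything.

```python
import json
import shlex
import urllib.request

GROW_BY_GB = 5  # amount added on each resize, as described in this README


def new_volume_size(current_size_gb, grow_by_gb=GROW_BY_GB):
    """Return the target size after adding the configured increment."""
    return current_size_gb + grow_by_gb


def build_hetzner_resize_request(volume_id, new_size_gb, api_token):
    """Build (but do not send) the resize request for a Hetzner volume."""
    url = f"https://api.hetzner.cloud/v1/volumes/{volume_id}/actions/resize"
    body = json.dumps({"size": new_size_gb}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_expand_command(ssh_user, host, device):
    """Build the SSH command that grows the filesystem on the remote host.

    Assumes an ext4 filesystem and a restricted account that is allowed
    to run this single command through sudo.
    """
    remote_command = f"sudo resize2fs {shlex.quote(device)}"
    return ["ssh", f"{ssh_user}@{host}", remote_command]
```

The request could then be sent with `urllib.request.urlopen` and the command run with `subprocess.run`, ideally only after the API call has succeeded.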