1. Description
2. Initial Deployment
3. Updating Deployment
4. Removing Deployment
5. Changelog
6. Known Issues
7. Planned For Later
Appendix. Directory Structure
Containerized Prometheus & Grafana installation with Docker on Ubuntu Linux, packaged as a Terraform IaC project with Ansible deployment automation.
Designed as a single-instance monitoring & visualization solution (on AWS EC2) that can be configured to collect metrics from other systems (multi-cloud VMs & containers) via Prometheus HTTP pull. Collected metrics & synthetic alerts are then visualized on Grafana dashboards, which can be accessed through the co-hosted Nginx proxy server.
- An AWS account (with administrative rights to perform step #2.2.1)
- The following packages & dependencies must be installed on the local terminal (or a cloud-based IDE such as AWS Cloud9):
| package | release |
| --- | --- |
| AWS CLI | >= 2.11 |
| Terraform | >= 1.5.5 |
| Git | >= 2.42.0 |
| Python3 | >= 3.9.6 |
| Pip3 | >= 23.3.2 |
| Boto3 | >= 1.34.23 |
| Ansible | >= 2.15.8 |
2.2.1. Go to AWS Console & create an IAM user for the infrastructure automation tasks;
- It is strongly recommended by AWS not to create access keys for the root account.
AWS Console > IAM > Users:
- Create User > Specify User Name
- Set Permissions > Attach Policies Directly > Choose 'AdministratorAccess'
- Create Access Key > Other
- 'AdministratorAccess' may be replaced with a custom least-privilege permissions policy in a future release.
2.2.2. Configure AWS CLI environment on the local terminal (or cloud-based IDE) with the access keys obtained from #2.2.1;
aws configure
- Follow the prompts to enter the AWS Access Key ID and Secret Access Key obtained in step #2.2.1.
2.2.3. Clone the remote repository onto the local terminal and change working directory;
git clone https://github.com/onur-zengin/aws-ec2-linux-docker.git
cd aws-ec2-linux-docker/
2.2.4. Create an RSA key pair in your home directory and copy the public key aws_linux.pub here, under the ./keys directory;
ssh-keygen -t rsa -m PEM -f ~/.ssh/aws_linux
chmod 400 ~/.ssh/aws_linux
cp ~/.ssh/aws_linux.pub ./keys
- The key pair will be used for SSH access to the EC2 instance later.
2.2.5. Execute the Ansible playbook to deploy the Terraform infrastructure;
ansible-playbook ansible-deploy.yml -i localhost,
- Do not skip the trailing comma (,) after localhost.
- Specify an AWS deployment region at the prompt, or press Enter to accept the default (eu-central-1).
- Collect the HOST_IP_ADDRESS from the output of step #2.2.5, and try the following URLs in a web browser;
http://[HOST_IP_ADDRESS]/prom
http://[HOST_IP_ADDRESS]/graf
- Grafana will be installed with the default admin password admin. You may use the Grafana web interface or procedure #3.4 below to change it.
- If / when you also complete procedure #3.2 (optional), the proxy server will redirect connection attempts to secure (HTTPS) URLs instead.
3.1.1. Install the Prometheus node_exporter binary on the target hosts and make sure it is running;
| package | release |
| --- | --- |
| node_exporter | >= 1.6.1 |
- Sample installation procedure for Ubuntu Linux 22 - x64;
sudo su -
useradd pne
mkdir -p /usr/local/bin/prometheus_ne
cd /usr/local/bin/prometheus_ne
wget -q https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar -xzvf node_exporter-1.6.1.linux-amd64.tar.gz
chown -R pne /usr/local/bin/prometheus_ne
cd node_exporter-1.6.1.linux-amd64/
su pne -c "./node_exporter --web.listen-address 0.0.0.0:9100 &"
3.1.2. Verify that the node_exporter on the target host is responsive;
curl http://[TARGET_IP_ADDRESS]:9100/metrics
3.1.3. Important: Make sure to update the AWS Security Group and / or other firewalls (if any) in front of the target host(s) to allow incoming connections on TCP port 9100 only from the HOST_IP_ADDRESS that was printed in the output of step #2.2.5.
3.1.4. On the local terminal, edit configs/prometheus/prometheus.yml to add the new targets to the configuration as applicable.
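- For reference, a minimal sketch of what an added target could look like under scrape_configs, assuming the default static_configs layout (the job name, label, and IP address below are placeholders, not values from this repository);

```yaml
scrape_configs:
  - job_name: "remote_nodes"            # hypothetical job name
    scrape_interval: 30s
    static_configs:
      - targets:
          - "XX.XX.XX.XX:9100"          # node_exporter endpoint on the target host
        labels:
          env: "demo"                   # optional label, purely illustrative
```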
3.1.5. Apply changes;
ansible-playbook ansible-update.yml -i localhost,
- By design, changes made to the configuration files will trigger the EC2 instance to be re-created, while its static IP address and application data are persisted.
- Note: If / when working with a large number of targets, these steps may also be automated with Ansible.
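- For illustration, such an automation could be a short playbook run against an inventory of target hosts. A minimal sketch, assuming an inventory group named monitored_hosts and the same install path and service user as in #3.1.1 (the group name and variables below are assumptions, not part of this repository);

```yaml
# Hypothetical playbook: install and start node_exporter on an inventory group
- hosts: monitored_hosts                # assumed inventory group name
  become: true
  vars:
    ne_version: "1.6.1"
    ne_dir: /usr/local/bin/prometheus_ne
  tasks:
    - name: Create service user
      ansible.builtin.user:
        name: pne
        system: true

    - name: Create install directory
      ansible.builtin.file:
        path: "{{ ne_dir }}"
        state: directory
        owner: pne

    - name: Download and unpack node_exporter
      ansible.builtin.unarchive:
        src: "https://github.com/prometheus/node_exporter/releases/download/v{{ ne_version }}/node_exporter-{{ ne_version }}.linux-amd64.tar.gz"
        dest: "{{ ne_dir }}"
        remote_src: true
        owner: pne
        creates: "{{ ne_dir }}/node_exporter-{{ ne_version }}.linux-amd64"

    # A systemd unit would be a more durable way to keep the exporter running;
    # this mirrors the manual procedure in #3.1.1.
    - name: Start node_exporter in the background
      ansible.builtin.shell: "nohup ./node_exporter --web.listen-address 0.0.0.0:9100 >/dev/null 2>&1 &"
      args:
        chdir: "{{ ne_dir }}/node_exporter-{{ ne_version }}.linux-amd64"
      become_user: pne
```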
3.2.1. Go to your DNS zone configuration and create an A record for the static IP address;
DOMAIN_NAME A HOST_IP_ADDRESS
vmon.foo.com A XX.XX.XX.XX
3.2.2. Obtain a TLS certificate for the DOMAIN_NAME created above (note that you may also use an existing wildcard cert for the parent domain).
- Sample instructions for requesting a certificate from Let's Encrypt can be found at;
https://certbot.eff.org/
- Sample output;
sudo certbot certonly --dns-route53 -d *.foo.com
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Requesting a certificate for *.foo.com
Successfully received certificate.
Certificate is saved at: /etc/letsencrypt/live/foo.com/fullchain.pem
Key is saved at: /etc/letsencrypt/live/foo.com/privkey.pem
This certificate expires on 2024-04-21.
These files will be updated when the certificate renews.
3.2.3. Upload the TLS certificate to AWS Secrets Manager;
chmod +x ./scripts/putSecrets.py
sudo ./scripts/putSecrets.py PATH_TO_PEM_FILES DOMAIN_NAME AWS_REGION
- The Python script will look for fullchain.pem and privkey.pem inside the specified path and upload them to AWS Secrets Manager. Subsequently, it will update the local configuration files for Docker & Nginx.
- Sample usage;
sudo ./scripts/putSecrets.py /etc/letsencrypt/live/foo.com vmon.foo.com eu-central-1
3.2.4. Apply changes;
ansible-playbook ansible-update.yml -i localhost,
- By design, changes made to the configuration files will trigger the EC2 instance to be re-created, while its static IP address and application data are persisted.
3.3.1. On the local terminal, edit configs/prometheus/alertrules.yml as necessary.
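- For reference, a minimal sketch of an alert rule in the Prometheus rule-file format; the group name, alert name, and threshold below are illustrative and not taken from the repository's alertrules.yml;

```yaml
groups:
  - name: example_rules                 # hypothetical rule group
    rules:
      - alert: InstanceDown             # fires when a scrape target stops responding
        expr: up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Target {{ $labels.instance }} has been unreachable for 5 minutes."
```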
3.3.2. Apply changes;
ansible-playbook ansible-update.yml -i localhost,
- By design, changes made to the configuration files will trigger the EC2 instance to be re-created, while its static IP address and application data are persisted.
- Note: For convenience, these steps may be replaced with a CI/CD pipeline.
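- As a rough illustration, such a pipeline could re-run the update playbook whenever the configuration files change. A minimal GitHub Actions sketch, assuming AWS credentials are stored as repository secrets; Terraform and any SSH key material the playbook needs would also have to be available on the runner (the workflow path, secret names, and trigger below are assumptions, not part of this project);

```yaml
# .github/workflows/update.yml (hypothetical)
name: apply-config-changes
on:
  push:
    paths:
      - "configs/**"                    # re-apply whenever monitoring configs change
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install ansible boto3
      - name: Run the update playbook
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: ansible-playbook ansible-update.yml -i localhost,
```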
3.4.1. Connect to the EC2 instance;
ssh -i ~/.ssh/aws_linux ubuntu@[HOST_IP_ADDRESS]
3.4.2. Prometheus;
tbc
3.4.3. Grafana;
sudo docker exec -u root $(sudo docker ps | grep graf | awk '{print $1}') grafana cli admin reset-admin-password [NEW_PASSWORD]
Same as #2.1
4.2.1. Execute the following Ansible playbook to destroy the Terraform infrastructure;
ansible-playbook ansible-destroy.yml -i localhost,
- The command will look for the AWS region information and S3 bucket names inside ansible-state.json, which was auto-created during the initial deployment.
4.2.2. If you also uploaded your TLS certificate to AWS Secrets Manager (as shown in #3.2), remove it with the following command;
aws secretsmanager delete-secret --secret-id cert-encoded --force-delete-without-recovery --region [AWS_REGION]
n/a
n/a
- Add email alerting
- Optimize memory usage on the Prometheus host
- Add authentication to Prometheus web interface
- Define least-privilege permissions for the AWS IAM policies
- Add Grafana deep-links to worldmap dashboard
- Add EBS Data Snapshot & Backup
- Complete the demo_fargate module to test container monitoring
- Automate TLS certificate renewal
.
├── configs
│ ├── docker
│ │ ├── compose.yml # Sets up Docker bridge network and container runtime
│ │ ├── daemon.json # Sets Docker root directory on the EBS drive (for data persistence)
│ ├── grafana
│ │ ├── db_worldmap.json # Dashboard configuration (World Map view)
│ │ ├── db_syscharts.json # Dashboard configuration (CPU, Mem, Disk, & NW-interface utilization charts)
│ │ ├── geo.json # Geo-coordinates of AWS regions for visualization purposes on Grafana dashboard
│ ├── nginx
│ │ ├── nginx_http.conf # Basic (non-secure) web server configuration (used when TLS cert not found)
│ │ ├── nginx.conf # Secure web server configuration
│ ├── prometheus
│ │ ├── alertrules.yml
│ │ ├── prometheus.yml # Main configuration file for Prometheus (including targets)
│ │ ├── records.yml # Frequently queried metrics to pre-populate TSDB
├── images # Logo images to differentiate the vmon host from other nodes on the world map dashboard
│ ├── logo_base.svg
│ ├── logo_alert.svg
├── keys
│ ├── aws_linux.pub # SSH public key file for remote access to the main EC2 host (not included)
├── modules
│ ├── demo_ec2 # (optional) Demo module to setup EC2 VMs as synthetic targets for Prometheus
│ ├── demo_fargate # (optional) Demo module to setup Fargate Containers as synthetic targets for Prometheus
│ ├── grafana # Post-installation Grafana setup as Terraform IaC (deprecated & replaced with Ansible tasks)
├── policies # AWS IAM resource-based policies
│ ├── ec2_assumeRole.json
│ ├── ec2_getSecrets.json
│ ├── s3_bucketPolicy.json
├── scripts # Python scripts to upload & download TLS certs to & from AWS Secrets Manager
│ ├── getSecrets.py # Executed on EC2
│ ├── putSecrets.py # Executed on the local terminal
│ ├── requirements.txt # Python requirements for putSecrets.py
├── ansible-deploy.yml # Ansible playbook file to deploy Terraform IaC
├── ansible-destroy.yml # Ansible playbook file to destroy Terraform IaC
├── ansible-update.yml # Ansible playbook file to modify Terraform IaC
├── ansible-state.json # Local state file that will be auto-created during initial deployment
├── ansible.cfg # Ansible configuration with Python interpreter auto-detection disabled
├── backend.tf # Terraform remote backend on AWS S3 & DynamoDB
├── bootstrap.tf # Cloud-init configuration to upload files & install packages on the EC2 instance during boot
├── demo.tf # Configuration settings for the demo setup (default: off)
├── LICENCE # MIT License
├── main.tf # Main Terraform IaC build
├── outputs.tf # Terraform outputs (modifications to this file will impact the ansible-deploy procedure)
├── providers.tf # Terraform providers
├── README.md # This file
├── variables.tf # Environment variables for the main instance. Sub-modules' variables are placed under their respective directories.