Add missing configuration options and fix ansible lint errors #22

Draft · wants to merge 27 commits into base `trunk`

2 changes: 1 addition & 1 deletion .gitignore
@@ -1 +1 @@
target/*
target/*
3 changes: 3 additions & 0 deletions ansible/.gitignore
@@ -1 +1,4 @@
notes.txt
ansible.cfg
.yamlfmt
.run.sh
268 changes: 268 additions & 0 deletions ansible/README.md
@@ -0,0 +1,268 @@
# LARD on OpenStack 2

## Get access to OpenStack

You need to create application credentials in the project you are going to
create the instances in, so that the ansible scripts can connect to the right
`ostack_cloud` (in our case it's `lard`).

The file should exist at `~/.config/openstack/clouds.yml`.
If you have MET access, see what is written at the start of the readme [here](https://gitlab.met.no/it/infra/ostack-ansible21x-examples)
or in the authentication section [here](https://gitlab.met.no/it/infra/ostack-doc/-/blob/master/ansible-os.md?ref_type=heads).
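
For reference, a minimal sketch of what `~/.config/openstack/clouds.yml` could look like with application credentials (the `auth_url` and credential values are placeholders; the cloud name matches the `ostack_cloud: lard` used by the playbooks):

```yaml
clouds:
  lard:
    auth_type: v3applicationcredential
    auth:
      auth_url: https://keystone.example.met.no:5000/v3 # placeholder
      application_credential_id: "<credential-id>"
      application_credential_secret: "<credential-secret>"
    region_name: Ostack2-EXT # region name used elsewhere in these playbooks
```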

## Dependencies

- Python 3.10+

- In your terminal, run the following:

```terminal
python3 -m venv ~/.venv/lard
source ~/.venv/lard/bin/activate

pip install -r requirements.txt
ansible-galaxy collection install -fr requirements.yml
```

## Setup

### 1. Provision!

> [!IMPORTANT]
> Add your public key to the Ostack GUI.
> Go to "Compute" then "Key Pairs" and import your public key; it will be used during this step.

The IPs associated with the hosts in `inventory.yml` should correspond to
floating IPs you have requested in the network section of the OpenStack GUI.
These IPs are stored in the `ansible_host` variables inside each `host_vars/<host_name>.yml`.
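
For illustration, a hypothetical `host_vars/lard-a.yml` could look like this (the file name and IP are placeholders):

```yaml
# host_vars/lard-a.yml (hypothetical example)
ansible_host: 157.249.x.x # floating IP requested in the OpenStack GUI
```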

Private variables are encrypted with `ansible-vault` and stored in separate files per role inside `group_vars/servers/vault`.
You can either decrypt them beforehand, or pass the `-J` flag to ansible when running the playbooks.
Passwords can be found in the [CI/CD variables](https://gitlab.met.no/met/obsklim/bakkeobservasjoner/lagring-og-distribusjon/db-products/poda/-/settings/ci_cd).
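
For example (a sketch; the exact vault file names depend on the roles in this repo):

```terminal
# decrypt the vault files in place (re-encrypt later with "ansible-vault encrypt")
ansible-vault decrypt group_vars/servers/vault/*

# or leave them encrypted and append -J to the ansible-playbook commands below
# to be prompted for the vault password
```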

```terminal
ansible-playbook -i inventory.yml -e key_name=... provision.yml
```

> [!NOTE]
> If the network has already been set up and you only need to rebuild the VMs, you can do so with
>
> ```terminal
> ansible-playbook -i inventory.yml -e key_name=... provision.yml --skip-tags network
> ```

### 2. Configure!

The floating IP (`fip`) being passed in here is the one that gets associated with the primary, and moved when doing a switchover.

> [!NOTE]
> The floating IP association times out, but this is ignored as it is a known bug.

```terminal
ansible-playbook -i inventory.yml -e fip=... -e db_password=... -e repmgr_password=... configure.yml
```

The parts to do with the floating IP that belongs to the primary (ipalias) are based on this [repo](https://gitlab.met.no/ansible-roles/ipalias/-/tree/master?ref_type=heads).

#### SSH into the VMs

It might be helpful to create host aliases and add them to your `~/.ssh/config` file,
so you don't have to remember the IPs by heart. An example host alias looks like the following:

```ssh
Host lard-a
    HostName 157.249.*.*
    User ubuntu
```

Then run:

```terminal
ssh lard-a
PGPASSWORD=... psql -h localhost -p 5432 -U lard_user -d lard
```
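
Once on the primary, a quick way to confirm that the standby is streaming is the standard `pg_stat_replication` view (not specific to this setup):

```terminal
sudo -u postgres psql -c 'SELECT client_addr, state, sync_state FROM pg_stat_replication;'
```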

> [!NOTE]
> You can also connect from your computer, but
> unfortunately the ssh alias does not work for psql.
> You can define a separate service inside `~/.pg_service.conf`
>
> ```
> [lard-a]
> host=157.249.*.*
> port=5432
> user=lard_user
> dbname=lard
> password=...
> ```
>
> And then
>
> ```terminal
> psql service=lard-a
> ```

#### Checking the status of the cluster

After `ssh`-ing into the server and becoming the `postgres` user (`sudo su postgres`), you can check the repmgr status with:

```terminal
postgres@lard-a:/home/ubuntu$ repmgr -f /etc/repmgr.conf node check
Node "lard-a":
Server role: OK (node is primary)
Replication lag: OK (N/A - node is primary)
WAL archiving: OK (0 pending archive ready files)
Upstream connection: OK (N/A - node is primary)
Downstream servers: OK (1 of 1 downstream nodes attached)
Replication slots: OK (node has no physical replication slots)
Missing physical replication slots: OK (node has no missing physical replication slots)
Configured data directory: OK (configured "data_directory" is "/mnt/ssd-data/16/main")
```

```terminal
postgres@lard-b:/home/ubuntu$ repmgr -f /etc/repmgr.conf node check
Node "lard-b":
Server role: OK (node is standby)
Replication lag: OK (0 seconds)
WAL archiving: OK (0 pending archive ready files)
Upstream connection: OK (node "lard-b" (ID: 2) is attached to expected upstream node "lard-a" (ID: 1))
Downstream servers: OK (this node has no downstream nodes)
Replication slots: OK (node has no physical replication slots)
Missing physical replication slots: OK (node has no missing physical replication slots)
Configured data directory: OK (configured "data_directory" is "/mnt/ssd-data/16/main")
```

While a few of the configuration options are found in `/etc/postgresql/16/main/postgresql.conf`, many of them
can only be seen in `/mnt/ssd-data/16/main/postgresql.auto.conf` (you need `sudo` to view its contents).
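
For example, to inspect those settings (a sketch; `pg_settings` is a standard PostgreSQL view):

```terminal
sudo cat /mnt/ssd-data/16/main/postgresql.auto.conf

# or, from psql, list the settings that come from that file
sudo -u postgres psql -c "SELECT name, setting FROM pg_settings WHERE sourcefile LIKE '%postgresql.auto.conf';"
```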

### 3. Deploy LARD

This is as simple as running

```terminal
ansible-playbook -i inventory.yml deploy.yml
```

### 4. Teardown

> **TODO**: This should be automated if possible

If you need to delete the old VMs (Compute -> Instances) and Volumes (Volumes
-> Volumes) you can do so in the OpenStack GUI.

> [!CAUTION]
> When deleting things in order to rebuild them, if for some reason one of the IPs
> does not get disassociated properly, you have to do it manually in the GUI (Network -> Floating IPs).
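
A possible starting point for automating the teardown with the OpenStack CLI (a sketch; the server and volume names are placeholders and depend on what the provision playbook created):

```terminal
openstack --os-cloud lard server list
openstack --os-cloud lard server delete lard-a lard-b   # placeholder names
openstack --os-cloud lard volume list
openstack --os-cloud lard volume delete <volume-name>   # placeholder
```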

## Switchover

### 1. Planned maintenance

This should only be used when both VMs are up and running, e.g. in the case of planned maintenance in one data room.
You can use this playbook ahead of time to switch the primary over to the data room that will stay available.

**Make sure you know which one is the current primary, and put the names the right way around in this call.**

> **TODO**: This should be automated

```terminal
ansible-playbook -i inventory.yml -e primary=... -e standby=... -e fip=... switchover.yml
```

This should also be possible to do manually; you might need to follow what is done in the ansible script (i.e. restarting postgres on both VMs),
and then perform the switchover (as the `postgres` user):

```terminal
repmgr standby switchover -f /etc/repmgr.conf --siblings-follow
```
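
If you want to rehearse it first, repmgr's switchover supports a dry run (a sketch):

```terminal
repmgr standby switchover -f /etc/repmgr.conf --siblings-follow --dry-run
```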

### Promote standby (assuming the primary is down)

This is used in the case where the primary has gone down (e.g. unplanned downtime of a data room).
Make sure you know which one you want to promote!

**Manually:**

1. `ssh` into the standby

1. Check the status

```terminal
repmgr -f /etc/repmgr.conf cluster show
```

The primary should show as **unreachable**.

1. Then promote the standby to primary (while `ssh`-ed into the standby VM)

```terminal
repmgr -f /etc/repmgr.conf standby promote
```

1. You can then check the status again (now the old primary will show as **failed**)

1. Then move the IP in the OpenStack GUI (Network -> Floating IPs): disassociate it, then associate it with the ipalias port on the other VM

#### Later, when the old primary comes back up

The cluster will be in a slightly confused state, because this VM still thinks it's a primary (although repmgr tells it that the other one is running as a primary as well). If the setup is running asynchronously, we could lose data that wasn't copied over before the crash; if running synchronously, there should be no data loss.

SSH into the new primary; `repmgr -f /etc/repmgr.conf cluster show` says:

- node "lard-a" (ID: 1) is running but the repmgr node record is inactive

SSH into the old primary; `repmgr -f /etc/repmgr.conf cluster show` says:

- node "lard-b" (ID: 2) is registered as standby but running as primary

With a **playbook** (`rejoin_ip` is the IP of the primary node that has been down and should now become a standby):

```terminal
ansible-playbook -i inventory.yml -e rejoin_ip=... -e primary_ip=... rejoin.yml
```

Or **manually**:

1. Make sure the postgres process is stopped (see the fast stop command in the testing section below) if it isn't already

1. Become the postgres user: `sudo su postgres`

1. Test the rejoin (host is the IP of the new / current primary, aka the other VM):

   `repmgr node rejoin -f /etc/repmgr.conf -d 'host=157.249.*.* user=repmgr dbname=repmgr connect_timeout=2' --force-rewind=/usr/lib/postgresql/16/bin/pg_rewind --verbose --dry-run`

1. Perform the rejoin:

   `repmgr node rejoin -f /etc/repmgr.conf -d 'host=157.249.*.* user=repmgr dbname=repmgr connect_timeout=2' --force-rewind=/usr/lib/postgresql/16/bin/pg_rewind --verbose`

### For testing:

Take out one of the replicas (or shut off the instance in the OpenStack GUI):
`sudo pg_ctlcluster 16 main -m fast stop`

To bring it back up (or turn it back on):
`sudo pg_ctlcluster 16 main start`
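
To check the state of the local cluster afterwards (a sketch; `pg_lsclusters` ships with Ubuntu's postgresql-common package):

```terminal
pg_lsclusters
sudo -u postgres repmgr -f /etc/repmgr.conf cluster show
```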

### Load balancing

This role creates a user and a basic db for the load balancer to test the health of the db. Part of the role is allowed to fail on the standby ("cannot execute \_\_\_ in a read-only transaction"), as it should pass on the primary and be replicated over. The `pg_hba.conf` change needs to be run on both.

The vars are encrypted, so first run `ansible-vault decrypt roles/bigip/vars/main.yml`.

Then run the bigip role on the VMs:

```terminal
ansible-playbook -i inventory.yml -e bigip_password=... bigip.yml
```

### Links:

https://www.enterprisedb.com/postgres-tutorials/postgresql-replication-and-automatic-failover-tutorial#replication

#### Useful ansible commands:

```terminal
ansible-inventory -i inventory.yml --graph

ansible servers -m ping -u ubuntu -i inventory.yml
```
49 changes: 2 additions & 47 deletions ansible/bigip.yml
@@ -1,52 +1,7 @@
- name: Copy schema for bigip
vars:
ostack_cloud: lard
ostack_region: Ostack2-EXT
hosts: localhost # need to seperate this since done from localhost
gather_facts: false
pre_tasks:
# copy file, so we have an .sql file to apply locally
- name: Create a directory if it does not exist
ansible.builtin.file:
path: /etc/postgresql/16/db/bigip
state: directory
mode: '0755'
become: true
delegate_to: '{{ hostvars[groups["servers"][0]].ansible_host }}'
remote_user: ubuntu
- name: Copy the schema to the remote 1
ansible.builtin.copy:
src: ./roles/bigip/vars/bigip.sql
dest: /etc/postgresql/16/db/bigip/bigip.sql
mode: '0755'
become: true
delegate_to: '{{ hostvars[groups["servers"][0]].ansible_host }}'
remote_user: ubuntu
- name: Create a directory if it does not exist
ansible.builtin.file:
path: /etc/postgresql/16/db/bigip
state: directory
mode: '0755'
become: true
delegate_to: '{{ hostvars[groups["servers"][1]].ansible_host }}'
remote_user: ubuntu
- name: Copy the schema to the remote 2
ansible.builtin.copy:
src: ./roles/bigip/vars/bigip.sql
dest: /etc/postgresql/16/db/bigip/bigip.sql
mode: '0755'
become: true
delegate_to: '{{ hostvars[groups["servers"][1]].ansible_host }}'
remote_user: ubuntu

---
- name: Create what is needed for the bigip load balancers
hosts: servers
remote_user: ubuntu
vars:
ostack_cloud: lard
ostack_region: Ostack2-EXT
gather_facts: false
# loops over both servers
roles:
- role: bigip
# will fail to create table in the standby (since read only)
- role: bigip