diff --git a/docs/logs-ovh3.md b/docs/logs-ovh3.md
index 664d42f9..8d06dd31 100644
--- a/docs/logs-ovh3.md
+++ b/docs/logs-ovh3.md
@@ -3,6 +3,25 @@
 Report here the timeline of incidents and interventions on ovh3 server.
 Keep things short or write a report.
 
+## 2024-10-31 system taking 100% CPU
+
+* Server is not accessible via SSH.
+* Munin shows 100% CPU taken by the system.
+* We ask for a hard reboot from the OVH console.
+* After the restart, the system continues to use 100% CPU.
+* `top` shows that arc_prune + arc_evict are using 100% CPU.
+* Exploring the logs does not show any strange messages.
+* `cat /proc/spl/kstat/zfs/arcstats | grep arc_meta` shows arc_meta_used < arc_meta_max (so it's OK).
+* We soft reboot the server.
+* It is back to normal.
+
+## 2024-10-10 faulty sda
+
+sda on ovh3 is faulty (64 Current_Pending_Sector, 2 Reallocated_Event_Count).
+See https://github.com/openfoodfacts/openfoodfacts-infrastructure/issues/424
+
+
+
 ## 2023-12-05 certificates for images expired
 
 Images not displaying anymore on the website due to SSL problem (signaled by Edouard, with alert by blackbox exporter)
diff --git a/docs/proxmox.md b/docs/proxmox.md
index ce056078..dee9ac8c 100644
--- a/docs/proxmox.md
+++ b/docs/proxmox.md
@@ -6,8 +6,31 @@ On ovh1 and ovh2 we use proxmox to manage VMs.
 
 ## Proxmox Backups
 
-Every VM / CT is backuped twice a week using general proxmox backup, in a specific zfs dataset
-(see Datacenter -> backup)
+**IMPORTANT:** We don't use the standard proxmox backups[^previous_backups] (see Datacenter -> Backup).
+
+Instead, we use [syncoid / sanoid](./sanoid.md) to snapshot data and synchronize it to other servers.
+
+[^previous_backups]: Previously, every VM / CT was backed up twice a week using the general proxmox backup, in a specific ZFS dataset.
+
+## Storage synchronization
+
+We don't use standard proxmox storage replication, because it is incompatible with [syncoid / sanoid](./sanoid.md): it removes snapshots on the destination and does not allow choosing the destination location.
+
+This means that restoring a container / VM won't be automatic and will need manual intervention.
+
+### Replication (don't use it)
+
+Previously, VM and container storage were regularly synchronized to ovh3 (and possibly to ovh1/2).
+
+Replication can be seen in the web interface, by clicking on the "Replication" section of a particular container / VM.
+
+This is managed with the `pvesr` command line tool (PVE Storage Replication). See the [official doc](https://pve.proxmox.com/wiki/Storage_Replication).
+
+* To add a replication on a container / VM:
+  * In the Replication menu of the container, "Add" one
+  * Target: the server you want
+  * Schedule: `*/5` if you want every 5 minutes (a run takes less than 10 seconds, thanks to ZFS)
+
 
 ## Host network configuration
 
@@ -117,22 +140,16 @@ At OVH we have special DNS entries:
 * `proxy1.openfoodfacts.org` pointing to OVH reverse proxy
 * `off-proxy.openfoodfacts.org` pointing to Free reverse proxy
 
-## Storage synchronization
-
-VM and container storage are regularly synchronized to ovh3 (and eventually to ovh1/2) to have a continuous backup.
-
-Replication can be seen in the web interface, clicking on "replication" section on a particular container / VM.
-
-This is managed with command line `pvesr` (PVE Storage replication). See [official doc](https://pve.proxmox.com/wiki/Storage_Replication)
-
 ## How to migrate a container / VM
 
 You may want to move containers or VM from one server to another.
 
-Just go to the interface, right click on the VM / Container and ask to migrate !
+**FIXME**: this will not work with sanoid/syncoid; a rough manual procedure is sketched below.
+
+~~Just go to the interface, right click on the VM / Container and ask to migrate !~~
 
-If you have a large disk, you may want to first setup replication of your disk to the target server (see [Storage synchronization](#storage-synchronization)), schedule it immediatly (schedule button)− and then run the migration.
+~~If you have a large disk, you may want to first setup replication of your disk to the target server (see [Storage synchronization](#storage-synchronization)), schedule it immediatly (schedule button)− and then run the migration.~~
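+
+Until a proper procedure is documented, here is a rough, untested sketch of a manual migration, assuming the container's volume is already synced to the target with syncoid as described in [the 2024-10-30 backups report](./reports/2024-10-30-ovh3-backups.md). Container ID, node names, operator account and IP below are placeholders:
+
+```bash
+# on the source node (say ovh1): stop the container so its volume stops changing
+pct stop 130
+# on the target node (say ovh2): pull a final incremental sync
+# (10.0.0.1 is assumed to be the source node, as in the backups report)
+syncoid --no-privilege-elevation ovh2operator@10.0.0.1:rpool/subvol-130-disk-0 rpool/subvol-130-disk-0
+# on any cluster node: hand the container over by moving its config inside the cluster filesystem
+mv /etc/pve/nodes/ovh1/lxc/130.conf /etc/pve/nodes/ovh2/lxc/130.conf
+# on the target node: start the container again
+pct start 130
+```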
 
 ## How to Unlock a Container
 
@@ -254,11 +271,6 @@ Using the web interface:
   * Start at boot: Yes
   * Protection: Yes (to avoid deleting it by mistake)
 
-* Eventually Add replication to ovh3 or off1/2 (if we are not using sanoid/syncoid instead)
-  * In the Replication menu of the container, "Add" one
-  * Target: ovh3
-  * Schedule: */5 if you want every 5 minutes (takes less than 10 seconds, thanks to ZFS)
-
 Also think about [configuring email](./mail.md#postfix-configuration) in the container
 
 ## Logging in to a container or VM
diff --git a/docs/reports/2024-10-30-ovh3-backups.md b/docs/reports/2024-10-30-ovh3-backups.md
new file mode 100644
index 00000000..324ba01a
--- /dev/null
+++ b/docs/reports/2024-10-30-ovh3-backups.md
@@ -0,0 +1,87 @@
+# 2024-10-30 OVH3 backups
+
+We need an intervention to change a disk on ovh3.
+
+We still have very few backups for OVH services.
+
+Before the operation, I want to at least have a replica of the OVH backups on the new MOJI server.
+
+We previously tried to do this while keeping replication, but it did not work well.
+
+So here is what we are going to do:
+* remove replication and let sanoid / syncoid deal with the replication to ovh3
+  * we will have fewer snapshots on the ovh1/ovh2 side and we will use a dedicated replication snapshot
+    to avoid relying on a common existing snapshot made by syncoid
+
+Note that we don't replicate between ovh1 and ovh2 because we have very little space left on the disks.
+
+## Changing sanoid / syncoid config and removing replication
+
+First, because we won't use replication anymore, we have to create the ovh3operator on ovh1 and ovh2,
+and as we want to use a replication snapshot, we have to grant the corresponding ZFS rights.
+See [Sanoid / creating operator on PROD_SERVER](../sanoid.md#creating-operator-on-prod_server).
+
+I also had to link the zfs command in /usr/bin: `ln -s /usr/sbin/zfs /usr/bin`
+
+For each VM / CT separately, I did:
+* disable replication
+* I didn't have to change the syncoid policy on ovh1/2
+* on ovh3:
+  * configure the sanoid policy from replicated_volumes to a regular synced_data one
+  * use syncoid to sync the volume and add the corresponding line to syncoid-args (using a specific snapshot)
+
+I started with CT 130 (contents) and ran the syncoid command manually:
+```bash
+syncoid --no-privilege-elevation ovh3operator@10.0.0.1:rpool/subvol-130-disk-0 rpool/subvol-130-disk-0
+```
+I then did some less important CTs and VMs (113, 140, 200) and decided to wait until the next day to check that everything was OK.
+
+The day after, I did the same for CT 101, 102, 103, 104, 105, 106, 108, 202 and 203 on ovh1,
+and 107, 110 and 201 on ovh2.
+
+I removed 109 and 120.
+
+I also removed the sync of CT 107 to ovh1 (because we are nearly out of disk space) and removed the volume there.
+
+## Syncing between ovh1 and ovh2
+
+We have two VMs that are really important to replicate from ovh1 to ovh2 (using syncoid, not proxmox replication).
+
+So I [created an ovh2operator on ovh1](../sanoid.md#creating-operator-on-prod_server).
+
+I installed the syncoid systemd service and enabled it.
+Same for `sanoid_check`.
+
+I added a synced_data template to sanoid.conf to use for synced volumes.
+I removed volumes 101 and 102 and manually synced them from ovh1 to ovh2.
+Then I added them to `syncoid-args.conf`.
+
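+As an illustration only (the real values live in our sanoid configuration, see [sanoid.md](../sanoid.md)), a sanoid.conf template for volumes that only receive snapshots typically looks something like this:
+
+```ini
+# illustrative sketch, not the exact values we use:
+# synced volumes take no local snapshots, they only prune what they receive
+[template_synced_data]
+  autosnap = no
+  autoprune = yes
+  hourly = 0
+  daily = 30
+  monthly = 3
+
+# each synced volume then points at the template
+[rpool/subvol-101-disk-0]
+  use_template = synced_data
+```
+
+The line added to `syncoid-args.conf` carries essentially the same source / destination pair as the manual syncoid command shown earlier in this report.
+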
+## Removing dump backups on ovh1/2
+
+We also decided to remove the dump backups on ovh1/2/3.
+
+Going to the proxmox interface, Datacenter, Backup, I disabled the backups.
+
+## Removing replication snapshots
+
+On ovh1, ovh2 and ovh3 I removed the `__replicate_` snapshots:
+
+```bash
+zfs list -r -t snap -o name rpool|grep __replicate_
+zfs list -r -t snap -o name rpool|grep __replicate_|xargs -n 1 zfs destroy
+```
+
+Also on osm45:
+```bash
+zfs list -r -t snap -o name hdd-zfs/off-backups/ovh3-rpool|grep __replicate_
+zfs list -r -t snap -o name hdd-zfs/off-backups/ovh3-rpool|grep __replicate_|xargs -n 1 zfs destroy
+```
+
+I did the same for the vzdump snapshots, as backups are no longer active.
+
+
+## Checking syncs on osm45 (Moji)
+
+We would not need to use a sanoid-specific snapshot on Moji anymore, but I'll leave it like that for now!
+
+Syncs seem OK.
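+
+For the record, a quick way to eyeball whether a given volume is current on osm45 (the dataset path below is an assumed example following the layout above):
+
+```bash
+# show the most recent snapshots received for one synced dataset, newest last
+zfs list -d 1 -t snap -o name,creation -s creation hdd-zfs/off-backups/ovh3-rpool/subvol-130-disk-0 | tail -n 5
+```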