Backup is Triggered with same backupNumber #962

pniederlag · 2024-01-26T10:54:55Z

Due to the size of the backup we only trigger a backup every two hours (interval: 7200).

For some - yet unknown - reasons in some cases the backup is being triggered twice in a very short time. Our backup script by now detects that problem and will exit (without error).

Find an excerpt from the logs below.

backups 1284 and 1287 did work according to expectations.
backups 1285, 1286, 1288 und 1289 have been run twice

We have already slightly adapted the backup and restore scripts in order to trace the issues. A crucial part might be the proper way of emitting and handling signals in the script.

grep -i backup /tmp/jenkins-operator.log | grep -E 'Performing|Skipping backup'
2024-01-25T23:06:03.961Z	INFO	controller-jenkins	Performing backup '1284'	{"cr": "prod"}
2024-01-25T23:06:55.151Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-25T23:35:22.977Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-25T23:35:23.758Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T00:38:18.545Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T00:38:19.286Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T01:06:03.918Z	INFO	controller-jenkins	Performing backup '1285'	{"cr": "prod"}
2024-01-26T01:06:55.557Z	INFO	controller-jenkins	Performing backup '1285'	{"cr": "prod"}
2024-01-26T01:06:56.787Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T01:06:57.705Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T01:35:37.510Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T01:35:38.172Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T02:32:46.358Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T02:32:47.077Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T02:33:58.747Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T02:33:59.429Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T03:06:03.917Z	INFO	controller-jenkins	Performing backup '1286'	{"cr": "prod"}
2024-01-26T03:06:55.142Z	INFO	controller-jenkins	Performing backup '1286'	{"cr": "prod"}
2024-01-26T03:06:56.532Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T03:06:57.253Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T03:32:40.676Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T03:32:41.379Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T04:29:58.425Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T04:29:59.113Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T05:06:05.006Z	INFO	controller-jenkins	Performing backup '1287'	{"cr": "prod"}
2024-01-26T05:06:56.149Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T05:06:56.870Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T05:30:03.700Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T05:30:04.440Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T05:31:51.007Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T05:31:52.746Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T06:22:14.584Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T06:22:16.399Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T06:28:27.716Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T06:28:28.687Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T07:06:03.963Z	INFO	controller-jenkins	Performing backup '1288'	{"cr": "prod"}
2024-01-26T07:06:55.935Z	INFO	controller-jenkins	Performing backup '1288'	{"cr": "prod"}
2024-01-26T07:06:57.606Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T07:06:58.476Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T07:23:46.648Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T07:25:58.041Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T07:25:58.733Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T08:23:19.967Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T08:23:20.914Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T09:03:23.164Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T09:03:23.898Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T09:06:04.003Z	INFO	controller-jenkins	Performing backup '1289'	{"cr": "prod"}
2024-01-26T09:06:55.305Z	INFO	controller-jenkins	Performing backup '1289'	{"cr": "prod"}
2024-01-26T09:06:56.471Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T09:06:57.307Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T09:25:09.829Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T09:25:10.578Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T09:39:58.047Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T09:39:58.783Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}

used backup script


  backup.sh: |
    #!/usr/bin/env bash

    set -eo pipefail
    #set -xv

    [[ ! $# -eq 1 ]] && echo "Usage: $0 backup_number" && exit 1;
    [[ -z "${BACKUP_DIR}" ]] && echo "Required 'BACKUP_DIR' env not set" && exit 1;
    [[ -z "${JENKINS_HOME}" ]] && echo "Required 'JENKINS_HOME' env not set" && exit 1;

    backup_number=$1
    echo "Running backup"

    LOCK_FILE="${BACKUP_DIR}/.backup-running"
    # debug logging to check for weird errors START
    echo "----------------------------------------" >> ${BACKUP_DIR}/debug.log
    echo "$(date +'%F %H:%M:%S %Z') Custom backup ${backup_number} START" >> ${BACKUP_DIR}/debug.log

    # 1. check: test LOCK_FILE to prevent concurrent backups
    if [[ -f "${LOCK_FILE}" ]]; then
      echo "$(date +'%F %H:%M:%S %Z') lock ${LOCK_FILE} exists, SHOULD NOT HAPPEN" >> ${BACKUP_DIR}/debug.log
      # ignore if older than one day
      if [ "$(find ${LOCK_FILE} -mtime +1)" ]; then
        echo "$(date +'%F %H:%M:%S %Z') ignoring ${LOCK_FILE} as its more than one day old" >> ${BACKUP_DIR}/debug.log
      else
        echo "$(date +'%F %H:%M:%S %Z') skipping backup because it is locked by ${LOCK_FILE}, SHOULD NOT HAPPEN" >> ${BACKUP_DIR}/debug.log
        exit 0
      fi
    fi
    
    BACKUP_TMP_DIR=$(mktemp -d)
    # as we now work via tmp-dir it is very unlikely the backup already exists in final destination
    # yet we keep this check in here as an additional safety check
    # 2. check: this backup number already finished?
    if [[ -e ${BACKUP_DIR}/${backup_number}.tar.gz ]]; then
      echo "${backup_number} already existing, SHOULD NOT HAPPEN" >> ${BACKUP_DIR}/debug.log
      echo "$(date +'%F %H:%M:%S %Z') Custom backup ${backup_number} FAIL/EXISTS" >> ${BACKUP_DIR}/debug.log
      exit 0
    fi
    # debug logging to check for weird errors END

    # Start backup by setting the lock-file, prevent concurrent executions
    touch ${LOCK_FILE}
    # the user of the jenkins-operator will create a fresh/new directory on each installation
    # that's the reason we keep the directory of this special user out of the backup
    # we must catch exit code 1, as tar will exit with status 1 when files have been changed during backup
    tar -C ${JENKINS_HOME} --ignore-failed-read --no-wildcards-match-slash --anchored --exclude=jobs/seed-job-post-init --exclude=jobs/*/workspace* --exclude=jobs/*/config.xml --exclude=jobs/*/state.xml --exclude=users/jenkinsoperato_* --exclude=users/jenkinsdev_* --exclude=jobs/**/builds/[0-9]*/libs -czf "${BACKUP_TMP_DIR}/${backup_number}.tar.gz" jobs users userContent ||  [[ $? -eq 1 ]]
    cp ${BACKUP_TMP_DIR}/${backup_number}.tar.gz ${BACKUP_DIR}/${backup_number}.tar.gz

    # Cleanup of tmp dir and lock-file
    rm -rf ${BACKUP_TMP_DIR}
    rm -f ${LOCK_FILE}

    if [[ ! -s ${BACKUP_DIR}/${backup_number}.tar.gz ]]; then
      echo "backup file '${BACKUP_DIR}/${backup_number}.tar.gz' is empty"
      echo "$(date +'%F %H:%M:%S %Z') Custom backup ${backup_number} FAIL/EMPTY" >> ${BACKUP_DIR}/debug.log
      exit 1
    fi

    echo "$(date +'%F %H:%M:%S %Z') Custom backup ${backup_number} DONE" >> ${BACKUP_DIR}/debug.log

    echo "Done"
    exit 0

The text was updated successfully, but these errors were encountered:

pniederlag · 2024-02-01T13:41:35Z

Same thing is true for restoring the backup... btw... we see two consecutive runs

2024-02-01T13:37:48.329Z	INFO	controller-jenkins	Restoring backup '1386'	{"cr": "test"}
2024-02-01T13:38:43.901Z	INFO	controller-jenkins	Restoring backup '1386'	{"cr": "test"}

Why is this message triggered twice? and why does reconcilliation continue without waiting for the restore to finish?

We have confirmed on the pod/container that there are two instances of restore.sh running... which seems not the way it should work

brokenpip3 · 2024-02-04T10:49:00Z

We have confirmed on the pod/container that there are two instances of restore.sh running... which seems not the way it should work

this is odd, can you send me the ps -ef inside the backup container? I never saw this issue with few seconds between 2 backups.
Can you also send the jenkins crd? remove the sensitive parts, only the backup config

pniederlag · 2024-02-05T10:59:01Z

Thx for taking a look into it... Here is the output of 'ps -ef'.

root@jenkins-test:/home/user/bin# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 10:39 ?        00:00:00 /bin/sh -c ./run.sh
root           7       1  0 10:39 ?        00:00:00 bash ./run.sh
root           9       7  0 10:39 ?        00:00:00 sleep 7200
root         148       0  0 10:42 ?        00:00:00 bash /home/user/bin/restore.sh 1433
root         162     148  0 10:42 ?        00:00:00 sleep 900
root         302       0  0 10:43 ?        00:00:00 bash /home/user/bin/restore.sh 1433
root         316     302  0 10:43 ?        00:00:00 sleep 900
root         332       0  0 10:49 pts/0    00:00:00 sh -c clear; (bash || ash || sh)
root         339     332  0 10:49 pts/0    00:00:00 sh -c clear; (bash || ash || sh)
root         340     339  0 10:49 pts/0    00:00:00 bash
root         341     340  0 10:49 pts/0    00:00:00 ps -ef

and the crd

jenkins-test-yaml.txt

scripts are adapted and mounted from our configmap.

It might be worth mentioning that we have PROD and TEST running in the very same cluster on different namespaces... (operator is duplicated as well and each one only watching its own namespace)

brokenpip3 · 2024-04-16T10:32:10Z

This is clearly a new instance of the operator that is trying to restore the same backup, we may need to add a more safe mechanism that will avoid the operator to run again the restore command and the backup to avoid to start if another is running.
Will do somewhere in the near future

DionJones615 · 2024-06-20T17:33:03Z

We've also seen duplicated restores, but in this case it's every time on every instance and they appear to be consecutive, not concurrent.

This PR fixed it for us.
Unsure if this is related, but worth a look. #1021

After giving it a second look, I think the backup issue occurs if jenkins-operator restarts. Similar to the restore issue I discovered, the .Update() function can fail if the client is not defined yet. Similar to my fix for Restore, I added the same .Get() for Backup.

brokenpip3 · 2024-06-22T08:35:23Z

Thanks a lot for the restore fix!
However for the double backup/restore it's more complicated than this. The problem is that with the current logic one of the first action will be to run the backup script in jenkins with a simple kube exec into the pod. With this logic if the operator will crash/respawn at the same time after the restart it will be not able to understand if there is a backup process that is still in running in the jenkins pod.
This could be fixed at 2 level:

Operator logic, we can note the backup pid in the crd status and check for it each time before updating the backup status/running a new backup
Backup script logic: we can use a semaphore of any type (file for instance) to avoid run another backup and instead wait until the old one will end or goes to error etc (it will be important to add this part to the bash script trap). Finally after waiting x times for y seconds it will goes in error (to avoid waiting forever)

Honestly based on the operator code I'm more incline to add this logic to the backup script, since the original authors of this operator make the operator to be agnostic of the type and logic of the backup.
It should be not so complex, I can try in the following days

brokenpip3 · 2024-06-22T15:28:36Z

I started the work on this PR

you can test it with this temporary image: quay.io/jenkins-kubernetes-operator/backup-pvc:7a4cbf98

brokenpip3 · 2024-06-23T10:45:31Z

Ok I think that now is stable to test: quay.io/jenkins-kubernetes-operator/backup-pvc:5f5c8e17 any feedback will be appreciated, I will add some docs before merging the PR.

brokenpip3 · 2024-07-07T15:56:20Z

The new 0.8.1 should fix this issue, let me know if it's not like that, drop a comment and I will re-open the issue.

pniederlag added the bug Something isn't working label Jan 26, 2024

github-actions bot added the stale label Apr 6, 2024

brokenpip3 added not-stale and removed stale labels Apr 16, 2024

brokenpip3 mentioned this issue Jun 22, 2024

fix(backup): Duplicated backups & restores #1021

Merged

3 tasks

brokenpip3 added the release-pending Issue that has been fixed in main not yet released label Jul 2, 2024

brokenpip3 closed this as completed Jul 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Backup is Triggered with same backupNumber #962

Backup is Triggered with same backupNumber #962

pniederlag commented Jan 26, 2024

pniederlag commented Feb 1, 2024 •

edited

Loading

brokenpip3 commented Feb 4, 2024

pniederlag commented Feb 5, 2024 •

edited

Loading

brokenpip3 commented Apr 16, 2024

DionJones615 commented Jun 20, 2024 •

edited

Loading

brokenpip3 commented Jun 22, 2024 •

edited

Loading

brokenpip3 commented Jun 22, 2024 •

edited

Loading

brokenpip3 commented Jun 23, 2024

brokenpip3 commented Jul 7, 2024

Backup is Triggered with same backupNumber #962

Backup is Triggered with same backupNumber #962

Comments

pniederlag commented Jan 26, 2024

pniederlag commented Feb 1, 2024 • edited Loading

brokenpip3 commented Feb 4, 2024

pniederlag commented Feb 5, 2024 • edited Loading

brokenpip3 commented Apr 16, 2024

DionJones615 commented Jun 20, 2024 • edited Loading

brokenpip3 commented Jun 22, 2024 • edited Loading

brokenpip3 commented Jun 22, 2024 • edited Loading

brokenpip3 commented Jun 23, 2024

brokenpip3 commented Jul 7, 2024

pniederlag commented Feb 1, 2024 •

edited

Loading

pniederlag commented Feb 5, 2024 •

edited

Loading

DionJones615 commented Jun 20, 2024 •

edited

Loading

brokenpip3 commented Jun 22, 2024 •

edited

Loading

brokenpip3 commented Jun 22, 2024 •

edited

Loading