Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backup is Triggered with same backupNumber #962

Closed
pniederlag opened this issue Jan 26, 2024 · 9 comments
Closed

Backup is Triggered with same backupNumber #962

pniederlag opened this issue Jan 26, 2024 · 9 comments
Labels
bug Something isn't working not-stale release-pending Issue that has been fixed in main not yet released

Comments

@pniederlag
Copy link

Due to the size of the backup we only trigger a backup every two hours (interval: 7200).

For some - yet unknown - reasons in some cases the backup is being triggered twice in a very short time. Our backup script by now detects that problem and will exit (without error).

Find an excerpt from the logs below.

backups 1284 and 1287 did work according to expectations.
backups 1285, 1286, 1288 und 1289 have been run twice

We have already slightly adapted the backup and restore scripts in order to trace the issues. A crucial part might be the proper way of emitting and handling signals in the script.

grep -i backup /tmp/jenkins-operator.log | grep -E 'Performing|Skipping backup'
2024-01-25T23:06:03.961Z	INFO	controller-jenkins	Performing backup '1284'	{"cr": "prod"}
2024-01-25T23:06:55.151Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-25T23:35:22.977Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-25T23:35:23.758Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T00:38:18.545Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T00:38:19.286Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T01:06:03.918Z	INFO	controller-jenkins	Performing backup '1285'	{"cr": "prod"}
2024-01-26T01:06:55.557Z	INFO	controller-jenkins	Performing backup '1285'	{"cr": "prod"}
2024-01-26T01:06:56.787Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T01:06:57.705Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T01:35:37.510Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T01:35:38.172Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T02:32:46.358Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T02:32:47.077Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T02:33:58.747Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T02:33:59.429Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T03:06:03.917Z	INFO	controller-jenkins	Performing backup '1286'	{"cr": "prod"}
2024-01-26T03:06:55.142Z	INFO	controller-jenkins	Performing backup '1286'	{"cr": "prod"}
2024-01-26T03:06:56.532Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T03:06:57.253Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T03:32:40.676Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T03:32:41.379Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T04:29:58.425Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T04:29:59.113Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T05:06:05.006Z	INFO	controller-jenkins	Performing backup '1287'	{"cr": "prod"}
2024-01-26T05:06:56.149Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T05:06:56.870Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T05:30:03.700Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T05:30:04.440Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T05:31:51.007Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T05:31:52.746Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T06:22:14.584Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T06:22:16.399Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T06:28:27.716Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T06:28:28.687Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T07:06:03.963Z	INFO	controller-jenkins	Performing backup '1288'	{"cr": "prod"}
2024-01-26T07:06:55.935Z	INFO	controller-jenkins	Performing backup '1288'	{"cr": "prod"}
2024-01-26T07:06:57.606Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T07:06:58.476Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T07:23:46.648Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T07:25:58.041Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T07:25:58.733Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T08:23:19.967Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T08:23:20.914Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T09:03:23.164Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T09:03:23.898Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T09:06:04.003Z	INFO	controller-jenkins	Performing backup '1289'	{"cr": "prod"}
2024-01-26T09:06:55.305Z	INFO	controller-jenkins	Performing backup '1289'	{"cr": "prod"}
2024-01-26T09:06:56.471Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T09:06:57.307Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T09:25:09.829Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T09:25:10.578Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T09:39:58.047Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}
2024-01-26T09:39:58.783Z	DEBUG	controller-jenkins	Skipping backup	{"cr": "prod"}

used backup script


  backup.sh: |
    #!/usr/bin/env bash

    set -eo pipefail
    #set -xv

    [[ ! $# -eq 1 ]] && echo "Usage: $0 backup_number" && exit 1;
    [[ -z "${BACKUP_DIR}" ]] && echo "Required 'BACKUP_DIR' env not set" && exit 1;
    [[ -z "${JENKINS_HOME}" ]] && echo "Required 'JENKINS_HOME' env not set" && exit 1;

    backup_number=$1
    echo "Running backup"

    LOCK_FILE="${BACKUP_DIR}/.backup-running"
    # debug logging to check for weird errors START
    echo "----------------------------------------" >> ${BACKUP_DIR}/debug.log
    echo "$(date +'%F %H:%M:%S %Z') Custom backup ${backup_number} START" >> ${BACKUP_DIR}/debug.log

    # 1. check: test LOCK_FILE to prevent concurrent backups
    if [[ -f "${LOCK_FILE}" ]]; then
      echo "$(date +'%F %H:%M:%S %Z') lock ${LOCK_FILE} exists, SHOULD NOT HAPPEN" >> ${BACKUP_DIR}/debug.log
      # ignore if older than one day
      if [ "$(find ${LOCK_FILE} -mtime +1)" ]; then
        echo "$(date +'%F %H:%M:%S %Z') ignoring ${LOCK_FILE} as its more than one day old" >> ${BACKUP_DIR}/debug.log
      else
        echo "$(date +'%F %H:%M:%S %Z') skipping backup because it is locked by ${LOCK_FILE}, SHOULD NOT HAPPEN" >> ${BACKUP_DIR}/debug.log
        exit 0
      fi
    fi
    
    BACKUP_TMP_DIR=$(mktemp -d)
    # as we now work via tmp-dir it is very unlikely the backup already exists in final destination
    # yet we keep this check in here as an additional safety check
    # 2. check: this backup number already finished?
    if [[ -e ${BACKUP_DIR}/${backup_number}.tar.gz ]]; then
      echo "${backup_number} already existing, SHOULD NOT HAPPEN" >> ${BACKUP_DIR}/debug.log
      echo "$(date +'%F %H:%M:%S %Z') Custom backup ${backup_number} FAIL/EXISTS" >> ${BACKUP_DIR}/debug.log
      exit 0
    fi
    # debug logging to check for weird errors END

    # Start backup by setting the lock-file, prevent concurrent executions
    touch ${LOCK_FILE}
    # the user of the jenkins-operator will create a fresh/new directory on each installation
    # that's the reason we keep the directory of this special user out of the backup
    # we must catch exit code 1, as tar will exit with status 1 when files have been changed during backup
    tar -C ${JENKINS_HOME} --ignore-failed-read --no-wildcards-match-slash --anchored --exclude=jobs/seed-job-post-init --exclude=jobs/*/workspace* --exclude=jobs/*/config.xml --exclude=jobs/*/state.xml --exclude=users/jenkinsoperato_* --exclude=users/jenkinsdev_* --exclude=jobs/**/builds/[0-9]*/libs -czf "${BACKUP_TMP_DIR}/${backup_number}.tar.gz" jobs users userContent ||  [[ $? -eq 1 ]]
    cp ${BACKUP_TMP_DIR}/${backup_number}.tar.gz ${BACKUP_DIR}/${backup_number}.tar.gz

    # Cleanup of tmp dir and lock-file
    rm -rf ${BACKUP_TMP_DIR}
    rm -f ${LOCK_FILE}

    if [[ ! -s ${BACKUP_DIR}/${backup_number}.tar.gz ]]; then
      echo "backup file '${BACKUP_DIR}/${backup_number}.tar.gz' is empty"
      echo "$(date +'%F %H:%M:%S %Z') Custom backup ${backup_number} FAIL/EMPTY" >> ${BACKUP_DIR}/debug.log
      exit 1
    fi

    echo "$(date +'%F %H:%M:%S %Z') Custom backup ${backup_number} DONE" >> ${BACKUP_DIR}/debug.log

    echo "Done"
    exit 0
@pniederlag pniederlag added the bug Something isn't working label Jan 26, 2024
@pniederlag
Copy link
Author

pniederlag commented Feb 1, 2024

Same thing is true for restoring the backup... btw... we see two consecutive runs

2024-02-01T13:37:48.329Z	INFO	controller-jenkins	Restoring backup '1386'	{"cr": "test"}
2024-02-01T13:38:43.901Z	INFO	controller-jenkins	Restoring backup '1386'	{"cr": "test"}

Why is this message triggered twice? and why does reconcilliation continue without waiting for the restore to finish?

We have confirmed on the pod/container that there are two instances of restore.sh running... which seems not the way it should work

@brokenpip3
Copy link
Collaborator

We have confirmed on the pod/container that there are two instances of restore.sh running... which seems not the way it should work

this is odd, can you send me the ps -ef inside the backup container? I never saw this issue with few seconds between 2 backups.
Can you also send the jenkins crd? remove the sensitive parts, only the backup config

@pniederlag
Copy link
Author

pniederlag commented Feb 5, 2024

Thx for taking a look into it... Here is the output of 'ps -ef'.

root@jenkins-test:/home/user/bin# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 10:39 ?        00:00:00 /bin/sh -c ./run.sh
root           7       1  0 10:39 ?        00:00:00 bash ./run.sh
root           9       7  0 10:39 ?        00:00:00 sleep 7200
root         148       0  0 10:42 ?        00:00:00 bash /home/user/bin/restore.sh 1433
root         162     148  0 10:42 ?        00:00:00 sleep 900
root         302       0  0 10:43 ?        00:00:00 bash /home/user/bin/restore.sh 1433
root         316     302  0 10:43 ?        00:00:00 sleep 900
root         332       0  0 10:49 pts/0    00:00:00 sh -c clear; (bash || ash || sh)
root         339     332  0 10:49 pts/0    00:00:00 sh -c clear; (bash || ash || sh)
root         340     339  0 10:49 pts/0    00:00:00 bash
root         341     340  0 10:49 pts/0    00:00:00 ps -ef

and the crd

jenkins-test-yaml.txt

scripts are adapted and mounted from our configmap.

It might be worth mentioning that we have PROD and TEST running in the very same cluster on different namespaces... (operator is duplicated as well and each one only watching its own namespace)

@github-actions github-actions bot added the stale label Apr 6, 2024
@brokenpip3
Copy link
Collaborator

This is clearly a new instance of the operator that is trying to restore the same backup, we may need to add a more safe mechanism that will avoid the operator to run again the restore command and the backup to avoid to start if another is running.
Will do somewhere in the near future

@DionJones615
Copy link

DionJones615 commented Jun 20, 2024

We've also seen duplicated restores, but in this case it's every time on every instance and they appear to be consecutive, not concurrent.

This PR fixed it for us.
Unsure if this is related, but worth a look. #1021

After giving it a second look, I think the backup issue occurs if jenkins-operator restarts. Similar to the restore issue I discovered, the .Update() function can fail if the client is not defined yet. Similar to my fix for Restore, I added the same .Get() for Backup.

@brokenpip3
Copy link
Collaborator

brokenpip3 commented Jun 22, 2024

Thanks a lot for the restore fix!
However for the double backup/restore it's more complicated than this. The problem is that with the current logic one of the first action will be to run the backup script in jenkins with a simple kube exec into the pod. With this logic if the operator will crash/respawn at the same time after the restart it will be not able to understand if there is a backup process that is still in running in the jenkins pod.
This could be fixed at 2 level:

  • Operator logic, we can note the backup pid in the crd status and check for it each time before updating the backup status/running a new backup
  • Backup script logic: we can use a semaphore of any type (file for instance) to avoid run another backup and instead wait until the old one will end or goes to error etc (it will be important to add this part to the bash script trap). Finally after waiting x times for y seconds it will goes in error (to avoid waiting forever)

Honestly based on the operator code I'm more incline to add this logic to the backup script, since the original authors of this operator make the operator to be agnostic of the type and logic of the backup.
It should be not so complex, I can try in the following days

@brokenpip3
Copy link
Collaborator

brokenpip3 commented Jun 22, 2024

I started the work on this PR

you can test it with this temporary image: quay.io/jenkins-kubernetes-operator/backup-pvc:7a4cbf98

@brokenpip3
Copy link
Collaborator

Ok I think that now is stable to test: quay.io/jenkins-kubernetes-operator/backup-pvc:5f5c8e17 any feedback will be appreciated, I will add some docs before merging the PR.

@brokenpip3 brokenpip3 added the release-pending Issue that has been fixed in main not yet released label Jul 2, 2024
@brokenpip3
Copy link
Collaborator

The new 0.8.1 should fix this issue, let me know if it's not like that, drop a comment and I will re-open the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working not-stale release-pending Issue that has been fixed in main not yet released
Projects
None yet
Development

No branches or pull requests

3 participants