In general, be careful with slurmdbd: consider running it in the foreground during the upgrade so you can monitor progress. See http://slurm.schedmd.com/quickstart_admin.html#upgrade
- Schedule a service break and make sure no jobs are running
- Stop slurmdbd
- Remove all slurm and munge packages: yum remove 'slurm*' 'munge*'
- Install slurm server packages: yum install ohpc-slurm-server
- Start slurmdbd in the foreground: /sbin/slurmdbd -D -v
- Wait until the DB upgrade is completed (can take up to 45 mins)
- Stop the slurmdbd running in the foreground: Ctrl-C
- Start slurmdbd via systemd: systemctl start slurmdbd
- In ansible group_vars, set slurm_ohpc to "ohpc", run ansible-playbook install.yml --tags=fgci-install
- systemctl stop slurmctld
- Remove all slurm and munge packages: yum remove 'slurm*' 'munge*'
- Install slurm server packages: yum install ohpc-slurm-server
- systemctl start slurmctld
- On compute nodes: systemctl stop slurmd
- Remove all slurm and munge packages: yum -y remove 'slurm*' 'munge*'
- Install slurm client packages: yum -y install ohpc-slurm-client
- On compute nodes: systemctl start slurmd
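The slurmdbd part of the list above could be scripted roughly as below. This is only a sketch: the `run` helper, the `DRY_RUN` variable and the function names are made up for illustration, and by default the commands are only printed, not executed. It also assumes the remove step is meant to use the glob 'slurm*'. The foreground run of slurmdbd is left as a manual step on purpose, so that progress can be watched.

```shell
#!/usr/bin/env bash
# Sketch only: DRY_RUN, run(), running_jobs() and swap_slurmdbd_packages()
# are illustrative names, not part of slurm or OpenHPC.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Count running jobs; the service break should start only when this prints 0.
running_jobs() {
    local n
    n=$(squeue -h -t RUNNING | wc -l)
    echo $((n))
}

# Stop slurmdbd and swap the packages (assumes the glob 'slurm*' is intended).
swap_slurmdbd_packages() {
    run systemctl stop slurmdbd
    run yum remove 'slurm*' 'munge*'
    run yum install ohpc-slurm-server
}
```

Call `running_jobs` first, then `swap_slurmdbd_packages`; set `DRY_RUN=0` only once the printed commands look right, and then start `/sbin/slurmdbd -D -v` by hand as described in the list.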
To upgrade to a newer slurm OHPC version:
- Do steps 0-1 from the list above (steps are counted from 0: service break and stopping slurmdbd)
- Delete all slurm and munge entries from /etc/yum/pluginconf.d/versionlock.list
- In group_vars set slurm_ohpc_versionlock to False
- ansible-playbook install.yml --tags=fgci-install
- yum update
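- Do steps 4-7 from the list above (starting slurmdbd in the foreground through starting it via systemd).
- Upgrade all the nodes per steps 9-16 (stopping slurmctld through starting slurmd), but run "yum update" instead of remove+install.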
- In group_vars set slurm_ohpc_versionlock to True and run this role to lock the version again.
Mostly the steps are done outside ansible, but it helps a bit.
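Removing the versionlock entries can be done by hand in an editor. As a rough sketch, a helper like the one below would do the same thing. The function name is hypothetical, and it assumes every line mentioning slurm or munge in the list should go; a `.bak` copy of the file is kept.

```shell
# Sketch only: clear_slurm_versionlock is an illustrative name.
# Deletes every line containing "slurm" or "munge" from a versionlock list,
# leaving the original next to it with a .bak suffix.
clear_slurm_versionlock() {
    local list="${1:-/etc/yum/pluginconf.d/versionlock.list}"
    sed -i.bak -E '/(slurm|munge)/d' "$list"
}
```

Run it as root with no argument to edit the default path, or pass a copy of the file first to check which lines would be removed.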
Slurm 16.05 details: http://slurm.schedmd.com/SLUG16/V16.05.pdf
Official upgrade documentation: http://slurm.schedmd.com/quickstart_admin.html#upgrade
This guide is not a replacement for the official instructions.
It assumes you are using the https://github.com/CSCfi/fgci-ansible playbooks, where install.yml targets the service node and compute.yml targets the compute nodes. It also assumes you are using the FGCI yum repo to fetch slurm packages.
- stop slurmdbd
- take backup with something like: /usr/local/sbin/dump-all-databases.sh -o /outdir -z
- in group_vars/all, set the ansible variable fgci_slurmrepo_version to "fgcislurm1605" (this will point the yum.repos/fgislurm.repo to the 1605 repo)
- run slurm role until task "Add FGI slurm repo" on install node
- ansible-playbook install.yml -t slurm --step
- answer Y for the setup task and the "Add FGI slurm repo" task only, N for the rest. Quit with ctrl+C after the "Add FGI slurm repo" task is done
- yum update
- upgrade the schema by running slurmdbd in the foreground: slurmdbd -D
- ctrl+C when it says started
- systemctl daemon-reload
- systemctl start slurmdbd
- systemctl restart slurmctld
- start slurmdbd
- bash tools/pullReqs.sh (to rsync group_vars to the internal web server for ansible-pull)
- ansible-playbook compute.yml -t slurm
- run yum update to update slurm
- pdsh -g compute -l root yum -y update
- optionally systemctl daemon-reload
- systemctl restart slurmd
- run the slurm role on the submit hosts (grid and login)
- ansible-playbook site.yml -t slurm -l login,grid
- yum update on login and grid node
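Once everything has been updated, it may be worth confirming that no node was left on the old release. The sketch below is one way to do that. It assumes pdsh prefixes each output line with "host: " and that `slurmd -V` prints a line of the form "slurm 16.05.4"; the function name `stale_nodes` is made up for illustration.

```shell
# Sketch only: stale_nodes is an illustrative name.
# Reads lines such as "node1: slurm 16.05.4" on stdin and prints the hosts
# whose version does not start with the expected release (e.g. 16.05).
stale_nodes() {
    local expected="$1"
    awk -v want="$expected" '
        $2 == "slurm" && index($3, want ".") != 1 {
            host = $1
            sub(/:$/, "", host)   # drop the trailing colon added by pdsh
            print host
        }'
}

# Example use on the install node (not executed here):
#   pdsh -g compute -l root slurmd -V | stale_nodes 16.05
```

An empty result means every node that answered reports the expected release; nodes that did not answer at all are not detected by this check.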