In general, be careful with slurmdbd: it is probably best to run it in the foreground during the upgrade so that you can monitor progress.
0. Service break, making sure no jobs are running
1. Stop slurmdbd
2. Remove all slurm and munge packages: yum remove 'slurm' 'munge*'
3. Install slurm server packages: yum install ohpc-slurm-server
4. Start slurmdbd in the foreground: /sbin/slurmdbd -D -v
5. Wait until the DB upgrade is completed (can take up to 45 mins)
6. Stop the slurmdbd running in the foreground: Ctrl-C
7. Start slurmdbd via systemd: systemctl start slurmdbd
8. In ansible group_vars, set slurm_ohpc to "ohpc", run ansible-playbook install.yml --tags=fgci-install
9. systemctl stop slurmctld
10. Remove all slurm and munge packages: yum remove 'slurm' 'munge*'
11. Install slurm server packages: yum install ohpc-slurm-server
12. systemctl start slurmctld
13. On compute nodes: systemctl stop slurmd
14. Remove all slurm and munge packages: yum -y remove 'slurm' 'munge*'
15. Install slurm client packages: yum -y install ohpc-slurm-client
16. On compute nodes: systemctl start slurmd
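The first step (making sure no jobs are running) can be checked before anything is stopped. The following is a minimal sketch, not part of this role. It assumes that `squeue` is available on the service node and that "no jobs" means no jobs in the RUNNING state. The function names `count_jobs` and `preflight` are made up for this example.

```shell
#!/bin/sh
# Hypothetical helpers, not part of this role.

# Count non-empty lines on stdin; `squeue -h` prints one job per line.
count_jobs() {
    grep -c . || true    # grep exits 1 on zero matches but still prints 0
}

# Read job lines on stdin and report whether the upgrade may proceed.
preflight() {
    n=$(count_jobs)
    if [ "$n" -eq 0 ]; then
        echo "ok: no jobs running"
    else
        echo "blocked: $n jobs still running"
        return 1
    fi
}
```

A possible use would be `squeue -h -t RUNNING | preflight && systemctl stop slurmdbd`, so that slurmdbd is only stopped when the queue is empty.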
To upgrade to a newer slurm OHPC version:
- Do steps 0-1 from the list above
- Delete all the slurm and munge versionlock entries from /etc/yum/pluginconf.d/versionlock.list
- In group_vars set slurm_ohpc_versionlock to False
- ansible-playbook install.yml --tags=fgci-install
- yum update
- Do steps 4-7 from the above list.
- Upgrade all the nodes per steps 9-16 except run "yum update" instead of remove+install.
- In group_vars set slurm_ohpc_versionlock to True and run this role to lock the version again.
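Removing the slurm and munge entries from the versionlock list can be done with a text filter. This is only a sketch: it assumes that every lock entry to drop contains "slurm" or "munge" in its package name, and the function name `strip_slurm_locks` is made up for this example. It prints the filtered list instead of editing the file in place, so the result can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print a versionlock list without slurm/munge entries.
# Assumption: every entry to drop contains "slurm" or "munge" in its name.
strip_slurm_locks() {
    grep -Ev 'slurm|munge' "$1" || true   # exit 1 only means no lines were left
}
```

For example, `strip_slurm_locks /etc/yum/pluginconf.d/versionlock.list > /tmp/versionlock.new` writes a candidate file that can be compared with the original before replacing it.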
Most of these steps are done outside ansible, but the role helps a bit.
Slurm 16.05 details:
This guide is not a replacement for the official Slurm upgrade documentation.
It assumes you are using the playbooks where install.yml targets the service node and compute.yml targets the compute nodes. It also assumes you are using the FGCI yum repo to fetch slurm packages.
- stop slurmdbd
- take backup with something like: /usr/local/sbin/ -o /outdir -z
- set the ansible variable fgci_slurmrepo_version to "fgcislurm1605" in group_vars/all (this will point the yum.repos/fgislurm.repo to the 1605 repo)
- run the slurm role up to the task "Add FGI slurm repo" on the install node
  - ansible-playbook install.yml -t slurm --step
  - answer Y on the setup task and the "Add FGI slurm repo" task only, N on the rest. Quit with ctrl+C after the "Add FGI slurm repo" task is done
- yum update
- schema upgrade: slurmdbd -D
  - ctrl+C when it says started
- systemctl daemon-reload
- systemctl start slurmdbd
- systemctl restart slurmctld
- start slurmdbd
- bash tools/ (to rsync group_vars to the internal web server for ansible-pull)
- ansible-playbook compute.yml -t slurm
- run yum update on the compute nodes to update slurm
  - pdsh -g compute -l root yum -y update
  - optionally systemctl daemon-reload
  - systemctl restart slurmd
- run the slurm role on the submit hosts (grid and login)
  - ansible-playbook site.yml -t slurm -l login,grid
- yum update on login and grid node
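After the updates it can be useful to confirm that every node reports the new version. The sketch below is not part of this role. It assumes that `slurmd -V` prints a line of the form `slurm <version>` and that pdsh prefixes each output line with `<node>: `. The function name `nodes_not_at` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: list nodes whose reported version does not start with $1.
# Input on stdin: pdsh-style lines such as "c1: slurm 16.05.8" (assumed format).
nodes_not_at() {
    awk -v want="slurm $1" '{
        node = $1; sub(/:$/, "", node)        # strip the trailing colon
        ver = $0;  sub(/^[^:]*: */, "", ver)  # keep everything after "node: "
        if (index(ver, want) != 1) print node
    }'
}
```

A possible use would be `pdsh -g compute -l root 'slurmd -V' | nodes_not_at 16.05`, which prints nothing when all compute nodes are on a 16.05 release.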