This is more for info than anything else... plus it is some useful rubber ducking / help for anyone else encountering the same issue.
The run_ansible script exits with errors when building clusters on Oracle Linux. The issue is that the ansible package was updated in May 2022, and it now uses Python 3.8. Somehow the packaging of Ansible on Oracle Linux fails to include various important modules for that Python version, e.g. python-ldap and PyMySQL, amongst others. This means that run_ansible is unable to run and exits with an error. This is now caught by the finish script, which gives the impression that the cluster is still being built, and so installation appears to hang.
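For anyone wanting to confirm they are hitting the same thing, here is a minimal sketch of a check. The helper name check_py_modules is made up for illustration, and I am assuming the import names ldap (for python-ldap) and pymysql (for PyMySQL). It only reports whether a given interpreter can import each module; it does not fix anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of the cluster scripts): report which Python
# modules a given interpreter can import. Returns non-zero if any is missing.
check_py_modules() {
    interpreter="$1"
    shift
    missing=0
    for module in "$@"; do
        if "$interpreter" -c "import $module" >/dev/null 2>&1; then
            echo "ok: $module"
        else
            echo "MISSING: $module"
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (assumed interpreter names; adjust to what is installed):
#   check_py_modules python3.8 ldap pymysql
#   check_py_modules python3.6 ldap pymysql
```

Running `ansible --version` also prints the Python version that the installed ansible is using, which is a quick way to see whether you are on the 3.8 build.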
I tried downgrading ansible to the old version, but this appears to have disappeared from ol8-appstream, and my attempt to install from epel (by disabling ol8-appstream) failed because of unresolvable dependencies.
The solution I found was, as root, to yum remove ansible and then run pip3.6 install ansible. This installed ansible against python 3.6 (which has all of the required modules) and placed the result in /usr/local/bin. I then updated the path in the run_ansible script to /usr/local/bin/ansible-playbook, and this completed every stage but one. The security-updates role failed with the following error:
TASK [security_updates : Install security updates] *****************************
Wednesday 06 July 2022 16:53:26 +0000 (0:00:00.746) 0:05:19.368 ********
fatal: [mgmt.subnet.sharpphoenix.oraclevcn.com]: FAILED! => changed=true
cmd:
- dnf
- update
- -y
- --security
- --exclude
- kernel*
- --exclude
- slurm*
delta: '0:00:05.836997'
end: '2022-07-06 16:53:32.741248'
msg: non-zero return code
rc: 1
start: '2022-07-06 16:53:26.904251'
stderr: |-
Error:
Problem 1: package shim-x64-15.6-1.0.3.el8.x86_64 requires oracle(kernel-sig-key) >= 202204, but none of the providers can be installed
- cannot install the best update candidate for package shim-x64-15.3-1.0.3.x86_64
- package kernel-4.18.0-372.13.1.0.1.el8_6.x86_64 is filtered out by exclude filtering
- package kernel-uek-5.4.17-2136.307.3.6.el8uek.x86_64 is filtered out by exclude filtering
- package kernel-uek-5.4.17-2136.308.7.el8uek.x86_64 is filtered out by exclude filtering
- package kernel-uek-5.4.17-2136.308.9.el8uek.x86_64 is filtered out by exclude filtering
- package kernel-uek-debug-5.4.17-2136.307.3.6.el8uek.x86_64 is filtered out by exclude filtering
Problem 2: problem with installed package shim-x64-15.3-1.0.3.x86_64
- package grub2-efi-x64-1:2.02-123.0.4.el8_6.8.x86_64 conflicts with shim-x64 <= 15.3-1.0.3 provided by shim-x64-15.3-1.0.3.x86_64
- package shim-x64-15.6-1.0.3.el8.x86_64 requires oracle(kernel-sig-key) >= 202204, but none of the providers can be installed
- cannot install the best update candidate for package grub2-efi-x64-1:2.02-123.0.1.el8.x86_64
- package kernel-4.18.0-372.13.1.0.1.el8_6.x86_64 is filtered out by exclude filtering
- package kernel-uek-5.4.17-2136.307.3.6.el8uek.x86_64 is filtered out by exclude filtering
- package kernel-uek-5.4.17-2136.308.7.el8uek.x86_64 is filtered out by exclude filtering
- package kernel-uek-5.4.17-2136.308.9.el8uek.x86_64 is filtered out by exclude filtering
- package kernel-uek-debug-5.4.17-2136.307.3.6.el8uek.x86_64 is filtered out by exclude filtering
stderr_lines: <omitted>
stdout: |-
Last metadata expiration check: 0:02:59 ago on Wed 06 Jul 2022 04:50:29 PM GMT.
(try to add '--allowerasing' to command line to replace conflicting packages or '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
stdout_lines: <omitted>
I removed this role (it is just a security update..!) and re-ran...
This now progressed all the way to the end, and I am just waiting for packer to finish creating the image...
...and now it has all completed. The cluster appears to be working. I'll add more if I find any other issues.
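In case it helps, the ansible part of the workaround above can be sketched roughly as below. This is an illustration, not the exact commands I ran: the function names are made up, the -y flag is my addition, and I am assuming the script calls /usr/bin/ansible-playbook by its full path (check your copy of run_ansible, and pass its real location, since it may differ).

```shell
#!/bin/sh
# Sketch only. Run as root. Function names are hypothetical.

# Swap the packaged ansible (Python 3.8 build) for a pip install under 3.6.
reinstall_ansible_py36() {
    yum remove -y ansible      # -y added here to skip the confirmation prompt
    pip3.6 install ansible     # ends up in /usr/local/bin
}

# Point a script at the pip-installed ansible-playbook.
# ASSUMPTION: the script refers to /usr/bin/ansible-playbook explicitly.
patch_playbook_path() {
    script="$1"
    new_path="${2:-/usr/local/bin/ansible-playbook}"
    cp "$script" "$script.bak"   # keep a backup of the original
    sed "s#/usr/bin/ansible-playbook#${new_path}#g" "$script.bak" > "$script"
}

# Example usage (path to run_ansible is a placeholder):
#   reinstall_ansible_py36
#   patch_playbook_path /path/to/run_ansible
```

The functions are only defined, not called, so sourcing the file does not touch the system until you invoke them yourself.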
Just to add to this, booting the GPU node on a BM.GPU4.8 fails because Shape BM.GPU4.8 is not valid for image (error in /var/log/slurm/elastic.log). It seems that the base default Oracle 8 image is not set as being compatible with the bare-metal GPU4 nodes.
The fix I have tried is going into the Oracle console, finding the image, clicking "edit" and then adding GPU4.8 to the list of supported shapes for this image (I had already rebuilt the image with cuda drivers etc, although the error came up for the default CitC image created by packer at install time too).
(docs on how to see and change the compatible shapes for an image are here - search in page for "To view compatible shapes for a custom image")
There were warnings in the Oracle console that any incompatibilities between the image and shape could mean that the shape wouldn't boot. However, submitting a job does result in the node being provisioned, and a test job that ran nvidia-smi ran without issue via slurm and showed the output I would expect (all 8 GPUs visible and usable).
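I did this through the console, but the same change should be possible from the OCI CLI. The following is an untested sketch, assuming the CLI is installed and configured; the function name add_image_shape and the image OCID in the example are placeholders.

```shell
#!/bin/sh
# Hypothetical wrapper: mark a custom image as compatible with a shape.
# Untested here; the console route is what was actually used.
add_image_shape() {
    image_ocid="$1"
    shape="${2:-BM.GPU4.8}"
    oci compute image-shape-compatibility-entry add \
        --image-id "$image_ocid" \
        --shape-name "$shape"
}

# Example usage (placeholder OCID):
#   add_image_shape ocid1.image.oc1..example BM.GPU4.8
```

The same warning applies as in the console: marking a shape as compatible does not guarantee that the image will boot on it.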