Terraform script hangs at run_ansible. #38

Closed
p-1603 opened this issue Jun 10, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@p-1603

p-1603 commented Jun 10, 2021

Hello,

This problem may be related to issue #34.

I have created 3 CitC clusters on Oracle OCI over the last 24 hours by following the CitC docs. Each time, after running terraform apply oracle and SSHing into the management node, the finish script does not run and gives only the following error:

[citc@mgmt ~]$ finish
Error: The management node has not finished its setup
Please allow it to finish before continuing.
For information about why they have not finished, check the file /root/ansible-pull.log

The ansible-pull.log is as follows:

Thursday 10 June 2021  12:21:11 +0000 (0:00:00.080)       0:02:21.844 *********
===============================================================================
slurm : install Slurm ------------------------------------------------------ 14.67s
install ssh server ---------------------------------------------------------- 5.83s
mysql : install mariadb module ---------------------------------------------- 5.45s
ldap : enable 389-ds module ------------------------------------------------- 5.42s
ldap : install 389-ds ------------------------------------------------------- 5.29s
slurm : install munge ------------------------------------------------------- 5.10s
install python38 ------------------------------------------------------------ 5.09s
slurm : install ssh-keygen -------------------------------------------------- 4.88s
packer : Ensure unzip is installed. ----------------------------------------- 4.87s
filesystem : install nfs-utils ---------------------------------------------- 4.86s
slurm : install SELinux-policy ---------------------------------------------- 4.85s
slurm : install firewalld --------------------------------------------------- 4.85s
mysql : install mariadb ----------------------------------------------------- 4.84s
ldap : install nss-tools ---------------------------------------------------- 4.83s
mysql : install PyMySQL ----------------------------------------------------- 4.81s
security_updates : Install security updates --------------------------------- 4.80s
slurm : install python-firewall --------------------------------------------- 4.80s
slurm : install common tools ------------------------------------------------ 3.96s
slurm : install OCI tools --------------------------------------------------- 3.69s
Gathering Facts ------------------------------------------------------------- 1.20s
Playbook run took 0 days, 0 hours, 2 minutes, 21 seconds

Even after leaving the script overnight, no further progress was made.

I found that there was one failure in the ansible-pull.log prior to this:

TASK [security_updates : Install security updates] *********************************
Thursday 10 June 2021  12:21:06 +0000 (0:00:00.439)       0:02:16.960 *********
fatal: [mgmt.subnet.intimatecollie.oraclevcn.com]: FAILED! => changed=true
  cmd:
  - dnf
  - update
  - -y
  - --security
  - --exclude
  - kernel*
  delta: '0:00:04.545339'
  end: '2021-06-10 12:21:11.715756'
  msg: non-zero return code
  rc: 1
  start: '2021-06-10 12:21:07.170417'
  stderr: |-
    Error:
     Problem: problem with installed package slurm-libpmi-20.02.5-1.20.x86_64
      - package slurm-libpmi-20.02.5-1.20.x86_64 requires libslurmfull.so()(64bit), but none of the providers can be installed
      - package slurm-libpmi-20.02.5-1.20.x86_64 requires slurm(x86-64) = 20.02.5-1.20, but none of the providers can be installed
      - cannot install both slurm-20.11.7-3.el8.x86_64 and slurm-20.02.5-1.20.x86_64
      - cannot install both slurm-20.02.5-1.20.x86_64 and slurm-20.11.7-3.el8.x86_64
      - cannot install the best update candidate for package slurm-20.02.5-1.20.x86_64
  stderr_lines: <omitted>
  stdout: |-
    Last metadata expiration check: 2:12:54 ago on Thu Jun 10 10:08:14 2021.
    (try to add '--allowerasing' to command line to replace conflicting packages or '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
  stdout_lines: <omitted>
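
A failure like this is easy to miss, because the timing summary at the end of the log looks like a normal completion. As a rough sketch (not part of CitC), the log text can be scanned for `fatal:` lines and each one paired with the most recent `TASK [...]` header. The function name `failed_tasks` and the shortened sample are illustrative, and the line formats are assumed from the excerpts in this issue.

```python
import re

# Matches Ansible task headers such as "TASK [role : task name] ****"
TASK_RE = re.compile(r"^TASK \[(?P<name>[^\]]+)\]")
# Matches failure lines such as "fatal: [host]: FAILED! => ..."
FATAL_RE = re.compile(r"^fatal: \[(?P<host>[^\]]+)\]: FAILED!")


def failed_tasks(log_text):
    """Return (task, host) pairs for every 'fatal:' line in an Ansible log."""
    failures = []
    current_task = None
    for line in log_text.splitlines():
        header = TASK_RE.match(line)
        if header:
            current_task = header.group("name")
            continue
        fatal = FATAL_RE.match(line)
        if fatal:
            failures.append((current_task, fatal.group("host")))
    return failures


# Shortened sample based on the excerpt above
sample = """\
TASK [security_updates : Install security updates] ****
fatal: [mgmt.subnet.intimatecollie.oraclevcn.com]: FAILED! => changed=true
  rc: 1
"""
print(failed_tasks(sample))
```

In practice the text would come from reading /root/ansible-pull.log as root on the management node.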

If I comment out the - security_updates role in the management.yml in the citc-ansible checkout, and re-run run_ansible, then it runs until here:

PLAY [finalise] ********************************************************************
TASK [finalise : Wait for packer to finish] ****************************************
Thursday 10 June 2021  12:29:13 +0000 (0:00:00.908)       0:02:47.083 *********
FAILED - RETRYING: Wait for packer to finish (100 retries left).
FAILED - RETRYING: Wait for packer to finish (99 retries left).
FAILED - RETRYING: Wait for packer to finish (98 retries left).
FAILED - RETRYING: Wait for packer to finish (97 retries left).
FAILED - RETRYING: Wait for packer to finish (96 retries left).
FAILED - RETRYING: Wait for packer to finish (95 retries left).
FAILED - RETRYING: Wait for packer to finish (94 retries left).
FAILED - RETRYING: Wait for packer to finish (93 retries left).
FAILED - RETRYING: Wait for packer to finish (92 retries left).
FAILED - RETRYING: Wait for packer to finish (91 retries left).
FAILED - RETRYING: Wait for packer to finish (90 retries left).
FAILED - RETRYING: Wait for packer to finish (89 retries left).
FAILED - RETRYING: Wait for packer to finish (88 retries left).
FAILED - RETRYING: Wait for packer to finish (87 retries left).
FAILED - RETRYING: Wait for packer to finish (86 retries left).
FAILED - RETRYING: Wait for packer to finish (85 retries left).
FAILED - RETRYING: Wait for packer to finish (84 retries left).
FAILED - RETRYING: Wait for packer to finish (83 retries left).
FAILED - RETRYING: Wait for packer to finish (82 retries left).
FAILED - RETRYING: Wait for packer to finish (81 retries left).

I set up a cluster using the same instructions last week, when it seemed to work as normal and I could run finish within a few minutes of running terraform apply.

@willfurnass

I think the first problem might be that the 'powertools' repository isn't defined when Ansible tries to install Slurm on the management node: the powertools repo is enabled, but only by a role that is applied much later in the management.yml Ansible playbook.
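
To make the ordering concern concrete, here is a minimal sketch that checks whether one role comes before another in a playbook's role list. The role name "repos" is hypothetical and stands in for whichever role enables powertools. The list below is an invented ordering for illustration, not the actual contents of management.yml.

```python
def role_precedes(roles, first, second):
    """Return True if `first` appears before `second` in the role list."""
    if first not in roles or second not in roles:
        return False
    return roles.index(first) < roles.index(second)


# Hypothetical ordering: "repos" stands in for the role that enables powertools.
roles = ["slurm", "mysql", "ldap", "repos", "finalise"]
print(role_precedes(roles, "repos", "slurm"))
```

With this ordering the check reports that the repository role does not run before the Slurm role, which is the situation described above.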

@milliams added the bug (Something isn't working) label Jul 19, 2021
@milliams
Member

I have identified the cause of the problem. In this case, the "update all packages" script was trying to update to a newer version of Slurm which is not fully packaged yet. I have added a fix to the Ansible package so that a new cluster should now build correctly.
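
The dnf error earlier in the thread can be read as an exact-version pin: the installed slurm-libpmi requires slurm at exactly 20.02.5-1.20, so a newer slurm cannot be installed without a matching newer slurm-libpmi. The sketch below is a deliberate simplification of that rule and not how dnf's solver works. The function name is illustrative, and the version strings are taken from the error output.

```python
def blocked_by_pins(candidate_version, exact_requirements):
    """Return the packages whose exact-version requirement the candidate breaks."""
    return [
        package
        for package, required in exact_requirements.items()
        if required != candidate_version
    ]


# Installed package -> exact slurm version it requires (from the dnf error)
pins = {"slurm-libpmi-20.02.5-1.20": "20.02.5-1.20"}
print(blocked_by_pins("20.11.7-3.el8", pins))
```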

Often the simplest solution with CitC is to destroy and recreate the cluster from scratch to pick up the newer version. In this case, you can instead apply the fix by logging in to the management node and, as root, running /root/update_ansible_repo and then /root/run_ansible.

Will is right though that PowerTools has caused similar issues in the past due to case sensitivity.

@p-1603
Author

p-1603 commented Jul 21, 2021

Hello again,

I created another cluster (Oracle) earlier today and found that the package conflicts in Ansible were resolved, but the Ansible run still only got as far as the point below, where it hung. This means that the finish script still cannot be run.

TASK [finalise : Wait for packer to finish] ************************************
Wednesday 21 July 2021  13:55:30 +0000 (0:00:01.552)       0:09:15.005 ********
FAILED - RETRYING: Wait for packer to finish (100 retries left).
FAILED - RETRYING: Wait for packer to finish (99 retries left).
FAILED - RETRYING: Wait for packer to finish (98 retries left).
FAILED - RETRYING: Wait for packer to finish (97 retries left).
FAILED - RETRYING: Wait for packer to finish (96 retries left).
FAILED - RETRYING: Wait for packer to finish (95 retries left).
FAILED - RETRYING: Wait for packer to finish (94 retries left).
FAILED - RETRYING: Wait for packer to finish (93 retries left).
FAILED - RETRYING: Wait for packer to finish (92 retries left).
FAILED - RETRYING: Wait for packer to finish (91 retries left).
FAILED - RETRYING: Wait for packer to finish (90 retries left).
FAILED - RETRYING: Wait for packer to finish (89 retries left).
FAILED - RETRYING: Wait for packer to finish (88 retries left).
FAILED - RETRYING: Wait for packer to finish (87 retries left).
FAILED - RETRYING: Wait for packer to finish (86 retries left).
FAILED - RETRYING: Wait for packer to finish (85 retries left).
FAILED - RETRYING: Wait for packer to finish (84 retries left).
FAILED - RETRYING: Wait for packer to finish (83 retries left).
FAILED - RETRYING: Wait for packer to finish (82 retries left).
FAILED - RETRYING: Wait for packer to finish (81 retries left).
FAILED - RETRYING: Wait for packer to finish (80 retries left).
FAILED - RETRYING: Wait for packer to finish (79 retries left).
FAILED - RETRYING: Wait for packer to finish (78 retries left).
FAILED - RETRYING: Wait for packer to finish (77 retries left).
FAILED - RETRYING: Wait for packer to finish (76 retries left).
FAILED - RETRYING: Wait for packer to finish (75 retries left).
FAILED - RETRYING: Wait for packer to finish (74 retries left).
FAILED - RETRYING: Wait for packer to finish (73 retries left).
FAILED - RETRYING: Wait for packer to finish (72 retries left).
FAILED - RETRYING: Wait for packer to finish (71 retries left).
FAILED - RETRYING: Wait for packer to finish (70 retries left).
FAILED - RETRYING: Wait for packer to finish (69 retries left).
FAILED - RETRYING: Wait for packer to finish (68 retries left).
FAILED - RETRYING: Wait for packer to finish (67 retries left).
FAILED - RETRYING: Wait for packer to finish (66 retries left).
FAILED - RETRYING: Wait for packer to finish (65 retries left).
FAILED - RETRYING: Wait for packer to finish (64 retries left).

I'll update this if I discover anything more.

@milliams
Member

The packer run can take a long time to finish, especially on Oracle. I have increased the time-out on the latest version on 6 to 200 attempts as well.
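
For readers unfamiliar with this kind of wait, the sketch below shows a generic polling loop with a fixed attempt budget. It is not the CitC task and does not reproduce Ansible's exact retry accounting. The `delay` value and the simulated build that becomes ready on the 150th poll are assumptions for illustration.

```python
import time


def wait_until(check, retries=100, delay=10.0, sleep=time.sleep):
    """Poll `check` until it returns True or the attempts run out.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        if attempt < retries:
            sleep(delay)
    return None


def simulated_build(ready_on_poll):
    """Return a check function that first succeeds on the given poll."""
    polls = iter([False] * (ready_on_poll - 1) + [True])
    return lambda: next(polls)


def no_sleep(seconds):
    """Stand-in for time.sleep so the example runs instantly."""


print(wait_until(simulated_build(150), retries=100, sleep=no_sleep))
print(wait_until(simulated_build(150), retries=200, sleep=no_sleep))
```

A build that needs more polls than the budget allows looks like a hang followed by a failure, while the same build succeeds once the budget is raised.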
