Extend wait_for maas.py, wait_for_* attempts arg
1. maas.py: Extend wait_for states with timeout param

Extend the wait_for states with a timeout parameter.
The timeout value is taken from reclass pillar data if
defined. Otherwise, the states use the default value.
Based on Ting's PR [1], slightly refactored.

2. maas.py: wait_for_*: Add attempts arg

Introduce a new parameter that allows a maximum number of automatic
recovery attempts for the common failures w/ machine operations.
If not present in pillar data, it defaults to 0 (OFF).

Common error states, possible cause and automatic recovery pattern:
* New
  - usually indicates issues with BMC connectivity (no network route),
    but on rare occasions it happens due to the MaaS API being flaky;
  - fix: delete the machine, (re)process machine definitions;
* Failed commissioning
  - various causes, usually a simple retry works;
  - fix: delete the machine, (re)process machine definitions;
* Failed testing
  - incompatible hardware, missing drivers etc.
  - usually consistent and board-specific;
  - fix: override failed testing
* Allocated
  - on rare occasions nodes get stuck in this state instead of 'Deploy';
  - fix: mark-broken, mark-fixed; if it failed at least once before,
    first run a fio test (fixes another unrelated spurious issue with
    encrypted disks from previous deployments); then (re)deploy machines;
* Failed deployment
  - various causes, usually a simple retry works;
  - fix: same as for nodes stuck in 'Allocated';
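
The recovery pattern listed above can be read as a status-to-steps lookup.
A minimal sketch, assuming hypothetical names (`RECOVERY_ACTIONS`,
`plan_recovery`) that are not part of maas.py; the step strings are
descriptive labels only, not MaaS API operations:

```python
# Hypothetical illustration of the recovery pattern described above.
# The step names are labels for this sketch, not MaaS API calls.
RECOVERY_ACTIONS = {
    'New': ['delete', 'process_machines'],
    'Failed commissioning': ['delete', 'process_machines'],
    'Failed testing': ['override_failed_testing'],
    'Allocated': ['mark_broken', 'mark_fixed', 'deploy'],
    'Failed deployment': ['mark_broken', 'mark_fixed', 'deploy'],
}


def plan_recovery(status, failed_before, attempts):
    """Return the recovery steps for one machine, or [] if none apply.

    status        -- machine status string as reported by MaaS
    failed_before -- number of recovery attempts already spent
    attempts      -- maximum number of attempts allowed (0 means OFF)
    """
    if attempts <= 0 or failed_before >= attempts:
        return []
    steps = list(RECOVERY_ACTIONS.get(status, []))
    # A fio test is only added once a previous attempt has already failed.
    if 'deploy' in steps and failed_before > 0:
        steps.insert(steps.index('deploy'), 'fio_test')
    return steps
```

With `attempts` left at its default of 0 the function always returns an
empty list, which matches the "OFF" behaviour described above.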

[1] salt-formulas#34

Change-Id: Ifb7dd9f8fcfbbed557e47d8fdffb1f963604fb15
Signed-off-by: ting wu <[email protected]>
Signed-off-by: Alexandru Avadanii <[email protected]>
alexandruavadanii committed Nov 8, 2018
1 parent d786e5f commit bde9ad6
Showing 6 changed files with 63 additions and 7 deletions.
9 changes: 7 additions & 2 deletions README.rst
@@ -607,12 +607,16 @@ Wait for status of selected machine's:
machines:
- kvm01
- kvm02
timeout: 1200 # in seconds
timeout: {{ region.timeout.ready }}
attempts: {{ region.timeout.attempts }}
req_status: "Ready"
- require:
- cmd: maas_login_admin
...
The timeout setting is taken from the reclass pillar data.
If the pillar data is not defined, it will use the default value.

If the module is run without any extra parameters,
``wait_for_machines_ready`` will wait for the machines defined in salt.
In this case, it is useful to skip some machines:
@@ -627,7 +631,8 @@ In this case, it is useful to skip some machines:
module.run:
- name: maas.wait_for_machine_status
- kwargs:
timeout: 1200 # in seconds
timeout: {{ region.timeout.deployed }}
attempts: {{ region.timeout.attempts }}
req_status: "Deployed"
ignore_machines:
- kvm01 # in case it's broken or whatever
48 changes: 43 additions & 5 deletions _modules/maas.py
@@ -921,6 +921,7 @@ def wait_for_machine_status(cls, **kwargs):
req_status: string; Polling status
machines: list; machine names
ignore_machines: list; machine names
attempts: max number of automatic hard retries
:ret: True
Exception - if something fail/timeout reached
"""
@@ -929,6 +930,8 @@ def wait_for_machine_status(cls, **kwargs):
req_status = kwargs.get("req_status", "Ready")
to_discover = kwargs.get("machines", None)
ignore_machines = kwargs.get("ignore_machines", None)
attempts = kwargs.get("attempts", 0)
failed_attempts = {}
if not to_discover:
try:
to_discover = __salt__['config.get']('maas')['region'][
@@ -943,11 +946,44 @@ def wait_for_machine_status(cls, **kwargs):
while len(total) <= len(to_discover):
for m in to_discover:
for discovered in MachinesStatus.execute()['machines']:
if m == discovered['hostname'] and \
discovered['status'].lower() == req_status.lower():
if m in total:
if m == discovered['hostname'] and m in total:
if discovered['status'].lower() == req_status.lower():
total.remove(m)

elif attempts > 0 and (m not in failed_attempts or
failed_attempts[m] < attempts):
status = discovered['status']
sid = discovered['system_id']
cls._maas = _create_maas_client()
if status in ['Failed commissioning', 'New']:
LOG.info('Machine {0} deleted'.format(sid))
cls._maas.delete(u'api/2.0/machines/{0}/'
.format(sid))
Machine().process()
elif status in ['Failed testing']:
data = {}
LOG.info('Machine {0} overridden'.format(sid))
action = 'override_failed_testing'
# The operation name is an argument to post(), not to str.format()
cls._maas.post(u'api/2.0/machines/{0}/'
.format(sid), action, **data)
elif status in ['Failed deployment', 'Allocated']:
data = {}
LOG.info('Machine {0} mark broken'.format(sid))
cls._maas.post(u'api/2.0/machines/{0}/'
.format(sid), 'mark_broken', **data)
LOG.info('Machine {0} mark fixed'.format(sid))
cls._maas.post(u'api/2.0/machines/{0}/'
.format(sid), 'mark_fixed', **data)
if m in failed_attempts and failed_attempts[m]:
LOG.info('Machine {0} fio test'.format(sid))
data['testing_scripts'] = 'fio'
cls._maas.post(u'api/2.0/machines/{0}/'
.format(sid), 'commission', **data)
DeployMachines().process()
else:
continue
if m not in failed_attempts:
failed_attempts[m] = 0
failed_attempts[m] = failed_attempts[m] + 1
if len(total) <= 0:
LOG.debug(
"Machines:{} are:{}".format(to_discover, req_status))
@@ -959,7 +995,9 @@ def wait_for_machine_status(cls, **kwargs):
"Waiting status:{} "
"for machines:{}"
"\nsleep for:{}s "
"Timeout:{}s".format(req_status, total, poll_time, timeout))
"Timeout:{}s ({}s left)"
.format(req_status, total, poll_time, timeout,
timeout - (time.time() - started_at)))
time.sleep(poll_time)
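
The polling pattern used by the hunk above (poll, compute the time left,
sleep, give up at the deadline) can be sketched in isolation. This is a
simplified illustration under stated assumptions, not the module's code:
`wait_until`, `check`, `clock` and `sleep` are hypothetical names, and the
injectable clock exists only to make the sketch testable.

```python
import time


def wait_until(check, timeout, poll_time=1.0,
               clock=time.time, sleep=time.sleep):
    """Poll check() until it returns True or `timeout` seconds elapse.

    Raises RuntimeError once the deadline has passed.
    """
    started_at = clock()
    while True:
        if check():
            return True
        # Same arithmetic as the "({}s left)" value in the log message.
        left = timeout - (clock() - started_at)
        if left <= 0:
            raise RuntimeError('timeout of {}s reached'.format(timeout))
        # Never sleep past the deadline.
        sleep(min(poll_time, left))
```

Capping the sleep at the remaining time is a design choice of this sketch;
it keeps the overall wait from overshooting `timeout` by up to one
`poll_time`.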


2 changes: 2 additions & 0 deletions maas/machines/wait_for_deployed.sls
@@ -9,5 +9,7 @@ wait_for_machines_deployed:
- name: maas.wait_for_machine_status
- kwargs:
req_status: "Deployed"
timeout: {{ region.timeout.deployed }}
attempts: {{ region.timeout.attempts }}
- require:
- cmd: maas_login_admin
3 changes: 3 additions & 0 deletions maas/machines/wait_for_ready.sls
@@ -7,5 +7,8 @@ maas_login_admin:
wait_for_machines_ready:
module.run:
- name: maas.wait_for_machine_status
- kwargs:
timeout: {{ region.timeout.ready }}
attempts: {{ region.timeout.attempts }}
- require:
- cmd: maas_login_admin
4 changes: 4 additions & 0 deletions maas/map.jinja
@@ -22,6 +22,10 @@ Debian:
bind:
host: 0.0.0.0
port: 80
timeout:
ready: 1200
deployed: 7200
attempts: 0
{%- endload %}

{%- set region = salt['grains.filter_by'](region_defaults, merge=salt['pillar.get']('maas:region', {})) %}
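
The effect of merging the `maas:region` pillar over these defaults can be
sketched in plain Python. This is an illustration only, assuming the merge
behaves as a recursive dictionary overlay; `REGION_DEFAULTS` and
`merge_region` are hypothetical names, and the pillar fragment is made up
for the example.

```python
import copy

# Defaults mirroring the map.jinja values shown in this change.
REGION_DEFAULTS = {
    'bind': {'host': '0.0.0.0', 'port': 80},
    'timeout': {'ready': 1200, 'deployed': 7200, 'attempts': 0},
}


def merge_region(defaults, pillar):
    """Recursively overlay pillar values on a copy of the defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in pillar.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested dicts are merged key by key, not replaced wholesale.
            merged[key] = merge_region(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Example pillar that overrides only one timeout (assumed input).
region = merge_region(REGION_DEFAULTS, {'timeout': {'deployed': 900}})
```

Here `region['timeout']['deployed']` becomes 900 while `ready` and
`attempts` keep their defaults, so a deployment only needs to set the
values it wants to change.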
4 changes: 4 additions & 0 deletions tests/pillar/maas_region.sls
@@ -23,3 +23,7 @@ maas:
password: password
username: maas
salt_master_ip: 127.0.0.1
timeout:
deployed: 900
ready: 900
attempts: 2
