This repository has been archived by the owner on Jul 2, 2021. It is now read-only.
Maciej Gladki edited this page Apr 20, 2018 · 2 revisions

Automatic recoveries

  1. Basic information
  2. Architecture
  3. Communication
  4. Contract
  5. Basic and corner use cases

Basic information

ExpertController is a service responsible for:

  • coordinating automatic recovery actions
  • prompting the operator to approve recovery steps proposed by DAQExpert
  • exposing APIs to accept recovery requests and browse recovery records
  • communicating with Dashboard (the main UI for operators in the control room) via WebSocket to request approvals and update the status of an ongoing recovery

Architecture

ExpertController is a separate microservice (like NotificationManager). It receives requests from DAQExpert and is responsible for controlling the L0 and L0Automator FMs. The main reasons for choosing this design instead of making it an integral part of DAQExpert:

  • Avoid drifting towards a monolithic application (same as for NotificationManager)
  • Avoid a longer outage window caused by increasing server startup time (again a consequence of a monolithic application)
  • Avoid a full redeployment when only minor changes are needed -> smaller mean time to recover
  • Loose coupling - individual microservices stay technology-agnostic, so we can, for example, change the expert technology in the future
  • Agility - the complexity of the code is split up and build time does not grow, which is better for build automation, CI and unit testing
  • Better stability - problems affect only part of the system, not all of it

In short, this follows the current approach of isolating the different subdomains of the whole system:

  • DAQExpert - logic context
  • NotificationManager - distributing notifications
  • ExpertController - performing recovery jobs
  • DAQAggregator - aggregating and persisting monitoring data
  • DAQSnapshotProvider - serving data
  • DAQView - presenting monitoring data

Communication

The communication between DAQExpert and ExpertController is synchronous. Several exceptional cases were analysed to evaluate whether this approach is more suitable than an asynchronous one:

  • DAQExpert tries to recover and the controller is not available
  • DAQExpert tries to recover and the controller crashes during the recovery
  • DAQExpert tries to recover and LVL0 or Automator fails during the recovery
  • DAQExpert tries to recover, finds a more fundamental problem a few snapshots later, and wants to issue another recovery while the old one is still in progress
  • DAQExpert tries to recover but the problem spontaneously fixes itself for any reason

The communication sequence diagram shows the interaction between the three components that take part in a recovery: DAQExpert, ExpertController and Dashboard.
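As an illustration of why a synchronous call makes the "controller is not available" case easy to detect, here is a minimal sketch from the DAQExpert side. The `send` callable, the `Outcome` names and the `accepted` field are hypothetical and are not taken from the actual DAQExpert or ExpertController code.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    CONTROLLER_UNAVAILABLE = "controller unavailable"


def submit_recovery(send: Callable[[dict], dict], request: dict) -> Outcome:
    """Send a recovery request synchronously and classify the result.

    `send` is a hypothetical blocking transport (for example a wrapper
    around an HTTP POST) that returns the decoded response, or raises
    ConnectionError / TimeoutError when the controller cannot be reached.
    """
    try:
        response = send(request)
    except (ConnectionError, TimeoutError):
        # The caller learns about the outage immediately and can decide
        # to retry on a later snapshot.
        return Outcome.CONTROLLER_UNAVAILABLE
    return Outcome.ACCEPTED if response.get("accepted") else Outcome.REJECTED


def unavailable_controller(request: dict) -> dict:
    """Stand-in transport simulating a controller that is down."""
    raise ConnectionError("controller not reachable")


def accepting_controller(request: dict) -> dict:
    """Stand-in transport simulating a controller that accepts everything."""
    return {"accepted": True}
```

With an asynchronous exchange the same outage would only surface as a missing reply, which the expert would have to detect with its own timers.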

Contract

Recovery request

Sent by DAQExpert to ExpertController. Describes the recovery steps to perform.

Recovery response

Sent by ExpertController to DAQExpert. It states whether the recovery was accepted or rejected. There is only one possible rejection reason: another recovery is running. In this case DAQExpert may resend the recovery request, explicitly stating whether to preempt the current recovery, continue or postpone.
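The request/response exchange can be sketched roughly as follows. All class, field and mode names are hypothetical, and the handling of the "continue" and "postpone" modes is only one possible reading of the contract, not the actual ExpertController behaviour.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Mode(Enum):
    NEW = "new"            # plain request, says nothing about a running recovery
    PREEMPT = "preempt"    # replace the running recovery
    CONTINUE = "continue"  # assumed: keep the running recovery going
    POSTPONE = "postpone"  # assumed: queue this request for later


@dataclass
class RecoveryRequest:
    problem_id: int
    steps: List[str]
    mode: Mode = Mode.NEW


@dataclass
class RecoveryResponse:
    accepted: bool
    rejection_reason: Optional[str] = None


class ControllerSketch:
    """Toy controller that runs at most one recovery at a time."""

    def __init__(self) -> None:
        self.current: Optional[RecoveryRequest] = None
        self.postponed: List[RecoveryRequest] = []

    def handle(self, request: RecoveryRequest) -> RecoveryResponse:
        if self.current is None:
            self.current = request
            return RecoveryResponse(accepted=True)
        if request.mode is Mode.NEW:
            # The only rejection reason named in the contract.
            return RecoveryResponse(False, "another recovery is running")
        if request.mode is Mode.PREEMPT:
            self.current = request
        elif request.mode is Mode.POSTPONE:
            self.postponed.append(request)
        # Mode.CONTINUE: assumed to leave the running recovery untouched.
        return RecoveryResponse(accepted=True)
```

A plain second request is rejected while the first recovery is running; resending it with `Mode.PREEMPT` makes it the current recovery in this sketch.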

Recovery status

Sent by ExpertController to Dashboard when a recovery has been accepted for processing by ExpertController, and on each subsequent update.

Approval request

Sent by ExpertController to Dashboard to prompt the operator to approve a given recovery.

Approval response

Sent by Dashboard to ExpertController on operator action.
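The three Dashboard-facing messages could be encoded as JSON text frames on the websocket, for example as in the sketch below. The `type` values and field names (`recoveryId`, `stepIndex`, `approved` and so on) are assumptions made for illustration; the real wire format is not described here.

```python
import json
from typing import Tuple


def recovery_status(recovery_id: int, status: str, step_index: int) -> str:
    """Build a (hypothetical) recovery status message for the Dashboard."""
    return json.dumps({
        "type": "recovery-status",
        "recoveryId": recovery_id,
        "status": status,
        "stepIndex": step_index,
    })


def approval_request(recovery_id: int, step_index: int, description: str) -> str:
    """Build a (hypothetical) approval request for one recovery step."""
    return json.dumps({
        "type": "approval-request",
        "recoveryId": recovery_id,
        "stepIndex": step_index,
        "description": description,
    })


def parse_approval_response(raw: str) -> Tuple[int, int, bool]:
    """Decode an approval response sent by the Dashboard on operator action.

    Returns (recovery id, step index, approved) and raises ValueError when
    the message is of a different type.
    """
    data = json.loads(raw)
    if data.get("type") != "approval-response":
        raise ValueError("not an approval response")
    return int(data["recoveryId"]), int(data["stepIndex"]), bool(data["approved"])
```

Carrying the recovery id and step index in every message lets the controller match an operator's answer to the exact step it asked about.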

Use cases

Basic scenarios

  1. Single-condition situation - one recovery solves it
  2. Multiple-condition situation - two recovery steps solve it
  3. Preemption before accepted
  4. Preemption after accepted

Corner cases

  1. Multiple clients have the Dashboard open and try to approve the same request
  2. Client connects to Dashboard in the middle of a recovery
  3. Recovery interrupted
  4. Less important condition emerges while the previous one is in its observe period - results in postponement
  5. More important condition preempts the current one, then disappears quickly, and the less important one comes back for acceptance
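The first two corner cases can be sketched as follows: when several Dashboard clients answer the same approval request, only the first answer counts, and a client that connects mid-recovery receives a snapshot of the current decisions. The `ApprovalBoard` class and its methods are hypothetical and only illustrate one way to handle these cases.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[int, int]  # (recovery id, step index)


class ApprovalBoard:
    """Tracks approval requests so that only the first operator answer counts."""

    def __init__(self) -> None:
        # Decision per request; None while the request is still pending.
        self._decisions: Dict[Key, Optional[bool]] = {}
        # Which client made each decision.
        self._decided_by: Dict[Key, str] = {}

    def request(self, recovery_id: int, step_index: int) -> None:
        """Register a pending approval request (idempotent)."""
        self._decisions.setdefault((recovery_id, step_index), None)

    def respond(self, recovery_id: int, step_index: int,
                client: str, approved: bool) -> bool:
        """Apply a client's answer; return False if it came too late."""
        key = (recovery_id, step_index)
        if key not in self._decisions or self._decisions[key] is not None:
            return False  # unknown request, or already decided by someone
        self._decisions[key] = approved
        self._decided_by[key] = client
        return True

    def snapshot(self) -> Dict[Key, Optional[bool]]:
        """State to replay to a client that connects in the middle of a recovery."""
        return dict(self._decisions)
```

In this sketch a second client's answer to an already decided request is simply ignored, so two operators cannot give conflicting decisions for the same step.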

Exceptional cases

  1. Controller is unavailable when the expert sends requests