Home
- Basic information
- Architecture
- Communication
- Contract
- Basic and corner use cases
ExpertController is a service responsible for:
- coordinating automatic recovery actions
- prompting the operator for approval of recovery steps proposed by DAQExpert
- exposing APIs to accept recovery requests and browse recovery records
- communicating with the Dashboard (the main UI for operators in the control room) via WebSocket to request approvals and update the status of an ongoing recovery
ExpertController is a separate microservice (like NotificationManager). It receives requests from DAQExpert and is responsible for controlling the L0 and L0Automator FMs. The main reasons for this approach, rather than making it an integral part of DAQExpert:
- Avoid moving towards a monolithic application (same as for NotificationManager)
- Avoid a longer outage window caused by increased server startup time (again a consequence of a monolithic application)
- Avoid full redeployment when only minor changes are needed, which gives a smaller mean time to recover
- Loose coupling - individual microservices remain technology-agnostic, so we can, for example, change the expert technology in the future
- Agility - splitting the complexity of the code and avoiding increased build time, which benefits build automation, CI and unit testing
- Better stability - problems affect only part of the system, not all of it
In short, this follows the current approach of isolating the different subdomains of the whole system:
- DAQExpert - logic context
- NotificationManager - distributing notifications
- ExpertController - performing recovery jobs
- DAQAggregator - aggregating and persisting monitoring data
- DAQSnapshotProvider - serving data
- DAQView - presenting monitoring data
The communication between DAQExpert and ExpertController is synchronous. Numerous exceptional cases were analysed to evaluate whether this is a more suitable approach than asynchronous communication:
- DAQExpert tries to recover and the controller is not available
- DAQExpert tries to recover and the controller crashes during the recovery
- DAQExpert tries to recover and LVL0 or Automator fails during the recovery
- DAQExpert tries to recover, finds a more fundamental problem a few snapshots later and wants to issue another recovery while the old one is in progress
- DAQExpert tries to recover but the problem spontaneously fixes itself for any reason
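With a synchronous call, the sender learns within the same call whether the request was accepted, rejected, or could not be delivered at all. The following is only a minimal sketch of that idea from the DAQExpert side. The `send` callable, the dictionary-based request and response, and the returned status strings are all hypothetical and are not taken from the actual implementation.

```python
from typing import Callable


def request_recovery(send: Callable[[dict], dict], request: dict) -> str:
    """Send one recovery request and block until its outcome is known.

    `send` is a hypothetical transport function (for example a wrapper
    around an HTTP client) that returns the controller's response as a
    dict or raises ConnectionError/TimeoutError if it cannot be reached.
    """
    try:
        response = send(request)
    except (ConnectionError, TimeoutError):
        # Case "controller is not available": the caller knows immediately.
        return "controller unavailable"
    # Assumed response shape: {"accepted": bool}
    return "accepted" if response.get("accepted") else "rejected"
```

The caller can then decide on the spot how to react, for example by retrying the request on a later snapshot.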
The communication sequence diagram shows the interaction between the 3 components taking part in a recovery: DAQExpert, ExpertController and Dashboard.
Sent by DAQExpert to ExpertController. Describes the recovery steps.
Sent by ExpertController to DAQExpert. It states whether the recovery was accepted or rejected. There is only one possible rejection reason: another recovery is running. In this case DAQExpert may resend the recovery request, explicitly stating whether to preempt the current recovery, continue with one or postpone one.
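The accept/reject exchange can be sketched as follows. This is an illustration under stated assumptions, not the actual ExpertController code: the class and field names are hypothetical, and the exact semantics of "continue" (append the new steps to the running recovery) and "postpone" (queue the new request behind the running one) are guesses made only to keep the example concrete.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Mode(Enum):
    """How a resent request should treat the recovery already running."""
    PREEMPT = "preempt"
    CONTINUE = "continue"
    POSTPONE = "postpone"


@dataclass
class RecoveryRequest:
    problem: str
    steps: List[str]
    mode: Optional[Mode] = None  # None on the first, unqualified attempt


@dataclass
class RecoveryResponse:
    accepted: bool
    reason: Optional[str] = None


class Controller:
    def __init__(self) -> None:
        self.current: Optional[RecoveryRequest] = None
        self.postponed: List[RecoveryRequest] = []

    def submit(self, request: RecoveryRequest) -> RecoveryResponse:
        if self.current is None:
            self.current = request
            return RecoveryResponse(accepted=True)
        if request.mode is None:
            # The only rejection reason: another recovery is running.
            return RecoveryResponse(False, "other recovery is running")
        if request.mode is Mode.PREEMPT:
            self.current = request                    # replace the running recovery
        elif request.mode is Mode.CONTINUE:
            self.current.steps.extend(request.steps)  # assumed semantics
        else:
            self.postponed.append(request)            # assumed semantics
        return RecoveryResponse(accepted=True)
```

A first request is accepted, a second unqualified request is rejected, and resending the second one with `Mode.PREEMPT` replaces the running recovery.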
Sent by ExpertController to Dashboard when a recovery has been accepted for processing by ExpertController, and on each subsequent update.
Sent by ExpertController to Dashboard to prompt the operator for approval of a given recovery.
Sent by Dashboard to ExpertController on operator action.
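The three messages exchanged with the Dashboard could travel over the WebSocket as JSON text frames. The sketch below shows one possible encoding, with a `type` discriminator so that the receiver can rebuild the right object. The message names, their fields and the envelope layout are hypothetical and were chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RecoveryStatus:       # ExpertController -> Dashboard
    recovery_id: int
    state: str              # hypothetical values, e.g. "accepted", "finished"


@dataclass
class ApprovalRequest:      # ExpertController -> Dashboard
    recovery_id: int
    step: str


@dataclass
class ApprovalResponse:     # Dashboard -> ExpertController
    recovery_id: int
    approved: bool


# Lookup table from the "type" field back to the message class.
MESSAGE_TYPES = {
    cls.__name__: cls
    for cls in (RecoveryStatus, ApprovalRequest, ApprovalResponse)
}


def encode(message) -> str:
    """Serialize a message into a JSON text frame."""
    return json.dumps({"type": type(message).__name__, "payload": asdict(message)})


def decode(frame: str):
    """Rebuild the message object from a JSON text frame."""
    envelope = json.loads(frame)
    return MESSAGE_TYPES[envelope["type"]](**envelope["payload"])
```

For example, `decode(encode(ApprovalResponse(7, True)))` gives back an equal `ApprovalResponse`, so both sides can share the same message definitions.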
- Single condition situation - one recovery solves it
- Multiple condition situation - 2 recovery steps solve it
- Preemption before accepted
- Preemption after accepted
- Multiple clients have the Dashboard open and try to approve the request
- Client connects to Dashboard in the middle of recovery
- Recovery interrupted
- Less important condition emerges while the previous one is in its observe period - results in postponement
- More important condition preempts one, then disappears quickly, and the less important one comes back to be accepted
- Controller is unavailable and the expert sends requests