High Availability
Stephen von Takach edited this page Dec 12, 2016
·
1 revision
A full copy of the database will be replicated to each edge location using XDCR (Couchbase Cross Data Center Replication).
This is how each edge controller is represented in the database:
- class ControlSystem
  - edge_control_id (optional)
- class EdgeControl
  - doc name – edge
  - name – String, the name of that edge location
  - description
  - failover – Boolean, whether the master server should take over if this location goes down
  - timeout – Failover timeout, how long to wait before acting on the failure
  - window_start – A CRON string representing the start of the restore window (if undefined, the location must be restored manually)
  - window_length – The length of the restore window
  - admins – Array of users who can administer this location
  - commit – The current commit version of the edge location's code
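As a rough illustration of the two documents described above, the fields could be modelled like this. This is only a sketch: the Python types, the default values, the time units and the two helper methods are assumptions made for the example, not part of the actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EdgeControl:
    """One edge location, mirroring the fields listed above."""
    name: str
    description: str = ""
    failover: bool = False               # should the master take over on failure?
    timeout: int = 0                     # failover timeout (unit assumed: seconds)
    window_start: Optional[str] = None   # CRON string; None means manual restore
    window_length: Optional[int] = None  # restore window length (unit assumed: seconds)
    admins: List[str] = field(default_factory=list)
    commit: Optional[str] = None         # current commit of the edge location's code

    def requires_manual_restore(self) -> bool:
        # Hypothetical helper: no restore window defined means manual restore.
        return self.window_start is None


@dataclass
class ControlSystem:
    name: str
    edge_control_id: Optional[str] = None  # optional link to an EdgeControl document

    def is_edge_controlled(self) -> bool:
        # Assumption: a system without edge_control_id is run by the master.
        return self.edge_control_id is not None
```

For example, `ControlSystem(name="Lobby")` would be treated as master-controlled, while `ControlSystem(name="Room", edge_control_id="edge-1")` would belong to an edge location.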
The master control system boots, including module loading, as follows:
- Load all the edge locations
- Load all the edge systems
- Build a list of edge location modules
- Load all modules, except for those controlled by edge locations
- Wait for edge locations to connect (they retry every 3 seconds, so wait at most 6 seconds)
- Perform any failover actions that are required
- Mark the system as booted
- Inform edge servers that the master is ready
- Accept API requests
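The step that separates master modules from edge modules in the boot sequence above could be sketched as follows. The dictionary shape (`modules`, `edge_control_id`) and the function name are assumptions for illustration only. The 6 second limit is expressed as two of the 3 second retry intervals mentioned above.

```python
from typing import Dict, List, Set, Tuple

EDGE_RETRY_INTERVAL = 3                  # seconds between edge reconnect attempts
MAX_EDGE_WAIT = 2 * EDGE_RETRY_INTERVAL  # 6 seconds: two retry intervals


def partition_modules(systems: List[Dict]) -> Tuple[Set[str], Set[str]]:
    """Split module IDs into (master_modules, edge_modules).

    A module that appears in any edge-controlled system is treated as an
    edge module, even if a master-controlled system also references it.
    """
    all_modules: Set[str] = set()
    edge_modules: Set[str] = set()
    for system in systems:
        modules = system.get("modules", [])
        all_modules.update(modules)
        if system.get("edge_control_id") is not None:
            edge_modules.update(modules)
    # The master loads everything except the edge-controlled modules.
    return all_modules - edge_modules, edge_modules
```

With one master system holding `mod-a` and `mod-b`, and one edge system holding `mod-b` and `mod-c`, the master would load only `mod-a` directly and reach the other two through the edge location.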
Modules controlled by edge locations will still be contactable from control systems on the master server via proxies and promises over a TCP link to the edge location.
An edge server uses an environment variable to detect that it is an edge system. It then:
- Loads only its edge location document
- Loads the systems under its control (map-reduce)
- Loads a list of modules it is to control
- Attempts to connect to the master server
The master is considered disconnected once the edge has waited 3 seconds and the following reconnect attempt has also failed
- Loads the modules and connects to devices if it has not already done so
- Continues trying to reconnect to the master
- Authenticates with master using a shared secret over SSL
- Negotiates control
  - If the edge location has loaded its modules and connected to devices, the edge location wins (this assumes some kind of master or network outage)
  - If the negotiation happens during the restore window, the edge location wins
  - If the master is still loading (and hasn't loaded edge modules), the edge location wins
  - Otherwise the master wins (and will take control of the devices)
- Loads the modules and connects to devices if it won and has not already done so
Effectively, the master only wins control if it already has control.
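The negotiation rules above can be read as a simple ordered decision. The sketch below is one way to express that; the `NegotiationState` type, its field names and the string return values are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class NegotiationState:
    edge_devices_connected: bool  # edge has loaded modules and connected to devices
    in_restore_window: bool       # negotiation happens inside the restore window
    master_booting: bool          # master is loading and hasn't loaded edge modules


def negotiate_control(state: NegotiationState) -> str:
    """Return "edge" or "master", applying the rules in the order listed."""
    if state.edge_devices_connected:
        return "edge"    # assume a master or network outage occurred
    if state.in_restore_window:
        return "edge"
    if state.master_booting:
        return "edge"
    return "master"      # master takes control of the devices
```

Under these rules the master only wins when all three conditions are false, which matches the observation that it effectively keeps control only when it already has it.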
- A TCP messaging connection is maintained
- The following requests are accepted:
  - Repository request (reset, pull, etc., possibly followed by a live reload)
  - Live reload (might include file data if edited inline)
  - Data model updated (settings, IP address, port, URI, etc.)
  - Expire system cache (due to data update)
  - Stop / Start system
  - Execute method request
  - Status request
  - Debug message proxy
Multiple requests can execute simultaneously, each tracked by an ID.
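One common way to track several in-flight requests over a single connection is to tag each with an ID and match replies against a table of pending requests. The sketch below shows that idea only. The JSON message shape, the `RequestTracker` class and its method names are assumptions, not the actual wire protocol, and the class is not thread-safe as written.

```python
import itertools
import json
from concurrent.futures import Future
from typing import Any, Dict, Tuple


class RequestTracker:
    """Matches replies to outstanding requests by ID."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._pending: Dict[int, Future] = {}

    def send(self, kind: str, payload: Dict[str, Any]) -> Tuple[str, Future]:
        """Build a message to write to the link and a future for its reply."""
        request_id = next(self._ids)
        future: Future = Future()
        self._pending[request_id] = future
        message = json.dumps({"id": request_id, "type": kind, "payload": payload})
        return message, future

    def receive(self, raw: str) -> None:
        """Resolve the future whose ID matches the reply; ignore unknown IDs."""
        reply = json.loads(raw)
        future = self._pending.pop(reply["id"], None)
        if future is not None:
            future.set_result(reply.get("result"))
```

Because replies are matched by ID, they can arrive in any order: a status request sent first can complete after an execute request sent second.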
- Client must be aware of the edge server
- Edge server only services API (no interface code)
- Interface code needs to be cached on the client device
- Master and edge should be able to direct clients to each other (when restores or failures occur)
  - Clients can ignore this
  - If ignored, the edge or master server will do its best to proxy requests (this might not be possible in the case of failure)
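From the client's side, the redirect behaviour described above might be handled roughly as follows. This is a minimal sketch: the `ApiClient` class, the `follow_redirects` flag and the idea that a server passes a suggested URL are all assumptions made for illustration.

```python
class ApiClient:
    """Tracks which server (master or edge) the client currently talks to."""

    def __init__(self, current_url: str, follow_redirects: bool = True) -> None:
        self.current_url = current_url
        self.follow_redirects = follow_redirects

    def on_redirect_hint(self, suggested_url: str) -> str:
        """Handle a server's suggestion to use the other server.

        A client that ignores the hint keeps its current URL and relies on
        the server proxying requests, which may fail during an outage.
        """
        if self.follow_redirects:
            self.current_url = suggested_url
        return self.current_url
```

A client constructed with `follow_redirects=False` keeps talking to its original server and depends on best-effort proxying.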