title | dep-number | creation-date | status | authors | reviewers
---|---|---|---|---|---
operator out-of-band tasks | 5 | 6th Dec'2023 | implementable | |
- DEP-05: Operator Out-of-band Tasks
- Table of Contents
- Summary
- Terminology
- Motivation
- Goals
- Non-Goals
- Proposal
- Metrics
This DEP proposes an enhancement to etcd-druid's capabilities to handle out-of-band tasks, which are presently performed manually or invoked programmatically via suboptimal APIs. It proposes a unified interface, defined as a well-structured API, to harmonize how any out-of-band task is initiated, to monitor its status, and to simplify adding new tasks and managing their lifecycles.
- etcd-druid: an operator that manages etcd clusters.
- backup-sidecar: the etcd-backup-restore sidecar container running in each etcd-member pod of an etcd cluster.
- leading-backup-sidecar: a backup-sidecar that is associated with the etcd leader of an etcd cluster.
- out-of-band task: any on-demand task/operation that can be executed on an etcd cluster without modifying the Etcd custom resource spec (desired state).
Today, etcd-druid mainly acts as an etcd cluster provisioner (creation, maintenance and deletion). In future, the capabilities of etcd-druid will be enhanced via the etcd-member proposal, which gives it access to much more detailed information about each etcd cluster member. While we enhance the reconciliation and monitoring capabilities of etcd-druid, it still lacks the ability to let users invoke out-of-band tasks on an existing etcd cluster.
Operating etcd clusters at scale has brought new learnings. We regularly need to trigger out-of-band tasks which are outside the purview of a regular etcd reconciliation run. Many of these tasks are multi-step processes, and performing them manually is error-prone, even if an operator follows a well-written step-by-step guide. Thus, there is a need to automate these tasks.
Some examples of on-demand/out-of-band tasks:
- Recover from a permanent quorum loss of etcd cluster.
- Trigger an on-demand full/delta snapshot.
- Trigger an on-demand snapshot compaction.
- Trigger an on-demand maintenance of etcd cluster.
- Copy the backups from one object store to another object store.
- Establish a unified interface for operator tasks by defining a single dedicated custom resource for out-of-band tasks.
- Define a contract (in terms of prerequisites) which needs to be adhered to by any task implementation.
- Facilitate the easy addition of new out-of-band task(s) through this custom resource.
- Provide CLI capabilities to operators, making it easy to invoke supported out-of-band tasks.
- In the current scope, the capability to abort/suspend an out-of-band task will not be provided. This could be considered as a later enhancement, based on demand.
- Ordering (by establishing dependencies) of out-of-band tasks submitted for the same etcd cluster has not been considered in the first increment. In a future version, based on how operator tasks are used, we will enhance this proposal and the implementation.
The authors propose creating a single dedicated custom resource to represent an out-of-band task. Etcd-druid will be enhanced to process the task requests and update the task status, which can then be tracked/observed.
EtcdOperatorTask is the new custom resource that will be introduced. This API will be in version v1alpha1 and will be subject to change. We will respect the Kubernetes Deprecation Policy.
// EtcdOperatorTask represents an out-of-band operator task resource.
type EtcdOperatorTask struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	// Spec is the specification of the EtcdOperatorTask resource.
	Spec EtcdOperatorTaskSpec `json:"spec"`
	// Status is the most recently observed status of the EtcdOperatorTask resource.
	Status EtcdOperatorTaskStatus `json:"status,omitempty"`
}
The authors propose that the following fields should be specified in the spec (desired state) of the EtcdOperatorTask custom resource.
- To capture the type of out-of-band operator task to be performed, a .spec.type field should be defined. It can take the name of any supported out-of-band task, e.g. "OnDemandSnapshotTask", "QuorumLossRecoveryTask" etc.
- To capture the configuration specific to each task, a .spec.config field of type string should be defined, as each task can have a different input configuration.
// EtcdOperatorTaskSpec is the spec for a EtcdOperatorTask resource.
type EtcdOperatorTaskSpec struct {
	// Type specifies the type of out-of-band operator task to be performed.
	Type string `json:"type"`
	// Config is a task specific configuration.
	Config string `json:"config,omitempty"`
	// TTLSecondsAfterFinished is the time-to-live to garbage collect the
	// related resource(s) of task once it has been completed.
	// +optional
	TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
	// OwnerEtcdReference refers to the name and namespace of the corresponding
	// Etcd owner for which the task has been invoked.
	OwnerEtcdRefrence types.NamespacedName `json:"ownerEtcdRefrence"`
}
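A controller (or an admission check) would likely want to refuse a spec early when the type is not a supported task or the owner reference is incomplete. The following is a minimal sketch of such a check, not the etcd-druid implementation: `NamespacedName` is a local stand-in for `types.NamespacedName`, while `supportedTasks` and `validateSpec` are hypothetical names, and the two registered task types are simply the examples mentioned above.

```go
package main

import (
	"errors"
	"fmt"
)

// NamespacedName is a local stand-in for types.NamespacedName.
type NamespacedName struct {
	Namespace string
	Name      string
}

// EtcdOperatorTaskSpec mirrors a subset of the proposed spec fields.
type EtcdOperatorTaskSpec struct {
	Type              string
	Config            string
	OwnerEtcdRefrence NamespacedName
}

// supportedTasks is a hypothetical registry of supported task types.
var supportedTasks = map[string]bool{
	"OnDemandSnapshotTask":   true,
	"QuorumLossRecoveryTask": true,
}

// validateSpec returns an error if the spec names an unsupported task type
// or does not fully identify the owning Etcd resource.
func validateSpec(spec EtcdOperatorTaskSpec) error {
	if !supportedTasks[spec.Type] {
		return fmt.Errorf("unsupported task type %q", spec.Type)
	}
	if spec.OwnerEtcdRefrence.Name == "" || spec.OwnerEtcdRefrence.Namespace == "" {
		return errors.New("ownerEtcdRefrence must set both name and namespace")
	}
	return nil
}

func main() {
	spec := EtcdOperatorTaskSpec{
		Type:              "OnDemandSnapshotTask",
		OwnerEtcdRefrence: NamespacedName{Namespace: "shoot--foo--bar", Name: "etcd-main"},
	}
	if err := validateSpec(spec); err != nil {
		fmt.Println("invalid spec:", err)
		return
	}
	fmt.Println("spec accepted")
}
```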
The authors propose the following fields for the status (current state) of the EtcdOperatorTask custom resource to monitor the progress of the task.
// EtcdOperatorTaskStatus is the status for a EtcdOperatorTask resource.
type EtcdOperatorTaskStatus struct {
	// ObservedGeneration is the most recent generation observed for the resource.
	ObservedGeneration *int64 `json:"observedGeneration,omitempty"`
	// State is the last known state of the task.
	State TaskState `json:"state"`
	// InitiatedAt is the time at which the task moved from the "Pending" state to any other state.
	InitiatedAt metav1.Time `json:"initiatedAt"`
	// LastErrors represents the errors encountered when processing the task.
	// +optional
	LastErrors []LastError `json:"lastErrors,omitempty"`
	// LastOperation captures the status of the last operation if the task involves multiple stages.
	// +optional
	LastOperation *LastOperation `json:"lastOperation,omitempty"`
}

// LastOperation stores details of the most recent operation (stage) of the task.
type LastOperation struct {
	// Name of the LastOperation.
	Name opsName `json:"name"`
	// State of the last operation, one of Pending, InProgress, Completed, Failed.
	State OperationState `json:"state"`
	// LastTransitionTime is the time at which the operation state last transitioned from one state to another.
	LastTransitionTime metav1.Time `json:"lastTransitionTime"`
	// Reason is a human-readable message indicating details about the last operation.
	Reason string `json:"reason"`
}

// LastError stores details of the most recent error encountered for the task.
type LastError struct {
	// Code is an error code that uniquely identifies an error.
	Code ErrorCode `json:"code"`
	// Description is a human-readable message indicating details of the error.
	Description string `json:"description"`
	// ObservedAt is the time at which the error was observed.
	ObservedAt metav1.Time `json:"observedAt"`
}

// TaskState represents the state of the task.
type TaskState string

const (
	TaskStateFailed     TaskState = "Failed"
	TaskStatePending    TaskState = "Pending"
	TaskStateRejected   TaskState = "Rejected"
	TaskStateSucceeded  TaskState = "Succeeded"
	TaskStateInProgress TaskState = "InProgress"
)

// OperationState represents the state of last operation.
type OperationState string

const (
	OperationStateFailed     OperationState = "Failed"
	OperationStatePending    OperationState = "Pending"
	OperationStateCompleted  OperationState = "Completed"
	OperationStateInProgress OperationState = "InProgress"
)
apiVersion: druid.gardener.cloud/v1alpha1
kind: EtcdOperatorTask
metadata:
  name: <name of operator task resource>
  namespace: <cluster namespace>
  generation: <specific generation of the desired state>
spec:
  type: <type/category of supported out-of-band task>
  ttlSecondsAfterFinished: <time-to-live to garbage collect the custom resource after it has been completed>
  config: <task specific configuration>
  ownerEtcdRefrence: <refer to corresponding etcd owner name and namespace for which task has been invoked>
status:
  observedGeneration: <specific observedGeneration of the resource>
  state: <last known current state of the out-of-band task>
  initiatedAt: <time at which the task moved from the "pending" state to any other state>
  lastErrors:
    - code: <error-code>
      description: <description of the error>
      observedAt: <time the error was observed>
  lastOperation:
    name: <operation-name>
    state: <task state as seen at the completion of last operation>
    lastTransitionTime: <time of transition to this state>
    reason: <reason/message if any>
Task(s) can be created by creating an instance of the EtcdOperatorTask
custom resource specific to a task.
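To illustrate what creating such an instance could look like programmatically, the sketch below assembles an EtcdOperatorTask manifest and prints it as JSON, a format that `kubectl apply -f` also accepts. It is only a sketch under stated assumptions: the Go types are simplified local stand-ins for the proposed API types, `newTaskManifest` is a hypothetical helper, the lowercase `name`/`namespace` keys under `ownerEtcdRefrence` are an assumption about the serialized form, and the resource name, namespace and TTL are made-up values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ownerRef is a stand-in for the owner reference; its JSON keys are assumed.
type ownerRef struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

// taskSpec mirrors the proposed EtcdOperatorTaskSpec fields.
type taskSpec struct {
	Type                    string   `json:"type"`
	Config                  string   `json:"config,omitempty"`
	TTLSecondsAfterFinished *int32   `json:"ttlSecondsAfterFinished,omitempty"`
	OwnerEtcdRefrence       ownerRef `json:"ownerEtcdRefrence"`
}

type metadata struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

type taskManifest struct {
	APIVersion string   `json:"apiVersion"`
	Kind       string   `json:"kind"`
	Metadata   metadata `json:"metadata"`
	Spec       taskSpec `json:"spec"`
}

// newTaskManifest builds a task without task-specific config. It assumes the
// task lives in the same namespace as the Etcd resource it targets.
func newTaskManifest(name, namespace, taskType, etcdName string, ttl int32) taskManifest {
	return taskManifest{
		APIVersion: "druid.gardener.cloud/v1alpha1",
		Kind:       "EtcdOperatorTask",
		Metadata:   metadata{Name: name, Namespace: namespace},
		Spec: taskSpec{
			Type:                    taskType,
			TTLSecondsAfterFinished: &ttl,
			OwnerEtcdRefrence:       ownerRef{Name: etcdName, Namespace: namespace},
		},
	}
}

func main() {
	m := newTaskManifest("recover-etcd-main", "shoot--foo--bar", "QuorumLossRecoveryTask", "etcd-main", 3600)
	out, err := json.MarshalIndent(m, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```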
Note: In future, either a kubectl extension plugin or a druidctl tool will be introduced. Dedicated sub-commands will be created for each out-of-band task. This will considerably improve usability for an operator performing such tasks, as the CLI extension will automatically create the relevant instance(s) of EtcdOperatorTask with the provided configuration.
- The authors propose to introduce a new controller which watches for EtcdOperatorTask custom resources.
- Each out-of-band task may have some task-specific configuration defined in .spec.config.
- The controller needs to parse this task-specific config, which comes as a string, according to the schema defined for each task.
- For every out-of-band task, a set of pre-conditions can be defined. These pre-conditions are evaluated against the current state of the target etcd cluster. Based on the evaluation result (boolean), the task is permitted or denied execution.
- If multiple tasks are invoked simultaneously or are in pending state, then they will be executed in a First-In-First-Out (FIFO) manner.
Note: Dependent ordering among tasks will be addressed later, which will enable concurrent execution of tasks when possible.
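The pre-condition check and FIFO selection described above can be sketched as follows. This is an illustration only: `task`, `clusterState`, `preCondition`, `nextPending` and `admit` are hypothetical names, FIFO order is assumed to follow the creation time of the task, and a task whose pre-conditions fail is assumed to move to the Rejected state.

```go
package main

import (
	"fmt"
	"time"
)

// TaskState mirrors a subset of the proposed task states.
type TaskState string

const (
	TaskStatePending    TaskState = "Pending"
	TaskStateRejected   TaskState = "Rejected"
	TaskStateInProgress TaskState = "InProgress"
)

// clusterState is a hypothetical snapshot of the target etcd cluster.
type clusterState struct {
	HasQuorum bool
}

// preCondition reports whether a task may run against the given cluster state.
type preCondition func(c clusterState) bool

// task is a simplified, hypothetical view of an EtcdOperatorTask.
type task struct {
	Name      string
	CreatedAt time.Time
	State     TaskState
}

// nextPending returns the oldest pending task (FIFO by creation time),
// or nil if no task is pending.
func nextPending(tasks []*task) *task {
	var oldest *task
	for _, t := range tasks {
		if t.State != TaskStatePending {
			continue
		}
		if oldest == nil || t.CreatedAt.Before(oldest.CreatedAt) {
			oldest = t
		}
	}
	return oldest
}

// admit moves the task to InProgress if every pre-condition holds,
// and to Rejected otherwise.
func admit(t *task, c clusterState, conds ...preCondition) {
	for _, cond := range conds {
		if !cond(c) {
			t.State = TaskStateRejected
			return
		}
	}
	t.State = TaskStateInProgress
}

func main() {
	now := time.Now()
	tasks := []*task{
		{Name: "snapshot-b", CreatedAt: now, State: TaskStatePending},
		{Name: "snapshot-a", CreatedAt: now.Add(-time.Minute), State: TaskStatePending},
	}
	hasQuorum := func(c clusterState) bool { return c.HasQuorum }

	next := nextPending(tasks)
	admit(next, clusterState{HasQuorum: true}, hasQuorum)
	fmt.Println(next.Name, next.State)
}
```

Keeping pre-conditions as plain functions over a cluster snapshot would let each task type declare its own list without the controller knowing task internals.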
Upon completion of the task, irrespective of its final state, etcd-druid will ensure the garbage collection of the task custom resource and any other Kubernetes resources created to execute the task. This will be done according to .spec.ttlSecondsAfterFinished if it is defined in the spec; otherwise a default expiry time will be assumed.
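The expiry decision could look roughly like the sketch below. Several details are assumptions rather than part of the proposal: the one-hour `defaultTTL` is an arbitrary placeholder because no default value is specified, `FinishedAt` is a hypothetical completion timestamp that the controller would have to track (the proposed status has no such field), and Failed, Rejected and Succeeded are treated as the final states.

```go
package main

import (
	"fmt"
	"time"
)

// TaskState mirrors the proposed task states.
type TaskState string

const (
	TaskStateFailed     TaskState = "Failed"
	TaskStatePending    TaskState = "Pending"
	TaskStateRejected   TaskState = "Rejected"
	TaskStateSucceeded  TaskState = "Succeeded"
	TaskStateInProgress TaskState = "InProgress"
)

// defaultTTL is a placeholder; the proposal does not fix a default value.
const defaultTTL = time.Hour

// finishedTask is a simplified, hypothetical view of a task.
type finishedTask struct {
	State                   TaskState
	FinishedAt              time.Time // assumed to be tracked by the controller
	TTLSecondsAfterFinished *int32
}

// isFinal reports whether the state is one the task cannot leave again.
func isFinal(s TaskState) bool {
	switch s {
	case TaskStateFailed, TaskStateRejected, TaskStateSucceeded:
		return true
	}
	return false
}

// expired reports whether the task has finished and its TTL has elapsed,
// falling back to defaultTTL when no TTL is set in the spec.
func expired(t finishedTask, now time.Time) bool {
	if !isFinal(t.State) {
		return false
	}
	ttl := defaultTTL
	if t.TTLSecondsAfterFinished != nil {
		ttl = time.Duration(*t.TTLSecondsAfterFinished) * time.Second
	}
	return !now.Before(t.FinishedAt.Add(ttl))
}

func main() {
	ttl := int32(600)
	t := finishedTask{
		State:                   TaskStateSucceeded,
		FinishedAt:              time.Now().Add(-15 * time.Minute),
		TTLSecondsAfterFinished: &ttl,
	}
	fmt.Println("garbage collect:", expired(t, time.Now()))
}
```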
Recovery from permanent quorum loss involves two phases - identification and recovery - both of which are done manually today. This proposal intends to automate the latter. Recovery today is a multi-step process and needs to be performed carefully by a human operator. Automating these steps would be prudent, to make it quicker and error-free. The identification of the permanent quorum loss would remain a manual process, requiring a human operator to investigate and confirm that there is indeed a permanent quorum loss with no possibility of auto-healing.
We do not need any config for this task. When creating an instance of EtcdOperatorTask for this scenario, .spec.config will be left unset.
- There should be a quorum loss in a multi-member etcd cluster. For a single-member etcd cluster, invoking this task is unnecessary as the restoration of the single member is automatically handled by the backup-restore process.
- There should not already be a permanent-quorum-loss-recovery-task running for the same etcd cluster.
Etcd-druid provides a configurable etcd-events-threshold flag. When this threshold is breached, a snapshot compaction is triggered for the etcd cluster. However, there are scenarios where an ad-hoc snapshot compaction may be required.
- If an operator anticipates a scenario of permanent quorum loss, they can trigger an on-demand snapshot compaction to create a compacted full-snapshot. This can potentially reduce the recovery time from a permanent quorum loss.
- As an additional benefit, a human operator can leverage the current implementation of snapshot compaction, which internally triggers restoration. Hence, by initiating an on-demand snapshot compaction task, the operator can verify the integrity of etcd cluster backups, particularly in cases of potential backup corruption or re-encryption. The success or failure of this snapshot compaction can offer valuable insights into these scenarios.
We do not need any config for this task. When creating an instance of EtcdOperatorTask for this scenario, .spec.config will be left unset.
- There should not be an on-demand snapshot compaction task already running for the same etcd cluster.
Note: on-demand snapshot compaction runs as a separate job in a separate pod, which interacts with the backup bucket and not the etcd cluster itself, hence it doesn't depend on the health of etcd cluster members.
The Etcd custom resource provides the ability to set FullSnapshotSchedule, which currently defaults to once every 24 hrs. DeltaSnapshotPeriod is also configurable and defines the duration after which a delta snapshot will be taken.
If a human operator does not wish to wait for the scheduled full/delta snapshot, they can trigger an on-demand (out-of-schedule) full/delta snapshot on the etcd cluster, which will be taken by the leading-backup-sidecar.
- An on-demand full snapshot can be triggered if a scheduled snapshot fails for any reason.
- Gardener Shoot Hibernation: Every etcd cluster incurs an inherent cost of preserving the volumes even when a Gardener shoot control plane is scaled down, i.e. the shoot is in a hibernated state. However, it is possible to save on hyperscaler costs by invoking this task to take a full snapshot before scaling down the etcd cluster, and deleting the etcd data volumes afterwards.
- Gardener Control Plane Migration: In Gardener, a cluster control plane can be moved from one seed cluster to another. This process currently requires the etcd data to be replicated on the target cluster, so a full snapshot of the etcd cluster in the source seed before the migration would allow for faster restoration of the etcd cluster in the target seed.
// SnapshotType can be full or delta snapshot.
type SnapshotType string

const (
	SnapshotTypeFull  SnapshotType = "full"
	SnapshotTypeDelta SnapshotType = "delta"
)

type OnDemandSnapshotTaskConfig struct {
	// Type of on-demand snapshot.
	Type SnapshotType `json:"type"`
}
spec:
  config: |
    type: <type of on-demand snapshot>
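Since the config arrives as a string, the controller has to decode and validate it before acting on it. The sketch below shows one way this might be done for the snapshot task using only the standard library. It assumes the config string is written as JSON (which is also valid YAML); decoding the block-style YAML shown above would need a YAML library instead. `parseSnapshotConfig` is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// SnapshotType can be full or delta snapshot.
type SnapshotType string

const (
	SnapshotTypeFull  SnapshotType = "full"
	SnapshotTypeDelta SnapshotType = "delta"
)

// OnDemandSnapshotTaskConfig mirrors the proposed task config.
type OnDemandSnapshotTaskConfig struct {
	Type SnapshotType `json:"type"`
}

// parseSnapshotConfig decodes the raw config string, rejecting unknown
// fields and snapshot types other than "full" or "delta".
func parseSnapshotConfig(raw string) (OnDemandSnapshotTaskConfig, error) {
	var cfg OnDemandSnapshotTaskConfig
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&cfg); err != nil {
		return cfg, fmt.Errorf("invalid snapshot task config: %w", err)
	}
	switch cfg.Type {
	case SnapshotTypeFull, SnapshotTypeDelta:
		return cfg, nil
	default:
		return cfg, fmt.Errorf("unsupported snapshot type %q", cfg.Type)
	}
}

func main() {
	cfg, err := parseSnapshotConfig(`{"type": "full"}`)
	if err != nil {
		fmt.Println("rejecting task:", err)
		return
	}
	fmt.Println("snapshot type:", cfg.Type)
}
```

Rejecting unknown fields is a design choice in this sketch: it surfaces typos in the config as an error instead of silently ignoring them.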
- Etcd cluster should have a quorum.
- There should not already be an on-demand snapshot task running with the same SnapshotType for the same etcd cluster.
An operator can trigger on-demand maintenance of an etcd cluster, which includes operations such as etcd compaction and etcd defragmentation.
- If an etcd cluster is heavily loaded, causing performance degradation, and the operator does not want to wait for the scheduled maintenance window, then an on-demand maintenance task can be triggered which will invoke etcd compaction, etcd defragmentation etc. on the target etcd cluster. This discards superseded revisions and reclaims fragmented storage space, thus improving cluster performance.
type OnDemandMaintenanceTaskConfig struct {
	// MaintenanceType defines the maintenance operations that need to be performed on the etcd cluster.
	MaintenanceType maintenanceOps `json:"maintenanceType"`
}

type maintenanceOps struct {
	// EtcdCompaction if set to true will trigger an etcd compaction on the target etcd.
	// +optional
	EtcdCompaction bool `json:"etcdCompaction,omitempty"`
	// EtcdDefragmentation if set to true will trigger an etcd defragmentation on the target etcd.
	// +optional
	EtcdDefragmentation bool `json:"etcdDefragmentation,omitempty"`
}
spec:
  config: |
    maintenanceType:
      etcdCompaction: <true/false>
      etcdDefragmentation: <true/false>
- Etcd cluster should have a quorum.
- There should not already be a duplicate task running with the same maintenanceType.
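One possible reading of the duplicate-task pre-condition is sketched below: a request counts as a duplicate when a running maintenance task asks for exactly the same set of operations. This interpretation, the helper names `isDuplicate` and `isNoOp`, and the extra check that rejects a config requesting no operation at all are assumptions, not part of the proposal.

```go
package main

import "fmt"

// maintenanceOps mirrors the proposed maintenance operation flags.
type maintenanceOps struct {
	EtcdCompaction      bool
	EtcdDefragmentation bool
}

// isNoOp reports whether the config requests no maintenance operation.
func isNoOp(ops maintenanceOps) bool {
	return !ops.EtcdCompaction && !ops.EtcdDefragmentation
}

// isDuplicate reports whether a running task already requests exactly the
// same set of maintenance operations.
func isDuplicate(requested maintenanceOps, running []maintenanceOps) bool {
	for _, r := range running {
		if r == requested {
			return true
		}
	}
	return false
}

func main() {
	running := []maintenanceOps{{EtcdCompaction: true}}
	requested := maintenanceOps{EtcdCompaction: true, EtcdDefragmentation: true}
	fmt.Println("no-op:", isNoOp(requested))
	fmt.Println("duplicate:", isDuplicate(requested, running))
}
```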
Copy the backups (full and delta snapshots) of an etcd cluster from one object store (source) to another object store (target).
- In Gardener, the Control Plane Migration process utilizes the copy-backups task. This task is responsible for copying backups from one object store to another, typically located in different regions.
// EtcdCopyBackupsTaskConfig defines the parameters for the copy backups task.
type EtcdCopyBackupsTaskConfig struct {
	// SourceStore defines the specification of the source object store provider.
	SourceStore StoreSpec `json:"sourceStore"`
	// TargetStore defines the specification of the target object store provider for storing backups.
	TargetStore StoreSpec `json:"targetStore"`
	// MaxBackupAge is the maximum age in days that a backup must have in order to be copied.
	// By default all backups will be copied.
	// +optional
	MaxBackupAge *uint32 `json:"maxBackupAge,omitempty"`
	// MaxBackups is the maximum number of backups that will be copied starting with the most recent ones.
	// +optional
	MaxBackups *uint32 `json:"maxBackups,omitempty"`
}
spec:
  config: |
    sourceStore: <source object store specification>
    targetStore: <target object store specification>
    maxBackupAge: <maximum age in days that a backup must have in order to be copied>
    maxBackups: <maximum no. of backups that will be copied>
Note: For detailed object store specification please refer here
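To make the effect of maxBackupAge and maxBackups concrete, the sketch below filters a list of backups the way the field descriptions suggest: backups older than the age limit are skipped, and at most maxBackups entries are kept, starting with the most recent. The `backup` type and `selectBackups` function are hypothetical, and the sketch treats every backup as an independent item, ignoring any relationship between full and delta snapshots.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// backup is a hypothetical, simplified description of a stored snapshot.
type backup struct {
	Name      string
	CreatedAt time.Time
}

// selectBackups returns the backups to copy, newest first. A nil limit
// means "no limit", matching the optional config fields.
func selectBackups(all []backup, maxAgeDays, maxBackups *uint32, now time.Time) []backup {
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]backup(nil), all...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].CreatedAt.After(sorted[j].CreatedAt)
	})

	var selected []backup
	for _, b := range sorted {
		if maxBackups != nil && uint32(len(selected)) >= *maxBackups {
			break
		}
		if maxAgeDays != nil && now.Sub(b.CreatedAt) > time.Duration(*maxAgeDays)*24*time.Hour {
			continue
		}
		selected = append(selected, b)
	}
	return selected
}

func main() {
	now := time.Now()
	all := []backup{
		{Name: "full-old", CreatedAt: now.Add(-10 * 24 * time.Hour)},
		{Name: "full-recent", CreatedAt: now.Add(-24 * time.Hour)},
		{Name: "delta-recent", CreatedAt: now.Add(-time.Hour)},
	}
	maxAge := uint32(7)
	for _, b := range selectBackups(all, &maxAge, nil, now) {
		fmt.Println(b.Name)
	}
}
```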
- There should not already be a copy-backups task running.
Note: copy-backups-task runs as a separate job, and it operates only on the backup bucket, hence it doesn't depend on the health of etcd cluster members.
Note: copy-backups-task has already been implemented and is currently being used in Control Plane Migration, but copy-backups-task will be harmonized with the EtcdOperatorTask custom resource.
The authors propose to introduce the following metrics:
- etcddruid_operator_task_duration_seconds: Histogram which captures the runtime for each etcd operator task. Labels:
  - Key: type, Value: all supported tasks
  - Key: state, Value: One-Of {failed, succeeded, rejected}
  - Key: etcd, Value: name of the target etcd resource
  - Key: etcd_namespace, Value: namespace of the target etcd resource
- etcddruid_operator_tasks_total: Counter which counts the number of etcd operator tasks. Labels:
  - Key: type, Value: all supported tasks
  - Key: state, Value: One-Of {failed, succeeded, rejected}
  - Key: etcd, Value: name of the target etcd resource
  - Key: etcd_namespace, Value: namespace of the target etcd resource