Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Custom Health Checks for Kustomization using Common Expression Language(CEL) #4528

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
330 changes: 330 additions & 0 deletions rfcs/0000-custom-health-checks/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,330 @@
# RFC-0000 Custom Health Checks for Kustomization using Common Expression Language(CEL)

**Status:** provisional

**Creation date:** 2024-01-05

**Last update:** 2024-01-05

## Summary

This RFC proposes to support customization of the status readers in `Kustomizations`
during the `healthCheck` phase for custom resources. The user will be able to declare
the needed `conditions` in order to compute a custom resource status.
In order to provide flexibility, we propose to use `CEL` expressions to declare
the expected conditions and their status.
This will introduce a new field `customHealthChecks` in the `Kustomization` CRD
which will be a list of `CustomHealthCheck` objects.

## Motivation

Flux uses the `Kstatus` library during the `healthCheck` phase to compute owned
resources status. This works just fine for all standard resources and custom resources
that comply with `Kstatus` interfaces.

In the current Kustomization implementation, we have addressed such a problem for
kubernetes Jobs. We have implemented a `customJobStatusReader` that computes the
status of a Job based on a defined set of conditions. This is a good solution for
Jobs, but it is not generic and thus not applicable to other custom resources.

Another use case is relying on non-standard `conditions` to compute the status of
a custom resource. For example, we might want to compute the status of a custom
resource based on a condtion other then `Ready`. This is the case for `Resources`
that do intermediate patching like `Certificate` where you should look at the `Issued`
condition to know if the certificate has been issued or not before looking at the
`Ready` condition.

In order to provide a generic solution for custom resources, that would not imply
writing a custom status reader for each new custom resource, we need to provide a
way for the user to express the `conditions` that need to be met in order to compute
the status of a given custom resource. And we need to do this in a way that is
flexible enough to cover all possible use cases, without having to change `Flux`
source code for each new use case.

### Goals

- provide a generic solution for user to customize the health check of custom resources
- support non-standard resources in `kustomize-controller`

### Non-Goals

- We do not plan to support custom `healthChecks` for core resources.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why this restriction ?

It seems perhaps arbitrary to not allow custom heathChecks for core resources. For instance for a Deployment, depending on the situation we might sometimes want ready to mean "all pods are ready", or "at least one pod is ready", or "at least minAvailable pods are ready".

Copy link
Member

@stefanprodan stefanprodan Jan 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for this is that kstatus covers all the builtin types, excluding them on a per kind bases is not possible, so we would need to drop them all and make users write hundreds of lines of CEL.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seen from a user standpoint it feel like this might be implemented as "use kstatus for kind XXX unless CustomHealthChecksExprs defines something for XXX".

Looking closer, I understand that the proposal is to reuse the kstatus statusreaders extension point.

However, after looking at kstatus code, I was wondering if the built-in statusreaders couldn't be all dropped and then the ones not covered by any CustomHealthChecksExprs would be added back (NewReplicaSetStatusReader/NewDeploymentResourceReader/NewStatefulSetResourceReader).

(Of course, don't delay this RFC just based on this topic, I haven't digged enough to have any idea of what is reasonably feasible with kstatus.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scope of this RFC is to allow defining health checks for custom resources. Once this is implemented, we could consider expanding CEL to native resources in a followup RFC, if it deems to be feasible and if Flux users will ask for it. So far, no one asked for more than what kstatus does for builtin kinds.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To expand on Stefan answer, the default poller uses a slice of readers:

  statusReaders = append(statusReaders,
    deploymentStatusReader,
    statefulSetStatusReader,
    replicaSetStatusReader,
    defaultStatusReader,
  )

There is a Support() method that will be used to find a reader that support a given objet Kind. The defaultStatusReader is the one that is used for all custom resources. It is able to compute the status if the CR is compliant with Kstatus. The actual proposal is to expand this list. So in order to support native kinds we would need to remove the other Readers.

Another thing to consider is the SSA. Objects are applied in stages, and we perform a normalization of all native kinds mainly working around various upstream Kubernetes API bugs. This implies that we expect a certain behaviour. Also normalization could invalidate a CEL expression as we pass the whole object to the evaluator.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to keeping the thing simple

my ideas of refinements may prove relevant later... or not ;)


## Proposal

### Introduce a new field `CustomHealthChecksExprs` in the `Kustomization` CRD

The `CustomHealthChecksExprs` field will be a list of `CustomHealthCheck` objects.
Each `CustomHealthChecksExprs` object will have a `apiVersion`, `kind`, `inProgress`,
`failed` and `current` fields.

To give an example, here is how we would declare a custom health check for a `Certificate`
resource:

```yaml
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: app-certificate
namespace: cert-manager
spec:
commonName: cert-manager-tls
dnsNames:
- app.ns.svc.cluster.local
ipAddresses:
- x.x.x.x
isCA: true
issuerRef:
group: cert-manager.io
kind: ClusterIssuer
name: app-issuer
privateKey:
algorithm: RSA
encoding: PKCS1
size: 2048
secretName: app-tls-certs
subject:
organizations:
- example.com
```

This `Certificate` resource will transition through the following `conditions`:
`Issuing` and `Ready`.

In order to compute the status of this resource, we need to look at both the `Issuing`
and `Ready` conditions.

The resulting `Kustomization` object will look like this:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
name: application-kustomization
spec:
force: false
interval: 5m0s
path: ./overlays/application
prune: false
sourceRef:
kind: GitRepository
name: application-git
healthChecks:
- apiVersion: cert-manager.io/v1
kind: Certificate
name: service-certificate
namespace: cert-manager
- apiVersion: apps/v1
kind: Deployment
name: app
namespace: app
customHealthChecksExprs:
- apiVersion: cert-manager.io/v1
kind: Certificate
inProgress: "status.conditions.filter(e, e.type == 'Issuing').all(e, e.observedGeneration == metadata.generation && e.status == 'True')"
failed: "status.conditions.filter(e, e.type == 'Ready').all(e, e.observedGeneration == metadata.generation && e.status == 'False')"
current: "status.conditions.filter(e, e.type == 'Ready').all(e, e.observedGeneration == metadata.generation && e.status == 'True')"
```

The `HealthChecks` field still contains the objects that should be included in
the health assessment. The `CustomHealthChecksExprs` field will be used to declare
the `conditions` that need to be met in order to compute the status of the custom resource.

Note that all core resources are discarded from the `CustomHealthChecksExprs` field.


#### Provide an evaluator for `CEL` expressions for users

We will provide a CEL environment that can be used by the user to evaluate `CEL`
expressions. Users will use it to test their expressions before applying them to
their `Kustomization` object.

```shell
$ flux eval --api-version cert-manager.io/v1 --kind Certificate --in-progress "status.conditions.filter(e, e.type == 'Issuing').all(e, e.observedGeneration == metadata.generation && e.status == 'True')" --failed "status.conditions.filter(e, e.type == 'Ready').all(e, e.observedGeneration == metadata.generation && e.status == 'False')" --current "status.conditions.filter(e, e.type == 'Ready').all(e, e.observedGeneration == metadata.generation && e.status == 'True')" --file ./custom_resource.yaml
```

### User Stories

#### Configure custom health checks for a custom resource

> As a user of Flux, I want to be able to specify custom health checks for my
> custom resources, so that I can have more control over the status of my
> resources.

#### Enable health checks support in Flux for non-standard resources

> As a user of Flux, I want to be able to use the health check feature for
> non-standard resources, so that I can have more control over the status of my
> resources.

### Alternatives

We need an expression language that is flexible enough to cover all possible use
cases, without having to change `Flux` source code for each new use case.

On alternative that have been considered is to use `cuelang` instead of `CEL`.
`cuelang` is a more powerful expression language, but it is also more complex and
requires more work to integrate with `Flux`. it also does not have any support in
`Kubernetes` yet while `CEL` is already used in `Kubernetes` and libraries are
available to use it.

## Design Details

### Introduce a new field `CustomHealthChecksExprs` in the `Kustomization` CRD

The `api/v1/kustomization_types.go` file will be updated to add the `CustomHealthChecksExprs`
field to the `KustomizationSpec` struct.

```go
type KustomizationSpec struct {
...
// A list of resources to be included in the health assessment.
// +optional
HealthChecks []meta.NamespacedObjectKindReference `json:"healthChecks,omitempty"`

// A list of custom health checks expressed as CEL expressions.
// The CEL expression must evaluate to a boolean value.
// +optional
CustomHealthChecksExprs []CustomHealthCheckExprs `json:"customHealthChecksExprs,omitempty"`
...
}

// CustomHealthCheckExprs defines the CEL expressions for custom health checks.
// The CEL expressions must evaluate to a boolean value. The expressions are used
// to determine the status of the custom resource.
type CustomHealthCheckExprs struct {
// apiVersion of the custom health check.
// +required
APIVersion string `json:"apiVersion"`
// Kind of the custom health check.
// +required
Kind string `json:"kind"`
// InProgress is the CEL expression that verifies that the status
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking that it would be nice to here have a Name field, to allow definitng a healthCheck applicable to a specific resource, with a finer grained granularity than the CRD.

The kind of scenario where this would be relevant would be situations where a Kustomization would produce multiple resources of the same kind but the right healthcheck to do would be different among those resources.

An optional Name field would be interpreted as: the healthCheck to use for a resource of Kind Y and Name X, would be the item in CustomHealthCheckExprs that specifies Kind Y and Name X if any, or if there is none an item for Kind Y and Name unspecified.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The health check expression is common for all instances of a Kind version, as it targets the status conditions which are the same for all. Can you give an example where you would write different CEL expressions for the same Kind?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will gladly admit not having any exact use case in mind.

The idea expressed here might be one, though, but whether or not it's relevant is certainly debatable. The intersection of "I want this Deployment to be healhcheck'd in a specific way" and "my Kustomization produces multiple Deploments" is perhaps empty.

From a user standpoint, it simply feels like "if this is going to be extensible, why not make it fully flexible". But from an implementation and code maintenance standpoint, I would understand that this idea might be perceived as over-engineering.

I just wanted to share this to see how relevant and implementable the idea might be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you group objects based on expected behaviour? Or chain Kustomizations?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the slow feedback

yes @souleb there would be workarounds

and this is overall a very useful evolution to what we have today which works pretty well for many cases already, so the wise path is of course, to keep it simple

// of the custom resource is in progress.
// +optional
InProgress string `json:"inProgress"`
// Failed is the CEL expression that verifies that the status
// of the custom resource is failed.
// +optional
Failed string `json:"failed"`
// Current is the CEL expression that verifies that the status
// of the custom resource is ready.
// +optional
Current string `json:"current"`
}
```

### Introduce a generic custom status reader

Introduce a generic custom status reader that will be able to compute the status of
a custom resource based on a list of `conditions` that need to be met.

```go
import (
"k8s.io/apimachinery/pkg/runtime/schema"
"sigs.k8s.io/cli-utils/pkg/kstatus/polling/engine"
"sigs.k8s.io/cli-utils/pkg/kstatus/polling/event"
kstatusreaders "sigs.k8s.io/cli-utils/pkg/kstatus/polling/statusreaders"
)
type customGenericStatusReader struct {
genericStatusReader engine.StatusReader
gvk schema.GroupVersionKind
}

func NewCustomGenericStatusReader(mapper meta.RESTMapper, gvk schema.GroupVersionKind, exprs map[string]string) engine.StatusReader {
genericStatusReader := kstatusreaders.NewGenericStatusReader(mapper, genericConditions(gvk.Kind, exprs))
return &customJobStatusReader{
genericStatusReader: genericStatusReader,
gvk: gvk,
}
}

func (g *customGenericStatusReader) Supports(gk schema.GroupKind) bool {
return gk == g.gvk.GroupKind()
}

func (g *customGenericStatusReader) ReadStatus(ctx context.Context, reader engine.ClusterReader, resource object.ObjMetadata) (*event.ResourceStatus, error) {
return g.genericStatusReader.ReadStatus(ctx, reader, resource)
}

func (g *customGenericStatusReader) ReadStatusForObject(ctx context.Context, reader engine.ClusterReader, resource *unstructured.Unstructured) (*event.ResourceStatus, error) {
return g.genericStatusReader.ReadStatusForObject(ctx, reader, resource)
}
```

A `genericConditions` closure will takes a `kind` and a map of `CEL` expressions as parameters
and returns a function that takes an `Unstructured` object and returns a `status.Result` object.

````go
import (
"sigs.k8s.io/cli-utils/pkg/kstatus/status"
"github.com/fluxcd/pkg/runtime/cel"
"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

func genericConditions(kind string, exprs map[string]string) func(u *unstructured.Unstructured) (*status.Result, error) {
return func(u *unstructured.Unstructured) (*status.Result, error) {
obj := u.UnstructuredContent()

for statusKey, expr := range exprs {
// Use CEL to evaluate the expression
result, err := cel.ProcessExpr(expr, obj)
if err != nil {
return nil, err
}
switch statusKey {
case status.CurrentStatus.String():
// If the expression evaluates to true, we return the current status
case status.FailedStatus.String():
// If the expression evaluates to true, we return the failed status
case status.InProgressStatus.String():
// If the expression evaluates to true, we return the reconciling status
}
}
}
}
````

The generic status reader will be used by the `statusPoller` provided to the `reconciler`
to compute the status of the resources for the registered custom resources `kind`.

We will provide a `CEL` environment that will use the Kubernetes CEL library to
evaluate the `CEL` expressions.

### StatusPoller configuration

The `reconciler` holds a `statusPoller` that is used to compute the status of the
resources during the `healthCheck` phase of the reconciliation. The `statusPoller`
is configured with a list of `statusReaders` that are used to compute the status
of the resources.

The `statusPoller` is not configurable once instantiated. This means
that we cannot add new `statusReaders` to the `statusPoller` once it is created.
This is a problem for custom resources because we need to be able to add new
`statusReaders` for each new custom resource that is declared in the `Kustomization`
object's `customHealthChecksExprs` field. Fortunately, the `cli-utils` library has
been forked in the `fluxcd` organization and we can make a change to the `statusPoller`
exposed the `statusReaders` field so that we can add new `statusReaders` to it.


The `statusPoller` used by `kustomize-controller` will be updated for every reconciliation
in order to add new polling options for custom resources that have a `CustomHealthChecksExprs`
field defined in their `Kustomization` object.

### K8s CEL Library

The `K8s CEL Library` is a library that provides `CEL` functions to help in evaluating
`CEL` expressions on `Kubernetes` objects.

Unfortunately, this means that we will need to follow the `K8s CEL Library` releases
in order to make sure that we are using the same version of the `CEL` library as
`Kubernetes`. As of the time of writing this RFC, the `K8s CEL Library` is using the
`v0.16.1` version of the `CEL` library while the latest version of the `CEL` library
is `v0.18.2`. This means that we will need to use the `v0.16.1` version of the `CEL`
library in order to be able to use the `K8s CEL Library`.


## Implementation History

See current POC implementation under https://github.com/souleb/kustomize-controller/tree/cel-based-custom-health