From 2be03c0d606d82365b8eea19d151253001481f10 Mon Sep 17 00:00:00 2001
From: Abu Kashem
Date: Tue, 5 Nov 2024 10:02:34 -0500
Subject: [PATCH] update kep-3926, how we address client cache inconsistency issue

---
 .../README.md | 125 +++++++++++++++++-
 1 file changed, 124 insertions(+), 1 deletion(-)

diff --git a/keps/sig-auth/3926-handling-undecryptable-resources/README.md b/keps/sig-auth/3926-handling-undecryptable-resources/README.md
index c2e16960fcf..8e98a6511c4 100644
--- a/keps/sig-auth/3926-handling-undecryptable-resources/README.md
+++ b/keps/sig-auth/3926-handling-undecryptable-resources/README.md
@@ -93,6 +93,7 @@ tags, and then generate with `hack/update-toc.sh`.
   - [Retrieving a failing resource](#retrieving-a-failing-resource)
   - [Deleting a failing resource](#deleting-a-failing-resource)
   - [Protecting unconditional deletion](#protecting-unconditional-deletion)
+  - [Propagating with WATCH](#propagating-with-watch)
   - [Test Plan](#test-plan)
     - [Prerequisite testing updates](#prerequisite-testing-updates)
     - [Unit tests](#unit-tests)
@@ -405,7 +406,118 @@ deletions should therefore have their own extra admission.

The unconditional deletion admission:

1. checks if a "delete" request contains the `IgnoreStoreReadErrorWithClusterBreakingPotential` option
-2. if it does, it checks the RBAC of the request's user for the `delete-ignore-read-errors` verb of the given resource
+2. if it does, it checks the RBAC of the request's user for the `unsafe-delete-ignore-read-errors` verb of the given resource

#### Propagating with WATCH

When a corrupt object is deleted with `unsafe-delete-ignore-read-errors`, a
watcher of this resource will not be able to transform its old data from the
storage, or decode it into an object [1]. This causes the watch to emit an
`ERROR` event.
A client backed by an informer already has the object in its cache; since the
client never receives a `DELETED` event, the object remains in the lister
cache. This creates an inconsistency: retrieving the object from the cache
yields the object, but fetching it from the storage returns a `corrupt object`
error.

There are a few factors we need to consider to understand how the client is impacted:
- a) whether the client has been updated yet
- b) whether the watch cache is enabled in the kube-apiserver

We can review the following options:

*A: Send an `ERROR` event with a `metav1.Status` object*:
When the watcher receives a `DELETED` event from etcd and fails to transform
or decode the old data associated with the deleted resource, it returns an
`ERROR` event carrying a `metav1.Status` object with the reason
`StatusReasonStoreReadError` (serialized as `StorageReadError`):
```
{
  "type": "ERROR",
  "object": {
    "kind": "Status",
    "apiVersion": "v1",
    "status": "Failure",
    "message": "corrupt object has been deleted - data from the storage is not transformable - ...",
    "reason": "StorageReadError"
  }
}
```

Upon receiving this error, the existing reflector implementation ends the
current watch and restarts with a new list+watch, which results in the old
cache being replaced. A client that watches the resource directly can react
the same way, as sketched below.

Pros:
- existing clients will recover without any code change
- no changes to the watch API

Cons:
- relisting is expensive, but it is limited to the client(s) watching the
  resource(s) being unsafe-deleted.
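
The following is a minimal sketch, not part of the proposal, of how a hand-rolled watch consumer (one that does not use the reflector/informer machinery) could react to the event described in option A. The `handleWatch` helper is illustrative, and the `"StorageReadError"` reason string is assumed to match the `StatusReasonStoreReadError` reason proposed above.

```go
package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
)

// handleWatch drains events from w and reports whether the caller should drop
// its local cache and start a fresh list+watch because a corrupt object was
// unsafe-deleted out from under it.
func handleWatch(ctx context.Context, w watch.Interface) (relist bool, err error) {
	defer w.Stop()
	for {
		select {
		case <-ctx.Done():
			return false, ctx.Err()
		case ev, ok := <-w.ResultChan():
			if !ok {
				// the server closed the watch; the caller decides how to resume
				return false, nil
			}
			if ev.Type != watch.Error {
				// ADDED/MODIFIED/DELETED/BOOKMARK events are handled as usual
				continue
			}
			status, isStatus := ev.Object.(*metav1.Status)
			// Assumption: this reason value is what the server sends when the old
			// value of a deleted key could not be transformed or decoded.
			if isStatus && status.Reason == metav1.StatusReason("StorageReadError") {
				fmt.Printf("corrupt object was unsafe-deleted: %s\n", status.Message)
				return true, nil // the stale cache entry must be purged via a relist
			}
			// any other error is surfaced to the caller
			return false, fmt.Errorf("watch error: %v", ev.Object)
		}
	}
}
```

The caller would then issue a fresh list to rebuild its cache and start a new watch from the returned resource version, which is exactly what the reflector does today when a watch ends with an error it cannot interpret.
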
*B: Send an `ERROR` event with a partial object of the type being watched*:
We will change the watch to emit an `ERROR` event with a partial object of the
type being watched that carries the following information:
- Namespace/Name: identifies the object
- ResourceVersion: the revision of the key-value store after the delete operation
- Annotation: `"kubernetes.io/unsafe-delete-ignore-read-errors": "true"` to indicate that the
  object associated with the `DELETED` event from the storage was corrupt

We will change the reflector to interpret this error and do the following:
- delete the object from its store so clients are in sync with the current state
- set the `LastSyncResourceVersion` to the `ResourceVersion` from the partial
  object, which advances it to the revision of the storage after the delete operation

This is what it would look like on the wire:
```
{
  "type": "ERROR",
  "object": {
    "metadata": {
      "name": "foo",
      "namespace": "test-ns",
      "resourceVersion": "3",
      "annotations": {
        "kubernetes.io/unsafe-delete-ignore-read-errors": "true"
      }
    }
  }
}
```

Pros:
- a new client recovers and catches up with the store without a restart

Cons:
- an old client behaves as it does today

[1] https://github.com/kubernetes/kubernetes/blob/2bb886ce2a6ae5a83031efd225e8a17dd7ecba03/staging/src/k8s.io/apiserver/pkg/storage/etcd3/watcher.go#L681-L697
[2] https://github.com/kubernetes/kubernetes/pull/127513/commits/46c3fdc4eeda8d3ffb1839fff196c5fece75066d#diff-9ccdf713e010f73dbebd01e936cb0077fc63e4f5ab941d865ded42da219d84ec

*C: Send a `DELETED` event with a partial object of the type being watched*:
Today the semantics of a `DELETED` event are:
- 1) the `Object` associated with the event is the whole object as it was
- 2) the `ResourceVersion` of the object is the revision of the storage
  after the delete operation

Upon failure to transform or decode the object data, if we return a `DELETED`
event with a partial object similar to option B, we respect 2) but not 1).
It's worth noting that a client can define cache keys based on fields that are
not limited to `Namespace` or `Name`; for example, someone could derive a key
from the value of a label or an annotation.

Pros:
- existing clients work as expected without any code change

Cons:
- a partial object can cause a client that uses custom keys to fail

*D: Introduce a new watch event*:
This seems more intrusive than necessary, so we will not go into detail here.

Add an integration test to show that an informer is consistent with the state
in the storage after the corrupt resource is unsafe-deleted.

### Test Plan
@@ -491,6 +603,17 @@ At this time only integration tests are considered.

### Graduation Criteria

#### Beta

- API validation must be resilient to future additions to the API
- unfortunately, dynamic encryption config reload takes about 1m, so we can't use `wait.ForeverTestTimeout`
  in the integration test yet; a TODO has been left to investigate and improve the reload time
- Allow `DryRun` together with `ignoreStoreReadErrorWithClusterBreakingPotential`:
  we will check whether the user has permission to perform `unsafe-delete-ignore-read-errors`
  on the resource, as sketched below
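
As a rough illustration of that permission check, the sketch below runs the request's user attributes through the apiserver's authorizer with the new verb. The helper name `authorizeUnsafeDelete` and its wiring are assumptions for illustration only; the actual check lives in the unconditional-deletion admission described earlier.

```go
package example

import (
	"context"
	"fmt"

	"k8s.io/apiserver/pkg/authentication/user"
	"k8s.io/apiserver/pkg/authorization/authorizer"
)

// authorizeUnsafeDelete is a hypothetical helper: it returns an error unless
// the requesting user is allowed to perform the unsafe-delete-ignore-read-errors
// verb on the target resource.
func authorizeUnsafeDelete(ctx context.Context, authz authorizer.Authorizer, u user.Info,
	namespace, apiGroup, resource, name string) error {
	attrs := authorizer.AttributesRecord{
		User:            u,
		Verb:            "unsafe-delete-ignore-read-errors",
		Namespace:       namespace,
		APIGroup:        apiGroup,
		Resource:        resource,
		Name:            name,
		ResourceRequest: true,
	}
	decision, reason, err := authz.Authorize(ctx, attrs)
	if err != nil {
		return fmt.Errorf("authorization check failed: %w", err)
	}
	if decision != authorizer.DecisionAllow {
		return fmt.Errorf("user %q is not permitted to unsafe-delete %s/%s: %s",
			u.GetName(), resource, name, reason)
	}
	return nil
}
```

With this in place, a cluster administrator would grant the `unsafe-delete-ignore-read-errors` verb on the target resource (for example, via an RBAC rule) only to the users who are allowed to perform such cluster-breaking deletions.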