Process events instantly and consistently, stop skipping the events due to "batching" #844
TL;DR: Process events/changes as soon as they arrive, with no artificial delays (not even 0.1s); stop skipping events under high load; prevent double-execution when 3rd parties also patch the resources.
Background
A long time ago, at the dawn of Kopf (5a0b2da, #42, #43), events arriving for a resource object were packed into "batches": only the last event that arrived within a 0.1-second window was processed; all preceding events were ignored.
Originally, this was done to address an issue in the official `kubernetes` client (which was later replaced by `pykube-ng`, which in turn was later replaced by raw HTTP queries via aiohttp).

Besides, though not documented, this short time window ensured consistency in some rare cases when a resource was patched by 3rd parties while being processed by Kopf: once Kopf performed its patch, it instantly got both events (from the 3rd party and from itself) and only processed the latest state, which contained its own annotations.
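For illustration, here is a minimal sketch of this kind of time-based batching (not Kopf's actual code; the names are made up for the sketch): only the last event that arrives within the window is ever processed.

```python
import asyncio

BATCH_WINDOW = 0.1  # seconds: the debounce window described above

async def process(event: dict) -> None:
    # Placeholder for the real per-event processing.
    meta = event.get("object", {}).get("metadata", {})
    print("processing", event.get("type"), meta.get("resourceVersion"))

async def batching_worker(queue: asyncio.Queue) -> None:
    """Process only the last event of each 0.1-second batch (illustrative only)."""
    while True:
        event = await queue.get()
        # Keep replacing the event while newer ones arrive within the window;
        # every event except the last one of the window is silently dropped.
        while True:
            try:
                event = await asyncio.wait_for(queue.get(), timeout=BATCH_WINDOW)
            except asyncio.TimeoutError:
                break
        await process(event)
```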
Problems
The time-based approach to consistency led to several negative effects, noticeable on slow networks between the operator and the apiservers (e.g. when operators run outside of the cluster) or under high load (e.g. with too many resources or too many modifications of the same resources).
`@on.event` handlers were not executed for valuable but intermediate events, because those were packed into batches and discarded in favour of the latest event only.

These effects were reported and investigated in #729, and were also directly or indirectly mentioned in #718, #732, #784 (comment). (This PR supersedes and closes #829.)
Besides, code-wise:
Double-processing
In the mentioned cases (slow networks and/or high load), the version patched by Kopf could arrive more than 0.1 seconds after the version patched by the 3rd party. As a result, no batch was formed and these events were processed separately. Since the intermediate 3rd-party version did not contain Kopf's annotations about the successful processing of the resource, the handlers were re-executed (double-executed). Only after Kopf's own patched version arrived did the operator go idle as expected.
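Conceptually, the re-execution decision can be illustrated with a simplified, hypothetical sketch (the annotation key is made up for this illustration and is not Kopf's real progress-storage format):

```python
# Hypothetical progress marker; Kopf's real storage format differs.
PROGRESS_ANNOTATION = "example.kopf.dev/handled"

def needs_handling(body: dict) -> bool:
    """Decide whether the observed state looks unhandled (illustrative only)."""
    annotations = body.get("metadata", {}).get("annotations", {})
    return PROGRESS_ANNOTATION not in annotations

# The 3rd-party-patched state arrives without Kopf's annotations, so the
# handlers are executed again (the double-execution described above).
third_party_state = {"metadata": {"annotations": {"some-3rd-party": "value"}}}
assert needs_handling(third_party_state)

# Kopf's own patched state carries the annotations, so the operator goes idle.
kopfs_own_state = {"metadata": {"annotations": {PROGRESS_ANNOTATION: "done"}}}
assert not needs_handling(kopfs_own_state)
```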
Here is how it happened on the timeline, visually, with an artificial delay of 3 seconds:
Solution
The proposed solution introduces a wait for consistency of the resource after it is patched: since the `PATCH` operation returns the patched resource, we can take its resource version and expect this version in the watch-stream. All states that arrive before the expected version are considered inconsistent and thus are not processed, at least not by the high-level state-dependent handlers.

This is how it looks on the timeline with the same artificial delay of 3 seconds:
In case the expected version does not arrive within a reasonable time window (5-10 seconds), assume that it will not arrive at all and reset the consistency waiting as if consistency were reached (even if it was not). This is a rare, mostly impossible case, and the timeout is needed only as a safeguard: it is better to double-process the resource and cause side-effects than to cease its processing forever.
Time-based batching is removed completely, as it is outdated and adds no benefit.
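A rough sketch of this consistency wait, assuming hypothetical `patch_resource()` and `watch_events()` callables in place of the real API calls (this is not the PR's actual code):

```python
import asyncio

CONSISTENCY_TIMEOUT = 10.0  # seconds: the safeguard window mentioned above

async def patch_and_wait_for_consistency(patch_resource, watch_events) -> None:
    """Patch the object, then treat the watch-stream as inconsistent until the
    resourceVersion returned by PATCH arrives, or until the timeout expires."""
    patched = await patch_resource()  # the PATCH response is the patched object
    expected = patched["metadata"]["resourceVersion"]  # an opaque string

    async def wait_for_expected_version() -> None:
        async for event in watch_events():
            seen = event["object"]["metadata"]["resourceVersion"]
            if seen == expected:
                return  # consistency reached: this state contains our own patch

    try:
        await asyncio.wait_for(wait_for_expected_version(), timeout=CONSISTENCY_TIMEOUT)
    except asyncio.TimeoutError:
        # Safeguard: assume consistency anyway; occasional double-processing is
        # preferable to ceasing the processing of this resource forever.
        pass
```

In this sketch, the resource version is treated as an opaque string and compared only for equality, so it does not rely on any ordering semantics of resource versions.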
User effects
As a result of this fix, all mentioned problems are addressed:
All events are now processed by `@on.event()`, indexed, and passed to daemons.

TODOs