From 708ae1b855a43d8a3c4953f92dc38a97b7cbb56f Mon Sep 17 00:00:00 2001
From: "Dr. Stefan Schimanski"
Date: Wed, 16 Aug 2023 06:57:12 -0500
Subject: [PATCH] WIP: docs: add a sharding overview

---
 docs/content/concepts/sharding.md | 204 ++++++++++++++++++++++++++++++
 1 file changed, 204 insertions(+)
 create mode 100644 docs/content/concepts/sharding.md

diff --git a/docs/content/concepts/sharding.md b/docs/content/concepts/sharding.md
new file mode 100644
index 00000000000..0eaaa6918db
--- /dev/null
+++ b/docs/content/concepts/sharding.md
@@ -0,0 +1,204 @@
---
description: >
  Horizontal scaling of kcp through sharding
---

# Shards

Every kcp shard hosts a set of logical clusters. A logical cluster is
identified by a globally unique identifier called a *cluster name*. A shard
serves its logical clusters under `/clusters/<cluster-name>`.

A kcp installation comprises a set of known shards.

# Root Logical Cluster

Every kcp installation has one logical cluster called `root`. The `root` logical
cluster holds administrative objects, among them the shard objects.

# Shard Objects

The set of shards in a kcp installation is defined by `Shard` objects in
`core.kcp.io/v1alpha1`.

A shard object specifies two network addresses: one for external access (usually
some worldwide load balancer) and one for direct shard-to-shard access.

# Logical Clusters and Workspace Paths

Logical clusters are defined through the existence of a `LogicalCluster` object
"in themselves", similar to a `.` directory defining the existence of a directory
in Unix.

Every logical cluster name `name` is itself a *logical cluster path*. Every
logical cluster is reachable through `/clusters/<path>` for each of its paths.

A `Workspace` in `tenancy.kcp.io/v1alpha1` of name `name` references a logical
cluster specified in `Workspace.spec.cluster`. If that workspace object resides
in a logical cluster reachable via a path `path`, the referenced logical cluster
can be reached via the path `path:name`.

# Canonical Paths

The longest path a logical cluster is reachable under is called the *canonical
path*. By default, all canonical paths start with `root`, i.e. they start in the
root logical cluster.

The logical cluster object is annotated with its canonical path as
`kcp.io/path: <canonical-path>`.

Additional subtrees of the workspace path hierarchy can be defined by creating
logical clusters with a `kcp.io/path` annotation *not* starting in `root`. E.g.
a home workspace hierarchy could start at `home:<user-name>`. There is no need
for the parent (`home` in this case) to exist.

# Front Proxy

A front-proxy is aware of all logical clusters, the shards they live on,
their canonical paths and all `Workspace`s. Non-canonical paths can be
reconstructed from the canonical path prefixes and the workspace names.

Requests to `/clusters/<path>` are forwarded to the owning shard via reverse
proxying.

A front-proxy in its most simplistic (toy) implementation watches all shards.
Clearly, that is neither feasible at scale nor good for availability. A
front-proxy could alternatively be backed by any kind of external database with
the right scalability and availability properties.

There can be one front-proxy in front of a kcp installation, or many, e.g. one
or multiple per region or cloud provider.

# Consistency Domain

Every logical cluster provides a Kubernetes-compatible API root endpoint under
`/clusters/<path>`, including its own discovery endpoint and its own set of
API groups and resources.
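For illustration, a standard Kubernetes client can be pointed at a single
logical cluster simply by using that cluster's path in the server URL. The
following Go sketch uses client-go for this; the front-proxy address
`kcp.example.com` and the workspace path `root:my-org:my-team` are made-up
placeholders, not values defined by this document.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load credentials from the usual kubeconfig location. In practice a kcp
	// kubeconfig already points its server URL at a workspace path.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// Address one logical cluster via its path (placeholder values).
	cfg.Host = "https://kcp.example.com/clusters/root:my-org:my-team"

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The logical cluster behaves like a regular Kubernetes API endpoint:
	// discovery, namespaces and resource versions are all scoped to it.
	cms, err := client.CoreV1().ConfigMaps("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, cm := range cms.Items {
		fmt.Println(cm.Name)
	}
}
```

The same pattern applies when talking directly to a shard's address instead of
going through a front-proxy.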
Resources under such an endpoint satisfy the same consistency properties as in
a full Kubernetes cluster, e.g. the semantics of the resource versions of one
resource match those of Kubernetes.

Across logical clusters, resource versions must be considered unrelated,
i.e. resource versions cannot be compared.

# Wildcard Requests

The only exception to the above rule are objects served under the per-shard
"wildcard endpoint"
`/clusters/*/apis/<group>/<version>/[namespaces/<namespace>/]<resource>:<identity>`.
It serves the objects of the given resource on that shard across logical
clusters. The annotation `kcp.io/cluster` tells the consumer which logical
cluster each object belongs to.

The wildcard endpoint is privileged (it requires `system:masters` group
membership). It is only accessible when talking directly to a shard, not
through a front-proxy.

Note: for unprivileged access, virtual apiservers can offer a highly secured
and filtered view, usually also per shard, e.g. for owners of APIs.

# Cross Logical Cluster References

Some objects reference other logical clusters or objects in other logical
clusters. These references can be by logical cluster name, by arbitrary logical
cluster path or by canonical path.

Referenced logical clusters must be assumed to live on another shard, maybe even
on a shard in a different region or cloud provider.

For scalability reasons, it is usually not adequate to drive controllers by
informers that span all shards.

In other words, logic involving cross-logical-cluster references has a higher
cost than in-logical-cluster references and must be carefully planned:

For example, during workspace creation and scheduling, the scheduler running on
the shard hosting the `Workspace` object will access another shard to create the
`LogicalCluster` object initially. It does that by choosing a random logical
cluster name (optimistically) and choosing the shard that name maps to (through
consistent hashing). It then tries to create the `LogicalCluster`. On conflict,
it checks whether the existing object belongs to the given `Workspace` object or
not. If not, another name and shard are chosen, until scheduling succeeds. During
initialization, the controller on the shard hosting the `Workspace` will keep
checking the logical cluster on the other shard, with some exponential backoff.
In other words, the shard hosting the `Workspace` does not continuously watch
the object on the other shard.
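The placement step can be sketched in Go as follows. This is a simplified
illustration rather than kcp's actual scheduler: `createLogicalCluster` stands
in for the cross-shard API call and its conflict handling, and the hash ring is
a toy consistent-hashing implementation without virtual nodes.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"hash/fnv"
	"sort"
)

// shardFor maps a logical cluster name onto one of the known shards using a
// simple hash ring: every shard gets a position on the ring, and a name is
// assigned to the first shard position at or after the name's own hash.
func shardFor(name string, shards []string) string {
	type point struct {
		pos   uint32
		shard string
	}
	ring := make([]point, 0, len(shards))
	for _, s := range shards {
		h := fnv.New32a()
		h.Write([]byte(s))
		ring = append(ring, point{pos: h.Sum32(), shard: s})
	}
	sort.Slice(ring, func(i, j int) bool { return ring[i].pos < ring[j].pos })

	h := fnv.New32a()
	h.Write([]byte(name))
	pos := h.Sum32()
	for _, p := range ring {
		if pos <= p.pos {
			return p.shard
		}
	}
	return ring[0].shard // wrap around the ring
}

// randomClusterName optimistically picks a random candidate cluster name.
func randomClusterName() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// createLogicalCluster is a stand-in: the real controller would create the
// LogicalCluster object on the target shard and, on a conflict, check whether
// the existing object already belongs to the Workspace being scheduled.
func createLogicalCluster(shard, name string) bool {
	return true // always succeeds in this sketch
}

func main() {
	shards := []string{"shard-alpha", "shard-bravo", "shard-charlie"}

	// Retry with fresh names until the LogicalCluster could be created on the
	// shard the name maps to.
	for {
		name := randomClusterName()
		shard := shardFor(name, shards)
		if createLogicalCluster(shard, name) {
			fmt.Printf("scheduled logical cluster %s onto %s\n", name, shard)
			return
		}
	}
}
```

Because candidate names are drawn from a large space, conflicts with foreign
`LogicalCluster`s are rare and the retry loop converges quickly.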
Another example is API binding, which works differently from workspace
scheduling: a binding controller running on the shard hosting the `APIBinding`
object is aware of all `APIExport`s in the kcp installation through caching
replication (see next section). What is special is that this controller has all
the information necessary to bind a new API and to keep bound APIs working even
if the shard of the `APIExport` is unavailable.

Note: it is usually a bad idea to create logic that depends on the parent
workspace. If such logic is desired, some kind of replication is required for
availability and scalability reasons. E.g. a child workspace must stay
operational even if the parent is not accessible.

# Cache Server Replication

The cache server is a special API server that can hold replicas of objects that
must be available globally in an eventually consistent way. E.g. `APIExport`s
and `APIResourceSchema`s are replicated that way and made available to the
corresponding controllers via informers.

The cache server holds objects by logical cluster, and it can hold objects from
many or all shards in a kcp installation, served through wildcard informers.
The resource versions of those objects have no meaning beyond driving the cache
informers running in the shards.

Cache servers can be 1:1 with shards, or there can be shared cache servers, e.g.
one per region.

Cache servers can form a cache hierarchy.

Controllers that make use of cached objects will usually have informers against
local objects and against the same objects in the cache server. If the former
returns a "NotFound" error, the controllers look the object up via the cache
informers.

The cache server technique is only useful for APIs whose object cardinality
across all shards does not go beyond the cardinality sensibly storable in a
kube-based apiserver.

Note that objects like `Workspace`s and `LogicalCluster`s do not fall into that
category. This means in particular that the canonical path of a logical cluster
cannot be derived from cached `LogicalCluster`s. Instead, the cached objects
must hold their own `kcp.io/path` annotation in order to be indexable by that
value. This is crucial for implementing cross-logical-cluster references by
canonical path.

Note: the `APIExport` example assumes that there are never more than e.g. 10,000
API exports in a kcp installation. If that is not an acceptable constraint,
another partitioning mechanism would be needed to keep the number of `APIExport`
objects per cache server below the critical number. E.g. there could be a cache
server per big tenant, holding only public exports and that tenant's internal
exports. A more complex caching hierarchy would make sure the right objects are
replicated, while the "really public" exports would remain a small number.

# Replication

Each shard pushes a restricted set of objects to the cache server. As cache
server replication is costly, this set is kept as minimal as possible. For
example, certain RBAC objects are replicated in case they are needed to
successfully authorize bindings of an API, or to use a workspace type.

By the nature of replication, objects in the cache server can be old or
incomplete. For instance, the non-existence of an object in the cache server
does not mean it does not exist on its respective shard: the replication could
just be delayed, or the object was not identified as worth replicating.

# Bootstrapping

A new shard starting up will run a number of standard controllers (e.g. for
workspaces, API bindings and more). These need a number of standard informers,
both watching objects locally on that shard through wildcard informers and
watching the corresponding cache server.

Wildcard informers require the `APIExport` identity. This identity varies per
installation as it is cryptographically created during creation of the
`APIExport`.

The core and tenancy APIs have their `APIExport`s in the root logical cluster.
A new shard will connect to that root logical cluster in order to extract the
identities for these APIs. It will cache them for later use in case of a network
partition or unavailability of the root shard. After retrieving these identities
the informers can be started.
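A minimal sketch of that identity lookup is shown below, using a dynamic client
against the root logical cluster. The API group `apis.kcp.io/v1alpha1`, the
export name `tenancy.kcp.io`, the `status.identityHash` field and the shard
address are assumptions of this sketch, not guarantees made by this document.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	// Point the client at the root logical cluster on the root shard.
	// Authentication and TLS settings are omitted for brevity.
	cfg := &rest.Config{Host: "https://root-shard.example.com/clusters/root"}

	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Read the APIExport for the tenancy APIs and extract its identity hash.
	// Group, export name and status field are assumptions of this sketch.
	exportGVR := schema.GroupVersionResource{
		Group:    "apis.kcp.io",
		Version:  "v1alpha1",
		Resource: "apiexports",
	}
	export, err := dyn.Resource(exportGVR).Get(context.TODO(), "tenancy.kcp.io", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	identity, _, _ := unstructured.NestedString(export.Object, "status", "identityHash")

	// The identity is appended to the resource when building the per-shard
	// wildcard endpoint for that API; a real shard would also cache it locally
	// so informers can start even if the root shard later becomes unavailable.
	fmt.Printf("/clusters/*/apis/tenancy.kcp.io/v1alpha1/workspaces:%s\n", identity)
}
```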