diff --git a/CODE-OF-CONDUCT.md b/CODE-OF-CONDUCT.md index 077c12368..c622543c8 100644 --- a/CODE-OF-CONDUCT.md +++ b/CODE-OF-CONDUCT.md @@ -1,3 +1,3 @@ -# The NRI Plugin Collection Project Community Code of Conduct +# The NRI Plugins Project Community Code of Conduct -The NRI Plugin Collection Project follows the [Containers Community Code of Conduct](https://github.com/containers/common/blob/main/CODE-OF-CONDUCT.md). +The NRI Plugins Project follows the [Containers Community Code of Conduct](https://github.com/containers/common/blob/main/CODE-OF-CONDUCT.md). diff --git a/README.md b/README.md index 2a048c0a3..ef6c74454 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# NRI Plugin Collection +# NRI Plugins This repository contains a collection of community maintained NRI plugins. @@ -9,4 +9,4 @@ Currently following plugins are available: | [Topology Aware][1] | resource policy | | [Balloons][1] | resource policy | -[1]: http://github.com/containers/nri-plugins/blob/main/docs/README-resource-policy.md +[1]: http://github.com/containers/nri-plugins/blob/main/docs/resource-policy/README.md diff --git a/SECURITY.md b/SECURITY.md index 6d7c62b19..8b709c43c 100644 --- a/SECURITY.md +++ b/SECURITY.md @@ -1,4 +1,4 @@ -# Security and Disclosure Information Policy for the NRI Plugin Collection Project +# Security and Disclosure Information Policy for the NRI Plugins Project * [Reporting a Vulnerability](#Reporting-a-Vulnerability) * [Security Announcements](#Security-Announcements) @@ -6,7 +6,7 @@ ## Reporting a Vulnerability -If you think you've identified a security issue in a NRI Plugin Collection project, +If you think you've identified a security issue in a NRI Plugins project, please DO NOT report the issue publicly via the Github issue tracker, mailing list, or IRC. Instead, send an email with as many details as possible to [cncf-crio-security@lists.cncf.io](mailto:cncf-crio-security@lists.cncf.io?subject=Security%20Vunerablity%20Report) or [security@containerd.io](mailto:security@containerd.io?subject=Security%20Vunerablity%20Report). diff --git a/docs/_templates/layout.html b/docs/_templates/layout.html index 2f2db268d..0e68b46ac 100644 --- a/docs/_templates/layout.html +++ b/docs/_templates/layout.html @@ -14,7 +14,7 @@
- -## Components - -### [Node Agent](/pkg/agent/) - -The node agent is a component external to CRI-RM itself. All interactions -by CRI-RM with the Kubernetes Control Plane go through the node agent with -the node agent performing any direct interactions on behalf of CRI-RM. - -The node agent communicates with NRI-RM via goroutine. The API is used to: - - push updated external configuration data to CRI-RM - -The agent interface implements the following: - - updating resource capacity of the node - - getting, setting, or removing labels on the node - - getting, setting, or removing annotations on the node - - getting, setting, or removing taints on the node - -The config interface is defined and has its gRPC server running in -CRI-RM. The agent acts as a gRPC client for this interface. The low-level -cluster interface is defined and has its gRPC server running in the agent, -with the [convenience layer](/pkg/agent) defined in -CRI-RM. CRI-RM acts as a gRPC client for the low-level plumbing interface. - -Additionally, the stock node agent that comes with CRI-RM implements schemes -for: - - configuration management for all CRI-RM instances - - management of dynamic adjustments to container resource assignments - -- - -
- - -### [Resource Manager](/pkg/resmgr/) - -CRI-RM implements a request processing pipeline and an event processing -pipeline. -The request processing pipeline takes care of proxying CRI requests and -responses between CRI clients and the CRI runtime. The event processing -pipeline processes a set of other events that are not directly related -to or the result of CRI requests. These events are typically internally -generated within CRI-RM. They can be the result of changes in the state -of some containers or the utilization of a shared system resource, which -potentially could warrant an attempt to rebalance the distribution of -resources among containers to bring the system closer to an optimal state. -Some events can also be generated by policies. - -The Resource Manager component of CRI-RM implements the basic control -flow of both of these processing pipelines. It passes control to all the -necessary sub-components of CRI-RM at the various phases of processing a -request or an event. Additionally, it serializes the processing of these, -making sure there is at most one (intercepted) request or event being -processed at any point in time. - -The high-level control flow of the request processing pipeline is as -follows: - -A. If the request does not need policying, let it bypass the processing -pipeline; hand it off for logging, then relay it to the server and the -corresponding response back to the client. - -B. If the request needs to be intercepted for policying, do the following: - 1. Lock the processing pipeline serialization lock. - 2. Look up/create cache objects (pod/container) for the request. - 3. If the request has no resource allocation consequences, do proxying - (step 6). - 4. Otherwise, invoke the policy layer for resource allocation: - - Pass it on to the configured active policy, which will - - Allocate resources for the container. - - Update the assignments for the container in the cache. - - Update any other containers affected by the allocation in the cache. - 5. Invoke the controller layer for post-policy processing, which will: - - Collect controllers with pending changes in their domain of control - - for each invoke the post-policy processing function corresponding to - the request. - - Clear pending markers for the controllers. - 6. Proxy the request: - - Relay the request to the server. - - Send update requests for any additional affected containers. - - Update the cache if/as necessary based on the response. - - Relay the response back to the client. - 7. Release the processing pipeline serialization lock. - -The high-level control flow of the event processing pipeline is one of the -following, based on the event type: - - - For policy-specific events: - 1. Engage the processing pipeline lock. - 2. Call policy event handler. - 3. Invoke the controller layer for post-policy processing (same as step 5 for requests). - 4. Release the pipeline lock. - - For metrics events: - 1. Perform collection/processing/correlation. - 2. Engage the processing pipeline lock. - 3. Update cache objects as/if necessary. - 4. Request rebalancing as/if necessary. - 5. Release pipeline lock. - - For rebalance events: - 1. Engage the processing pipeline lock. - 2. Invoke policy layer for rebalancing. - 3. Invoke the controller layer for post-policy processing (same as step 5 for requests). - 4. Release the pipeline lock. - - -### [Cache](/pkg/cache/) - -The cache is a shared internal storage location within CRI-RM. 
It tracks the -runtime state of pods and containers known to CRI-RM, as well as the state -of CRI-RM itself, including the active configuration and the state of the -active policy. The cache is saved to permanent storage in the filesystem and -is used to restore the runtime state of CRI-RM across restarts. - -The cache provides functions for querying and updating the state of pods and -containers. This is the mechanism used by the active policy to make resource -assignment decisions. The policy simply updates the state of the affected -containers in the cache according to the decisions. - -The cache's ability to associate and track changes to containers with -resource domains is used to enforce policy decisions. The generic controller -layer first queries which containers have pending changes, then invokes each -controller for each container. The controllers use the querying functions -provided by the cache to decide if anything in their resource/control domain -needs to be changed and then act accordingly. - -Access to the cache needs to be serialized. However, this serialization is -not provided by the cache itself. Instead, it assumes callers to make sure -proper protection is in place against concurrent read-write access. The -request and event processing pipelines in the resource manager use a lock to -serialize request and event processing and consequently access to the cache. - -If a policy needs to do processing unsolicited by the resource manager, IOW -processing other than handling the internal policy backend API calls from the -resource manager, then it should inject a policy event into the resource -managers event loop. This causes a callback from the resource manager to -the policy's event handler with the injected event as an argument and with -the cache properly locked. - - -### [Generic Policy Layer](/pkg/policy/policy.go) - -The generic policy layer defines the abstract interface the rest of CRI-RM -uses to interact with policy implementations and takes care of the details -of activating and dispatching calls through to the configured active policy. - - -### [Generic Resource Controller Layer](/pkg/control/control.go) - -The generic resource controller layer defines the abstract interface the rest -of CRI-RM uses to interact with resource controller implementations and takes -care of the details of dispatching calls to the controller implementations -for post-policy enforcment of decisions. - - -### [Metrics Collector](/pkg/metrics/) - -The metrics collector gathers a set of runtime metrics about the containers -running on the node. CRI-RM can be configured to periodically evaluate this -collected data to determine how optimal the current assignment of container -resources is and to attempt a rebalancing/reallocation if it is deemed -both possible and necessary. - - -### [Policy Implementations](/cmd/) - -#### [Topology Aware](/cmd/topology-aware/) - -A topology-aware policy capable of handling multiple tiers/types of memory, -typically a DRAM/PMEM combination configured in 2-layer memory mode. - -#### [Balloons](/cmd/balloons/) - -A balloons policy allows user to define fine grained control how the -computer resources are distributed to workloads. - -### [Resource Controller Implementations](/pkg/control/) - -#### [Intel RDT](/pkg/control/rdt/) - -A resource controller implementation responsible for the practical details of -associating a container with Intel RDT classes. 
This class effectively -determines how much last level cache and memory bandwidth will be available -for the container. This controller uses the resctrl pseudo filesystem of the -Linux kernel for control. - -#### [Block I/O](/pkg/control/blockio/) - -A resource controller implementation responsible for the practical details of -associating a container with a Block I/O class. This class effectively -determines how much Block I/O bandwidth will be available for the container. -This controller uses the blkio cgroup controller and the cgroupfs pseudo- -filesystem of the Linux kernel for control. - -#### [CRI](/pkg/control/cri/) - -A resource controller responsible for modifying intercepted CRI container -creation requests and creating CRI container resource update requests, -according to the changes the active policy makes to containers. diff --git a/docs/developers-guide/e2e-test.md b/docs/developers-guide/e2e-test.md deleted file mode 100644 index 560a97911..000000000 --- a/docs/developers-guide/e2e-test.md +++ /dev/null @@ -1,212 +0,0 @@ -# End-to-End tests - -## Prerequisites - -Install: -- `docker` -- `govm` - In case of errors in building `govm` with `go get`, or creating a virtual machine (`Error when creating the new VM: repository name must be canonical`), these are the workarounds: - ``` - GO111MODULE=off go get -d github.com/govm-project/govm && cd $GOPATH/src/github.com/govm-project/govm && go mod tidy && go mod download && go install && cd .. && docker build govm -f govm/Dockerfile -t govm/govm:latest - ``` - -## Usage - -Run policy tests: - -``` -[VAR=VALUE...] ./run_tests.sh policies -``` - -Run tests only on certain policy, topology, or only selected test: - -``` -[VAR=VALUE...] ./run_tests.sh policies[/POLICY[/TOPOLOGY[/testNN-*]]] -``` - -Run custom tests: - -``` -[VAR=VALUE...] ./run.sh MODE -``` - -Get help on available `VAR=VALUE`'s with `./run.sh -help`. `run_tests.sh` calls `run.sh` in order to execute selected -tests. Therefore the same `VAR=VALUE` definitions apply both scripts. - -## Test phases - -In the *setup phase* `run.sh` creates a virtual machine unless it -already exists. When it is running, tests create a single-node cluster -and launches `cri-resmgr` on it, unless they are already running. - -In the *test phase* `run.sh` runs a test script, or gives a prompt -(`run.sh> `) asking a user to run test script commands in the -`interactive` mode. *Test scripts* are `bash` scripts that can use -helper functions for running commands and observing the status of the -virtual machine and software running on it. - -In the *tear down phase* `run.sh` copies logs from the virtual machine -and finally stops or deletes the virtual machine, if that is wanted. - -## Test modes - -- `test` mode runs fast and reports `Test verdict: PASS` or - `FAIL`. The exit status is zero if and only if a test passed. - -- `play` mode runs the same phases and scripts as the `test` mode, but - slower. This is good for following and demonstrating what is - happening. - -- `interactive` mode runs the setup and tear down phases, but instead - of executing a test script it gives an interactive prompt. - -Print help to see clean up, execution speed and other options for all -modes. - -## Running from scratch and quick rerun in existing virtual machine - -The test will use `govm`-managed virtual machine named in the `vm` -environment variable. The default is `crirm-test-e2e`. If a virtual -machine with that name exists, the test will be run on it. 
Otherwise -the test will create a virtual machine with that name from -scratch. You can delete a virtual machine with `govm delete NAME`. - -If you want rerun the test many times, possibly with different test -inputs or against different versions of `cri-resmgr`, either use the -`play` mode or set `cleanup=0` in order to keep the virtual machine -after each run. Then tests will run in the same single node cluster, -and the test script will only delete running pods before launching new -ones. - -## Testing locally built cri-resmgr and cri-resmgr from github - -If you make changes to `cri-resmgr` sources and rebuild it, you can -force the test script to reinstall newly built `cri-resmgr` to -existing virtual machine before rerunning the test: - -``` -cri-resource-manager$ make -cri-resource-manager$ cd test/e2e -e2e$ reinstall_cri_resmgr=1 speed=1000 ./run.sh play -``` - -You can also let the test script build `cri-resmgr` from the github -master branch. This takes place inside the virtual machine, so your -local git sources will not be affected: - -``` -e2e$ reinstall_cri_resmgr=1 binsrc=github ./run.sh play -``` - -## Custom tests - -You can run a custom test script in a virtual machine that runs -single-node Kubernetes\* cluster. Example: - -``` -$ cat > myscript.sh << EOF -# create two pods, each requesting two CPUs -CPU=2 n=2 create guaranteed -# create four pods, no resource requests -n=4 create besteffort -# show pods -kubectl get pods -# check that the first two pods are not allowed to use the same CPUs -verify 'cpus["pod0c0"].isdisjoint(cpus["pod1c0"])' -EOF -$ ./run.sh test myscript.sh -``` - -## Custom topologies - -If you change NUMA node topology of an existing virtual machine, you -must delete the virtual machine first. Otherwise the `topology` variable -is ignored and the test will run in the existing NUMA -configuration. - -The `topology` variable is a JSON array of objects. Each object -defines one or more NUMA nodes. Keys in objects: -``` -"mem" mem (RAM) size on each NUMA node in this group. - The default is "0G". -"nvmem" nvmem (non-volatile RAM) size on each NUMA node - in this group. The default is "0G". -"cores" number of CPU cores on each NUMA node in this group. - The default is 0. -"threads" number of threads on each CPU core. - The default is 2. -"nodes" number of NUMA nodes on each die. - The default is 1. -"dies" number of dies on each package. - The default is 1. -"packages" number of packages. - The default is 1. -``` - - -Example: - -Run the test in a VM with two NUMA nodes. There are 4 CPUs (two cores, two -threads per core by default) and 4G RAM in each node -``` -e2e$ govm delete my2x4 ; vm=my2x4 topology='[{"mem":"4G","cores":2,"nodes":2}]' ./run.sh play -``` - -Run the test in a VM with 32 CPUs in total: there are two packages -(sockets) in the system, each containing two dies. Each die containing -two NUMA nodes, each node containing 2 CPU cores, each core containing -two threads. And with a NUMA node with 16G of non-volatile memory -(NVRAM) but no CPUs. - -``` -e2e$ vm=mynvram topology='[{"mem":"4G","cores":2,"nodes":2,"dies":2,"packages":2},{"nvmem":"16G"}]' ./run.sh play -``` - -## Test output - -All test output is saved under the directory in the environment -variable `outdir`. The default is `./output`. - -Executed commands with their output, exit status and timestamps are -saved under the `output/commands` directory. - -You can find Qemu output from Docker\* logs. 
For instance, output of the -most recent Qemu launced by `govm`: -``` -$ docker logs $(docker ps | awk '/govm/{print $1; exit}') -``` - -## Manual testing and debugging - -Interactive mode helps developing and debugging scripts: - -``` -$ ./run.sh interactive -... -run.sh> CPU=2 n=2 create guaranteed -``` - -You can get help on functions available in test scripts with `./run.sh -help script`, or with `help` and `help FUNCTION` when in the -interactive mode. - -If a test has stopped to a failing `verify`, you can inspect -`cri-resmgr` cache and allowed OS resources in Python\* after the test -run: - -``` -$ PYTHONPATH=+ + +
+ +## Components + +### [Node Agent](/pkg/resmgr/agent/) + +The node agent is a component internal to NRI-RP itself. All interactions +by NRI-RP with the Kubernetes Control Plane go through the node agent with +the node agent performing any direct interactions on behalf of NRI-RP. + +The agent interface implements the following functionality: + - push updated external configuration data to NRI-RP + - updating resource capacity of the node + - getting, setting, or removing labels on the node + - getting, setting, or removing annotations on the node + - getting, setting, or removing taints on the node + +The config interface is defined and has its gRPC server running in +NRI-RP. The agent acts as a gRPC client for this interface. The low-level +cluster interface is defined and has its gRPC server running in the agent, +with the [convenience layer](/pkg/resmgr/agent) defined in NRI-RP. +NRI-RP acts as a gRPC client for the low-level plumbing interface. + +Additionally, the stock node agent that comes with NRI-RP implements schemes +for: + - configuration management for all NRI-RP instances + - management of dynamic adjustments to container resource assignments + + +### [Resource Manager](/pkg/resmgr/) + +NRI-RP implements an event processing pipeline. In addition to NRI events, +it processes a set of other events that are not directly related to or the +result of NRI requests. +These events are typically internally generated within NRI-RP. They can be +the result of changes in the state of some containers or the utilization +of a shared system resource, which potentially could warrant an attempt to +rebalance the distribution of resources among containers to bring the system +closer to an optimal state. Some events can also be generated by policies. + +The Resource Manager component of NRI-RP implements the basic control +flow of the processing pipeline. It passes control to all the +necessary sub-components of NRI-RP at the various phases of processing a +request or an event. Additionally, it serializes the processing of these, +making sure there is at most one request or event being processed at any +point in time. + +The high-level control flow of the request processing pipeline is as +follows: + +A. If the request does not need policying, let it bypass the processing +pipeline; hand it off for logging, then relay it to the server and the +corresponding response back to the client. + +B. If the request needs to be intercepted for policying, do the following: + 1. Lock the processing pipeline serialization lock. + 2. Look up/create cache objects (pod/container) for the request. + 3. If the request has no resource allocation consequences, do proxying + (step 6). + 4. Otherwise, invoke the policy layer for resource allocation: + - Pass it on to the configured active policy, which will + - Allocate resources for the container. + - Update the assignments for the container in the cache. + - Update any other containers affected by the allocation in the cache. + 5. Invoke the controller layer for post-policy processing, which will: + - Collect controllers with pending changes in their domain of control + - for each invoke the post-policy processing function corresponding to + the request. + - Clear pending markers for the controllers. + 6. Proxy the request: + - Relay the request to the server. + - Send update requests for any additional affected containers. + - Update the cache if/as necessary based on the response. + - Relay the response back to the client. + 7. 
Release the processing pipeline serialization lock. + +The high-level control flow of the event processing pipeline is one of the +following, based on the event type: + + - For policy-specific events: + 1. Engage the processing pipeline lock. + 2. Call the policy event handler. + 3. Invoke the controller layer for post-policy processing (same as step 5 for requests). + 4. Release the pipeline lock. + - For metrics events: + 1. Perform collection/processing/correlation. + 2. Engage the processing pipeline lock. + 3. Update cache objects as/if necessary. + 4. Request rebalancing as/if necessary. + 5. Release the pipeline lock. + - For rebalance events: + 1. Engage the processing pipeline lock. + 2. Invoke the policy layer for rebalancing. + 3. Invoke the controller layer for post-policy processing (same as step 5 for requests). + 4. Release the pipeline lock. + + +### [Cache](/pkg/resmgr/cache/) + +The cache is a shared internal storage location within NRI-RP. It tracks the +runtime state of pods and containers known to NRI-RP, as well as the state +of NRI-RP itself, including the active configuration and the state of the +active policy. The cache is saved to permanent storage in the filesystem and +is used to restore the runtime state of NRI-RP across restarts. + +The cache provides functions for querying and updating the state of pods and +containers. This is the mechanism used by the active policy to make resource +assignment decisions. The policy simply updates the state of the affected +containers in the cache according to the decisions. + +The cache's ability to associate and track changes to containers with +resource domains is used to enforce policy decisions. The generic controller +layer first queries which containers have pending changes, then invokes each +controller for each container. The controllers use the querying functions +provided by the cache to decide if anything in their resource/control domain +needs to be changed and then act accordingly. + +Access to the cache needs to be serialized. However, this serialization is +not provided by the cache itself. Instead, it expects callers to make sure +proper protection is in place against concurrent read-write access. The +request and event processing pipelines in the resource manager use a lock to +serialize request and event processing and consequently access to the cache. + +If a policy needs to do processing unsolicited by the resource manager, in other +words processing other than handling the internal policy backend API calls from the +resource manager, then it should inject a policy event into the resource +manager's event loop. This causes a callback from the resource manager to +the policy's event handler with the injected event as an argument and with +the cache properly locked. + + +### [Generic Policy Layer](/pkg/resmgr/policy/policy.go) + +The generic policy layer defines the abstract interface the rest of NRI-RP +uses to interact with policy implementations and takes care of the details +of activating and dispatching calls through to the configured active policy. + + +### [Generic Resource Controller Layer](/pkg/resmgr/control/control.go) + +The generic resource controller layer defines the abstract interface the rest +of NRI-RP uses to interact with resource controller implementations and takes +care of the details of dispatching calls to the controller implementations +for post-policy enforcement of decisions.
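The serialization described above follows a simple pattern: every entry point into the pipeline takes the same lock before touching the cache, the policy, or the controllers. The following Go sketch illustrates that pattern for policy-specific events; all identifiers (`ResourceManager`, `Policy`, `Controllers`, `Cache`, `processPolicyEvent`) are hypothetical stand-ins for illustration only, not the actual NRI-RP types.

```go
package resmgr

import "sync"

// Hypothetical stand-ins for NRI-RP internals, used only to illustrate the
// control flow described above; these are not the real NRI-RP types.
type Event struct{ Kind string }
type Container struct{ ID string }

type Policy interface{ HandleEvent(*Event) error }
type Controllers interface{ PostPolicy([]*Container) error }
type Cache interface{ PendingContainers() []*Container }

type ResourceManager struct {
	sync.Mutex // the processing pipeline serialization lock
	policy      Policy
	controllers Controllers
	cache       Cache
}

// processPolicyEvent shows the serialized handling of a policy-specific event:
// engage the lock, call the policy event handler, run post-policy controller
// processing for containers with pending changes, release the lock.
func (m *ResourceManager) processPolicyEvent(e *Event) error {
	m.Lock()         // 1. engage the pipeline lock (also guards the cache)
	defer m.Unlock() // 4. released when the function returns

	if err := m.policy.HandleEvent(e); err != nil { // 2. policy event handler
		return err
	}
	// 3. post-policy processing for containers the policy marked as pending
	return m.controllers.PostPolicy(m.cache.PendingContainers())
}
```

Metrics and rebalance events follow the same locking discipline, differing only in the work performed while the lock is held.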
+ + +### [Metrics Collector](/pkg/metrics/) + +The metrics collector gathers a set of runtime metrics about the containers +running on the node. NRI-RP can be configured to periodically evaluate this +collected data to determine how optimal the current assignment of container +resources is and to attempt a rebalancing/reallocation if it is deemed +both possible and necessary. + + +### [Policy Implementations](/cmd/) + +#### [Topology Aware](/cmd/topology-aware/) + +A topology-aware policy capable of handling multiple tiers/types of memory, +typically a DRAM/PMEM combination configured in 2-layer memory mode. + +#### [Balloons](/cmd/balloons/) + +The balloons policy allows the user to define fine-grained control over how +compute resources are distributed to workloads. + +#### [Template](/cmd/template/) + +The template policy can be used as a base for developing new policies. +It provides hooks that the policy developer can fill in to define fine-grained +control over how compute resources are distributed to workloads. +Do not edit the template policy directly; copy it under a new name and edit the copy. diff --git a/docs/resource-policy/developers-guide/e2e-test.md b/docs/resource-policy/developers-guide/e2e-test.md new file mode 100644 index 000000000..06adb92f5 --- /dev/null +++ b/docs/resource-policy/developers-guide/e2e-test.md @@ -0,0 +1,118 @@ +# End-to-End tests + +## Prerequisites + +Install: +- `docker` +- `vagrant` + +## Usage + +Run policy tests: + +``` +cd test/e2e +[VAR=VALUE...] ./run_tests.sh policies.test-suite +``` + +Run tests only on a certain policy or topology, or run only a selected test: + +``` +cd test/e2e +[VAR=VALUE...] ./run_tests.sh policies.test-suite[/POLICY[/TOPOLOGY[/testNN-*]]] +``` + +Get help on the available `VAR=VALUE` definitions with `./run.sh help`. +`run_tests.sh` calls `run.sh` in order to execute the selected tests. +Therefore the same `VAR=VALUE` definitions apply to both scripts. + +## Test phases + +In the *setup phase* `run.sh` creates a virtual machine unless it +already exists. When it is running, tests create a single-node cluster +and deploy the `nri-resource-policy` DaemonSet on it. + +In the *test phase* `run.sh` runs a test script. *Test scripts* are +`bash` scripts that can use helper functions for running commands and +observing the status of the virtual machine and software running on it. + +In the *tear down phase* `run.sh` copies logs from the virtual machine +and finally stops or deletes the virtual machine, if that is wanted. + +## Test modes + +- `test` mode runs fast and reports `Test verdict: PASS` or + `FAIL`. The exit status is zero if and only if a test passed. + +Currently only the normal test mode is supported. + +## Running from scratch and quick rerun in existing virtual machine + +The test will use a `vagrant`-managed virtual machine named in the +`vm_name` environment variable. The default name is constructed +from the used topology, Linux distribution and runtime name. +If a virtual machine already exists, the test will be run on it. +Otherwise the test will create a virtual machine from scratch. +You can delete a virtual machine by going to the VM directory and +giving the command `make destroy`. + +## Custom topologies + +If you change the NUMA node topology of an existing virtual machine, you +must delete the virtual machine first. Otherwise the `topology` variable +is ignored and the test will run in the existing NUMA +configuration. + +The `topology` variable is a JSON array of objects. Each object +defines one or more NUMA nodes.
Keys in objects: +``` +"mem" mem (RAM) size on each NUMA node in this group. + The default is "0G". +"nvmem" nvmem (non-volatile RAM) size on each NUMA node + in this group. The default is "0G". +"cores" number of CPU cores on each NUMA node in this group. + The default is 0. +"threads" number of threads on each CPU core. + The default is 2. +"nodes" number of NUMA nodes on each die. + The default is 1. +"dies" number of dies on each package. + The default is 1. +"packages" number of packages. + The default is 1. +``` + + +Example: + +Run the test in a VM with two NUMA nodes. There are 4 CPUs (two cores, two +threads per core by default) and 4G RAM in each node +``` +e2e$ vm_name=my2x4 topology='[{"mem":"4G","cores":2,"nodes":2}]' ./run.sh +``` + +Run the test in a VM with 32 CPUs in total: there are two packages +(sockets) in the system, each containing two dies. Each die containing +two NUMA nodes, each node containing 2 CPU cores, each core containing +two threads. And with a NUMA node with 16G of non-volatile memory +(NVRAM) but no CPUs. + +``` +e2e$ vm_name=mynvram topology='[{"mem":"4G","cores":2,"nodes":2,"dies":2,"packages":2},{"nvmem":"16G"}]' ./run.sh +``` + +## Test output + +All test output is saved under the directory in the environment +variable `outdir` if the `run.sh` script is executed as is. The default +output directory in this case is `./output`. + +For the standard e2e-tests run by `run_tests.sh`, the output directory +is constructed from used Linux distribution, container runtime name and +the used machine topology. +For example `n4c16-generic-fedora37-containerd` output directory would +indicate four node and 16 CPU system, running with Fedora 37 and having +containerd as a container runtime. + +Executed commands with their output, exit status and timestamps are +saved under the `output/commands` directory. diff --git a/docs/resource-policy/developers-guide/figures/nri-resource-policy.png b/docs/resource-policy/developers-guide/figures/nri-resource-policy.png new file mode 100644 index 000000000..fdb385c4e Binary files /dev/null and b/docs/resource-policy/developers-guide/figures/nri-resource-policy.png differ diff --git a/docs/developers-guide/index.rst b/docs/resource-policy/developers-guide/index.rst similarity index 78% rename from docs/developers-guide/index.rst rename to docs/resource-policy/developers-guide/index.rst index 4a1b83f39..dc83815f6 100644 --- a/docs/developers-guide/index.rst +++ b/docs/resource-policy/developers-guide/index.rst @@ -4,5 +4,4 @@ Developer's Guide :maxdepth: 1 architecture.md - policy-writers-guide.md testing.rst diff --git a/docs/developers-guide/testing.rst b/docs/resource-policy/developers-guide/testing.rst similarity index 100% rename from docs/developers-guide/testing.rst rename to docs/resource-policy/developers-guide/testing.rst diff --git a/docs/developers-guide/unit-test.md b/docs/resource-policy/developers-guide/unit-test.md similarity index 100% rename from docs/developers-guide/unit-test.md rename to docs/resource-policy/developers-guide/unit-test.md diff --git a/docs/resource-policy/index.rst b/docs/resource-policy/index.rst new file mode 100644 index 000000000..48d2cdd44 --- /dev/null +++ b/docs/resource-policy/index.rst @@ -0,0 +1,17 @@ +.. NRI Resource Policy documentation master file + +Resource Policy Plugin +====================== + +.. 
toctree:: + :maxdepth: 2 + :caption: Contents: + + introduction.md + quick-start.md + installation.md + setup.md + policy/index.rst + node-agent.md + + developers-guide/index.rst diff --git a/docs/resource-policy/installation.md b/docs/resource-policy/installation.md new file mode 100644 index 000000000..3b2f72d87 --- /dev/null +++ b/docs/resource-policy/installation.md @@ -0,0 +1,26 @@ +# Installation + +## Installing from sources + +You will need at least `git`, {{ '`golang '+ '{}'.format(golang_version) + '`' }} or newer, +`GNU make`, `bash`, `find`, `sed`, `head`, `date`, and `install` to be able to build and install +from sources. + +Although not recommended, you can install NRI Resource Policy from sources: + +```console + git clone https://github.com/containers/nri-plugins + make && make images +``` + +After the images are created, you can copy the tar images from `build/images` to +the target device and deploy the relevant DaemonSet deployment file, which is also +found in the images directory. + +For example, you can deploy the topology-aware resource policy like this: + +```console + cd build/images + ctr -n k8s.io image import nri-resource-policy-topology-aware-image-321ca3aad95e.tar + kubectl apply -f nri-resource-policy-topology-aware-deployment.yaml +``` diff --git a/docs/resource-policy/introduction.md b/docs/resource-policy/introduction.md new file mode 100644 index 000000000..9c5e47260 --- /dev/null +++ b/docs/resource-policy/introduction.md @@ -0,0 +1,12 @@ +# Introduction + +NRI Resource Policy is an NRI container runtime plugin. It is connected +to the container runtime implementation (containerd, cri-o) via the NRI API. +The main purpose of the NRI resource plugin is to apply hardware-aware +resource allocation policies to the containers running in the system. + +There are different policies available, each with a different set of +goals in mind and implementing different hardware allocation strategies. The +details of whether and how a container resource request is altered or +if extra actions are performed depend on which policy plugin is running +and how that policy is configured. diff --git a/docs/node-agent.md b/docs/resource-policy/node-agent.md similarity index 50% rename from docs/node-agent.md rename to docs/resource-policy/node-agent.md index 2bf6da34e..ea7a57840 100644 --- a/docs/node-agent.md +++ b/docs/resource-policy/node-agent.md @@ -1,29 +1,27 @@ -# Node Agent +# Dynamic Configuration -CRI Resource Manager can be configured dynamically using the CRI Resource -Manager Node Agent and Kubernetes\* ConfigMaps. The agent is built in the -NRI resource plugin. +The NRI Resource Policy plugin can be configured dynamically using ConfigMaps. -The agent monitors two ConfigMaps for the node, a primary node-specific one +The plugin daemon monitors two ConfigMaps for the node, a primary node-specific one and a secondary group-specific or default one, depending on whether the node belongs to a configuration group. The node-specific ConfigMap always takes precedence over the others. The names of these ConfigMaps are -1. `cri-resmgr-config.node.$NODE_NAME`: primary, node-specific configuration -2. `cri-resmgr-config.group.$GROUP_NAME`: secondary group-specific node +1. `nri-resource-policy-config.node.$NODE_NAME`: primary, node-specific configuration +2. `nri-resource-policy-config.group.$GROUP_NAME`: secondary group-specific node configuration +3. 
`nri-resource-policy-config.default`: secondary: secondary default node configuration You can assign a node to a configuration group by setting the -`cri-resource-manager.intel.com/group` label on the node to the name of +`resource-policy.nri.io/group` label on the node to the name of the configuration group. You can remove a node from its group by deleting the node group label. There is a -[sample ConfigMap spec](/sample-configs/nri-resmgr-configmap.example.yaml) +[sample ConfigMap spec](/sample-configs/nri-resource-policy-configmap.example.yaml) that contains a node-specific, a group-specific, and a default ConfigMap example. See [any available policy-specific documentation](policy/index.rst) for more information on the policy configurations. diff --git a/docs/policy/balloons.md b/docs/resource-policy/policy/balloons.md similarity index 94% rename from docs/policy/balloons.md rename to docs/resource-policy/policy/balloons.md index 4f752c77c..f6a9c80da 100644 --- a/docs/policy/balloons.md +++ b/docs/resource-policy/policy/balloons.md @@ -47,17 +47,15 @@ min and max frequencies on CPU cores and uncore. ## Deployment -### Install cri-resmgr - -Deploy cri-resmgr on each node as you would for any other policy. See -[installation](../installation.md) for more details. +Deploy nri-resource-policy-balloons on each node as you would for any +other policy. See [installation](../installation.md) for more details. ## Configuration The balloons policy is configured using the yaml-based configuration -system of CRI-RM. See [setup and -usage](../setup.md#setting-up-cri-resource-manager) for more details -on managing the configuration. +system of nri-resource-policy. +See [setup and usage](../setup.md#setting-up-nri-resource-policy) for +more details on managing the configuration. ### Parameters @@ -193,9 +191,9 @@ of a single container (`CONTAINER_NAME`). The last two annotations set the default balloon type for all containers in the pod. ```yaml -balloon.balloons.cri-resource-manager.intel.com/container.CONTAINER_NAME: BT -balloon.balloons.cri-resource-manager.intel.com/pod: BT -balloon.balloons.cri-resource-manager.intel.com: BT +balloon.balloons.resource-policy.nri.io/container.CONTAINER_NAME: BT +balloon.balloons.resource-policy.nri.io/pod: BT +balloon.balloons.resource-policy.nri.io: BT ``` If a pod has no annotations, its namespace is matched to the @@ -211,7 +209,7 @@ the `BalloonTypes` configuration. In order to enable more verbose logging and metrics exporting from the balloons policy, enable instrumentation and policy debugging from the -CRI-RM global config: +nri-resource-policy global config: ```yaml instrumentation: diff --git a/docs/policy/container-affinity.md b/docs/resource-policy/policy/container-affinity.md similarity index 90% rename from docs/policy/container-affinity.md rename to docs/resource-policy/policy/container-affinity.md index 6a1a90dab..d4a1ab2e2 100644 --- a/docs/policy/container-affinity.md +++ b/docs/resource-policy/policy/container-affinity.md @@ -2,10 +2,10 @@ ## Introduction -Some policies allow the user to give hints about how particular containers -should be *co-located* within a node. In particular these hints express whether -containers should be located *'close'* to each other or *'far away'* from each -other, in a hardware topology sense. +The topology-aware resource policy allow the user to give hints about how +particular containers should be *co-located* within a node. 
In particular these +hints express whether containers should be located *'close'* to each other or +*'far away'* from each other, in a hardware topology sense. Since these hints are interpreted always by a particular *policy implementation*, the exact definitions of 'close' and 'far' are also somewhat *policy-specific*. @@ -27,8 +27,8 @@ Policies try to place a container ## Affinity Annotation Syntax -*Affinities* are defined as the `cri-resource-manager.intel.com/affinity` annotation. -*Anti-affinities* are defined as the `cri-resource-manager.intel.com/anti-affinity` +*Affinities* are defined as the `resource-policy.nri.io/affinity` annotation. +*Anti-affinities* are defined as the `resource-manager.nri.io/anti-affinity` annotation. They are specified in the `metadata` section of the `Pod YAML`, under `annotations` as a dictionary, with each dictionary key being the name of the *container* within the Pod to which the annotation belongs to. @@ -36,7 +36,7 @@ annotation. They are specified in the `metadata` section of the `Pod YAML`, unde ```yaml metadata: anotations: - cri-resource-manager.intel.com/affinity: | + resource-manager.nri.io/affinity: | container1: - scope: key: key-ref @@ -55,13 +55,13 @@ metadata: weight: w ``` -An anti-affinity is defined similarly but using `cri-resource-manager.intel.com/anti-affinity` +An anti-affinity is defined similarly but using `resource-manager.nri.io/anti-affinity` as the annotation key. ```yaml metadata: anotations: - cri-resource-manager.intel.com/anti-affinity: | + resource-manager.nri.io/anti-affinity: | container1: - scope: key: key-ref @@ -197,7 +197,7 @@ container `wolf`. ```yaml metadata: annotations: - cri-resource-manager.intel.com/affinity: | + resource-manager.nri.io/affinity: | peter: - match: key: name @@ -205,7 +205,7 @@ metadata: values: - sheep weight: 5 - cri-resource-manager.intel.com/anti-affinity: | + resource-manager.nri.io/anti-affinity: | peter: - match: key: name @@ -223,9 +223,9 @@ one needs to give just the names of the containers, like in the example below. ```yaml annotations: - cri-resource-manager.intel.com/affinity: | + resource-manager.nri.io/affinity: | container3: [ container1 ] - cri-resource-manager.intel.com/anti-affinity: | + resource-manager.nri.io/anti-affinity: | container3: [ container2 ] container4: [ container2, container3 ] ``` @@ -243,14 +243,14 @@ The equivalent annotation in full syntax would be ```yaml metadata: annotations: - cri-resource-manager.intel.com/affinity: |+ + resource-manager.nri.io/affinity: |+ container3: - match: key: labels/io.kubernetes.container.name operator: In values: - container1 - cri-resource-manager.intel.com/anti-affinity: |+ + resource-manager.nri.io/anti-affinity: |+ container3: - match: key: labels/io.kubernetes.container.name diff --git a/docs/policy/cpu-allocator.md b/docs/resource-policy/policy/cpu-allocator.md similarity index 88% rename from docs/policy/cpu-allocator.md rename to docs/resource-policy/policy/cpu-allocator.md index 149ab669e..8d7eb0419 100644 --- a/docs/policy/cpu-allocator.md +++ b/docs/resource-policy/policy/cpu-allocator.md @@ -1,6 +1,6 @@ # CPU Allocator -CRI Resource Manager has a separate CPU allocator component that helps policies +NRI Resource Policy has a separate CPU allocator component that helps policies make educated allocation of CPU cores for workloads. Currently all policies utilize the built-in CPU allocator. See policy specific documentation for more details. 
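To make the idea of topology-aware, "educated" CPU allocation more concrete, below is a minimal Go sketch of the kind of grouping such an allocator performs: it prefers handing out CPUs that share a package and die so that a workload's CPUs end up close to each other. The types and function names here are hypothetical illustrations, not the actual allocator API in this repository.

```go
package main

import "fmt"

// CPU is a hypothetical, simplified view of one logical CPU and its
// location in the hardware topology.
type CPU struct {
	ID      int
	Package int
	Die     int
}

// pickCPUs returns count CPUs from free, preferring a single (package, die)
// group that can satisfy the whole request so the allocation stays "near".
func pickCPUs(free []CPU, count int) []CPU {
	groups := map[[2]int][]CPU{}
	order := [][2]int{}
	for _, c := range free {
		key := [2]int{c.Package, c.Die}
		if _, ok := groups[key]; !ok {
			order = append(order, key)
		}
		groups[key] = append(groups[key], c)
	}
	// First try to satisfy the request from a single group.
	for _, key := range order {
		if g := groups[key]; len(g) >= count {
			return g[:count]
		}
	}
	// Otherwise spill over group boundaries in discovery order.
	var picked []CPU
	for _, key := range order {
		for _, c := range groups[key] {
			if len(picked) == count {
				return picked
			}
			picked = append(picked, c)
		}
	}
	return picked
}

func main() {
	free := []CPU{{0, 0, 0}, {1, 0, 0}, {2, 0, 1}, {3, 1, 0}}
	fmt.Println(pickCPUs(free, 2)) // both CPUs come from package 0, die 0
}
```

The real allocator in this repository is considerably more involved and additionally takes CPU priority into account, as described in the CPU Prioritization section below.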
@@ -14,7 +14,7 @@ request "near" each other in order to minimize memory latencies between CPUs. ## CPU Prioritization The CPU allocator also does automatic CPU prioritization by detecting CPU -features and their configuration parameters. Currently, CRI Resource Manager +features and their configuration parameters. Currently, NRI Resource Policy supports CPU priority detection based on the `intel_pstate` scaling driver in the Linux CPUFreq subsystem, and, Intel Speed Select Technology (SST). @@ -26,7 +26,7 @@ priority CPUs for high priority workloads. ### Intel Speed Select Technology (SST) -CRI Resource Manager supports detection of all Intel Speed Select Technology +NRI Resource Policy supports detection of all Intel Speed Select Technology (SST) features, i.e. Speed Select Technology Performance Profile (SST-PP), Base Frequency (SST-BF), Turbo Frequency (SST-TF) and Core Power (SST-CP). @@ -47,7 +47,7 @@ and their parameterization: ### Linux CPUFreq CPUFreq based prioritization only takes effect if Intel Speed Select Technology -(SST) is disabled (or not supported). CRI-RM divides CPU cores into priority +(SST) is disabled (or not supported). NRI-RM divides CPU cores into priority classes based on two parameters: - base frequency diff --git a/docs/policy/index.rst b/docs/resource-policy/policy/index.rst similarity index 84% rename from docs/policy/index.rst rename to docs/resource-policy/policy/index.rst index 35e1647a6..35bd500de 100644 --- a/docs/policy/index.rst +++ b/docs/resource-policy/policy/index.rst @@ -7,6 +7,4 @@ Policies topology-aware.md balloons.md container-affinity.md - blockio.md - rdt.md cpu-allocator.md diff --git a/docs/policy/topology-aware.md b/docs/resource-policy/policy/topology-aware.md similarity index 90% rename from docs/policy/topology-aware.md rename to docs/resource-policy/policy/topology-aware.md index b1659c7f8..9904e891d 100644 --- a/docs/policy/topology-aware.md +++ b/docs/resource-policy/policy/topology-aware.md @@ -39,8 +39,9 @@ dies, sockets, and finally the whole of the system at the root node. Leaf NUMA nodes are assigned the memory behind their controllers / zones and CPU cores with the smallest distance / access penalty to this memory. If the machine has multiple types of memory separately visible to both the kernel and user -space, for instance both DRAM and [PMEM](https://www.intel.com/content/www/us/en/products/memory-storage/optane-dc-persistent-memory.html), each zone of special type of memory -is assigned to the closest NUMA node pool. +space, for instance both DRAM and +[PMEM](https://www.intel.com/content/www/us/en/products/memory-storage/optane-dc-persistent-memory.html), +each zone of special type of memory is assigned to the closest NUMA node pool. Each non-leaf pool node in the tree is assigned the union of the resources of its children. So in practice, dies nodes end up containing all the CPU cores @@ -118,7 +119,7 @@ The `topology-aware` policy has the following features: ## Activating the Policy You can activate the `topology-aware` policy by using the following configuration -fragment in the configuration for `cri-resmgr`: +fragment in the configuration for `nri-resource-policy-topology-aware`: ```yaml policy: @@ -131,10 +132,9 @@ policy: The policy has a number of configuration options which affect its default behavior. 
These options can be supplied as part of the -[dynamic configuration](../setup.md#using-cri-resource-manager-agent-and-a-configmap) +[dynamic configuration](../setup.md#using-nri-resource-policy-agent-and-a-configmap) received via the [`node agent`](../node-agent.md), or in a fallback or forced -[configuration file](../setup.md#using-a-local-configuration-from-a-file). These -configuration options are +configuration file. These configuration options are - `PinCPU` * whether to pin workloads to assigned pool CPU sets @@ -247,11 +247,11 @@ following Pod annotation. metadata: annotations: # opt in container C1 to shared CPU core allocation - prefer-shared-cpus.cri-resource-manager.intel.com/container.C1: "true" + prefer-shared-cpus.resource-policy.nri.io/container.C1: "true" # opt in the whole pod to shared CPU core allocation - prefer-shared-cpus.cri-resource-manager.intel.com/pod: "true" + prefer-shared-cpus.resource-policy.nri.io/pod: "true" # selectively opt out container C2 from shared CPU core allocation - prefer-shared-cpus.cri-resource-manager.intel.com/container.C2: "false" + prefer-shared-cpus.resource-policy.nri.io/container.C2: "false" ``` Opting in to exclusive allocation happens by opting out from shared allocation, @@ -265,11 +265,11 @@ allocation using the following Pod annotation. metadata: annotations: # opt in container C1 to isolated exclusive CPU core allocation - prefer-isolated-cpus.cri-resource-manager.intel.com/container.C1: "true" + prefer-isolated-cpus.resource-policy.nri.io/container.C1: "true" # opt in the whole pod to isolated exclusive CPU core allocation - prefer-isolated-cpus.cri-resource-manager.intel.com/pod: "true" + prefer-isolated-cpus.resource-policy.nri.io/pod: "true" # selectively opt out container C2 from isolated exclusive CPU core allocation - prefer-isolated-cpus.cri-resource-manager.intel.com/container.C2: "false" + prefer-isolated-cpus.resource-policy.nri.io/container.C2: "false" ``` These Pod annotations have no effect on containers which are not eligible for @@ -277,12 +277,12 @@ exclusive allocation. ### Implicit Hardware Topology Hints -`CRI Resource Manager` automatically generates HW `Topology Hints` for devices +`NRI Resource Policy` automatically generates HW `Topology Hints` for devices assigned to a container, prior to handing the container off to the active policy for resource allocation. The `topology-aware` policy is hint-aware and normally -takes topology hints into account when picking the best pool to allocate -resources. Hints indicate optimal `HW locality` for device access and they can -alter significantly which pool gets picked for a container. +takes topology hints into account when picking the best pool to allocate resources. +Hints indicate optimal `HW locality` for device access and they can alter +significantly which pool gets picked for a container. Since device topology hints are implicitly generated, there are cases where one would like the policy to disregard them altogether. For instance, when a local @@ -295,11 +295,11 @@ pool selection using the following Pod annotations. 
metadata: annotations: # only disregard hints for container C1 - topologyhints.cri-resource-manager.intel.com/container.C1: "false" + topologyhints.resource-policy.nri.io/container.C1: "false" # disregard hints for all containers by default - topologyhints.cri-resource-manager.intel.com/pod: "false" + topologyhints.resource-policy.nri.io/pod: "false" # but take hints into account for container C2 - topologyhints.cri-resource-manager.intel.com/container.C2: "true" + topologyhints.resource-policy.nri.io/container.C2: "true" ``` Topology hint generation is globally enabled by default. Therefore, using the @@ -336,8 +336,8 @@ begin with. Cold start is configured like this in the pod metadata: ```yaml metadata: annotations: - memory-type.cri-resource-manager.intel.com/container.container1: dram,pmem - cold-start.cri-resource-manager.intel.com/container.container1: | + memory-type.resource-policy.nri.io/container.container1: dram,pmem + cold-start.resource-policy.nri.io/container.container1: | duration: 60s ``` @@ -348,9 +348,9 @@ future release: ```yaml metadata: annotations: - cri-resource-manager.intel.com/memory-type: | + resource-policy.nri.io/memory-type: | container1: dram,pmem - cri-resource-manager.intel.com/cold-start: | + resource-policy.nri.io/cold-start: | container1: duration: 60s ``` @@ -386,7 +386,7 @@ every two seconds from DRAM to PMEM. ## Container memory requests and limits -Due to inaccuracies in how `cri-resmgr` calculates memory requests for +Due to inaccuracies in how `nri-resource-policy` calculates memory requests for pods in QoS class `Burstable`, you should either use `Limit` for setting the amount of memory for containers in `Burstable` pods to provide `cri-resmgr` with an exact copy of the resource requirements from the Pod Spec as an extra @@ -427,6 +427,6 @@ For example: ```yaml metadata: annotations: - prefer-reserved-cpus.cri-resource-manager.intel.com/pod: "true" - prefer-reserved-cpus.cri-resource-manager.intel.com/container.special: "false" + prefer-reserved-cpus.resource-policy.nri.io/pod: "true" + prefer-reserved-cpus.resource-policy.nri.io/container.special: "false" ``` diff --git a/docs/resource-policy/quick-start.md b/docs/resource-policy/quick-start.md new file mode 100644 index 000000000..254007682 --- /dev/null +++ b/docs/resource-policy/quick-start.md @@ -0,0 +1,85 @@ +# Quick-start + +The following describes the minimum number of steps to get started with NRI +Resource Policy plugin. + +## Pre-requisites + +- containerd or cri-o container runtime installed and running, and also + NRI feature enabled. +- kubelet installed on your nodes + +Note that for both the containerd and cri-o must have NRI support enabled. +For containerd, the NRI is currently only available in 1.7beta or later release. +For cri-o it is recommended to use version 1.26.0 or later. + +## Setup NRI Resource Policy Plugin + +First, compile the resource plugins and create deployment image. + +```console +git clone https://github.com/containers/nri-plugins +make && make images +``` + +### Deploy Daemonset + +The build/ directory will contain the needed images and deployment +files. Copy the plugin .yaml file and corresponding image file into +the node and deploy it there. 
+For example: + +```console +ls build/images + nri-resource-policy-balloons-deployment.yaml + nri-resource-policy-balloons-image-ed6fffe77071.tar + nri-resource-policy-topology-aware-deployment.yaml + nri-resource-policy-topology-aware-image-9797e8de7107.tar +``` + +Copy the nri-resource-policy-topology-aware-deployment.yaml and the +latest tar file that was generated by the `make images` command. + +To enable NRI in containerd, first create a fresh default config file, backing up +the old one if it existed: + +```console +[ -f /etc/containerd/config.toml ] && cp /etc/containerd/config.toml /etc/containerd/config.toml.backup +containerd config default > /etc/containerd/config.toml +``` + +Edit the `/etc/containerd/config.toml` file, change the `plugins."io.containerd.nri.v1.nri"` +option `disable = true` to `disable = false`, and restart containerd. + +If you are running cri-o, NRI can be enabled like this: + +```console +mkdir -p /etc/crio/crio.conf.d +cat > /etc/crio/crio.conf.d/10-enable-nri.conf <