This is a new major release of NRI Reference Plugins. It brings several new features, a number of bug fixes, and improvements to the build system, CI, end-to-end tests, and test coverage.
What's New
Balloons Policy
- New `preserve` policy option enables matching containers whose CPU and memory affinity must not be modified by the resource policy. This lets selected containers access all CPUs and memories. For example, allow pcm-sensor-server to access MSRs on every CPU for low-level metrics:

      preserve:
        matchExpressions:
          - key: pod/labels/app.kubernetes.io/name
            operator: In
            values:
              - pcm-sensor-server

  Earlier this required `cpu.preserve.resource-policy.nri.io` and `memory.preserve.resource-policy.nri.io` pod annotations.
- New `freqGovernor` CPU class option enables setting the CPU frequency governor based on the CPU class of a balloon. Example:

      balloonTypes:
        - name: powersaving
          cpuClass: mypowersave
      control:
        cpu:
          classes:
            mypowersave:
              freqGovernor: powersave
- New `memoryTypes` balloon type option specifies required memory types when setting memory affinity. For example, containers in high-memory-bandwidth balloons will use only HBM when configured as:

      balloonTypes:
        - name: high-memory-bandwidth
          memoryTypes:
            - HBM
- Support `memory-type.resource-policy.nri.io` pod annotation for setting memory affinity to the closest HBM, DRAM, PMEM, or any combination. This annotation is a pod-level override of the `memoryTypes` balloon type option.
- L2-cache group aware CPU allocation and sharing. For example, containers in a balloon can be allowed to burst on idle (unallocated) CPUs that share the same L2 cache as CPUs allocated to the balloon:

      balloonTypes:
        - name: l2burst
          shareIdleCPUsInSame: l2cache
- Balloon-type-level override of the `pinMemory` policy option. Enables setting memory affinity of containers only in certain balloons while leaving others unset, and vice versa. Example:

      pinMemory: false
      balloonTypes:
        - name: latency-sensitive
          pinMemory: true
          preferIsolCpus: true
          preferNewBalloons: true
- New default configuration runs Guaranteed containers on dedicated CPUs, while BestEffort and Burstable containers are allowed to share the remaining CPUs on the same socket without crossing socket boundaries.
- Balance BestEffort containers between balloons with equal amounts of available resources.
- Lower risk of OOMs with `pinMemory: true`, as memory affinity was refactored to use smart libmem.
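
The `preserve` example in the list above matches on the pod label `app.kubernetes.io/name`, so it only takes effect for pods that carry that label. A minimal sketch of such a pod follows. The pod name, container name, and image are hypothetical placeholders; only the label key and value are taken from the match expression.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pcm-sensor-server                         # hypothetical pod name
  labels:
    # Matched by the preserve expression:
    #   key: pod/labels/app.kubernetes.io/name, operator: In, values: [pcm-sensor-server]
    app.kubernetes.io/name: pcm-sensor-server
spec:
  containers:
    - name: pcm-sensor-server                     # hypothetical container name
      image: example.com/pcm-sensor-server:latest # placeholder image reference
```

Matching on a label rather than on per-pod annotations means the exception is declared once in the policy configuration and applies to every pod carrying the label.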
Topology Aware Policy
The Topology Aware policy can now export Prometheus metrics per topology zone. Exported metrics include the pool CPU set and memory set; the total capacity, allocations, and available capacity of the shared CPU subpool; the total capacity, allocations, and available amount of memory; and the number of assigned containers and of containers in the shared subpool.
To enable exporting these metrics, make sure that you are running with the latest policy configuration custom resource definition and that you have `policy` included in the `spec/instrumentation/metrics/enabled` slice, like this:

    ...
    spec:
      ...
      instrumentation:
        ...
        metrics:
          enabled:
            - policy
    ...
The Topology Aware policy can now use data from the kubelet's Pod Resource API to generate extra topology hints for resource allocation and alignment. These hints are disabled in the default configuration installed by Helm charts. To enable them, make sure that you are running with the latest policy configuration custom resource definition and that you have `spec/agent/podResourceAPI` set to true in the configuration, like this:

    spec:
      agent:
        ...
        podResourceAPI: true
      ...
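
For orientation, the two settings described above could sit together in one configuration custom resource roughly as sketched below. The `apiVersion`, `kind`, and `metadata.name` values are assumptions and are not stated in these notes, so check them against the CRD installed in your cluster. All other policy settings are omitted.

```yaml
apiVersion: config.nri/v1alpha1   # assumed API group and version
kind: TopologyAwarePolicy         # assumed resource kind
metadata:
  name: default                   # assumed resource name
spec:
  agent:
    # Generate extra topology hints from the kubelet's Pod Resource API.
    podResourceAPI: true
  instrumentation:
    metrics:
      enabled:
        # Export the per-zone Topology Aware policy metrics.
        - policy
```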
- Support `memory-type.resource-policy.nri.io` pod annotation for setting memory affinity to the closest HBM, DRAM, or PMEM, or any combination.
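
As an illustration, a pod requesting a combination of memory types might be annotated as in the sketch below. The `/pod` key suffix and the comma-separated value syntax are assumptions based on how other resource-policy annotations are commonly scoped, and the pod name, container name, and image are hypothetical. Consult the policy documentation for the exact key and value format.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-type-demo                        # hypothetical pod name
  annotations:
    # Assumed pod-wide form of the annotation; value lists the wanted memory types.
    memory-type.resource-policy.nri.io/pod: HBM,DRAM
spec:
  containers:
    - name: app                                 # hypothetical container name
      image: example.com/app:latest             # placeholder image reference
```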
What's Changed
Balloons Policy Fixes and Improvements
- balloons: add "preserve" option to match containers whose pinning must not be modified by @askervin in #368
- balloons: add support for cpu frequency governor tuning by @fmuyassarov in #374
- balloons: set frequency scaling governor only when requested by @fmuyassarov in #379
- balloons: improve handling of containers with no CPU requests by @askervin in #386
- balloons: add debug logging to selecting a balloon type by @askervin in #396
- balloons: support for L2 cache cluster allocation by @askervin in #384
- balloons: add memoryTypes to balloon types by @askervin in #395
- Add balloon type specific pinMemory option by @askervin in #451
Topology Aware Policy Fixes and Improvements
- metrics: add topology-aware policy metrics collection. by @klihub in #406
- topology-aware: correctly reconfigure implicit affinities for configuration changes. by @klihub in #394
- fixes: copy assigned memory zone in grant clone. by @klihub in #413
New Policy Agnostic Metrics, Common De Facto Exporters
- metrics: cleanup metrics registration, collection and gathering. by @klihub in #403
- metrics: add de-facto standard collectors. by @klihub in #404
- metrics: simplify policy/backend metrics collection interface. by @klihub in #408
- metrics: add policy system collector. by @klihub in #405
Topology Hints Based on Pod Resource API
- podresapi: agent,config,helm: make agent runtime configurable. by @klihub in #418
- podresapi: resmgr,agent: generate topology hints from Pod Resource API. by @klihub in #419
- podresapi: topology-aware: use Pod Resource API hints if present. by @klihub in #420
- agent,resmgr: merge PodResources{List,Map}, cache last List() result. by @klihub in #423
Common Resource Management Fixes and Improvements
- resmgr: fix "qosclass" in policy expressions by @askervin in #387
- resmgr,agent: propagate startup config error back to CR. by @klihub in #416
- libmem: implement policy-agnostic memory allocation/accounting. by @klihub in #332
- libmem: typo and thinko fixes. by @klihub in #381
- sysfs: enable faking CPU cache configurations using OVERRIDE_SYS_CACHES by @askervin in #383
- cpuallocator, plugins: handle priority as an option. by @klihub in #414
- Fix typos in expression code doc and matchExpression yamls by @askervin in #370
Helm Chart and Configuration Fixes and Improvements
- helm: enable prometheus autodiscovery by @klihub in #393
- helm: new balloons default configuration by @askervin in #391
- apis/config: use consistent assignment in +kubebuilder:validation tags. by @klihub in #397
- sample-configs: fix a copy-pasted comment thinko. by @klihub in #402
End-to-end Testing Fixes and Improvements
- e2e: pull and save runtime logs after each test. by @klihub in #367
- e2e: adjust metrics test for updated PrettyName(). by @klihub in #366
- e2e: switch default test distro to fedora/40-cloud-base. by @klihub in #375
- e2e: fix provisioning for Ubuntu cloud image. by @klihub in #377
- e2e: enable vagrant debugging. by @klihub in #376
- e2e: adjust $VM_HOSTNAME for policy node config usage. by @klihub in #378
- e2e: skip long running tests by default. by @klihub in #373
- e2e: fix command filenames in test output directories by @askervin in #390
- e2e: containerd 2.0.0. provisioning fixup. by @klihub in #400
- e2e/balloons: remove unknown/unused helm-launch argument. by @klihub in #407
Build Environment Fixes and Improvements
- build: enable building debug binaries and images by @askervin in #388
- build: update controller-tools to v0.16.5. by @klihub in #398
- build: enable race-detector in DEBUG=1 builds. by @klihub in #409
- build: enable race-detector in image build, too. by @klihub in #410
- dev: add Tiltfile for local development by @fmuyassarov in #382
- Tilt: turn on prometheus metrics exporting by default for local development by @fmuyassarov in #411
- images: fix FromAsCasing warnings by @fmuyassarov in #380
- fixes: fix vagrant dotenv loading and default qemu directory. by @klihub in #389
- Migrate code-gen to kube_codegen.sh by @fmuyassarov in #412
- operator: ensure tree is restored to a clean state by @fmuyassarov in #415
- docs: fix build error, avoid testdata scan infinite loop. by @klihub in #421
Dependency Updates
- golang: bump golang version to 1.22.10. by @klihub in #453
- go mod: bump goresctrl to latest by @fmuyassarov in #369
Codespell Fixes, Codespell Now Enabled in CI
- .codespell*,.github,*: add codespell configuration, workflow, fix codespell errors. by @klihub in #356
- .github: one more codespell fix. by @klihub in #371
- codespell: ignore more files. by @klihub in #372
- .github: add workflow for rejecting PRs that introduce whitespace errors. by @klihub in #171
Golangci-lint Fixes, Golangci-lint Now Enabled in CI
- template: golangci-lint fixes. by @klihub in #445
- memtierd: golangci-lint fixes. by @klihub in #444
- memory-qos: golangci-lint fixes. by @klihub in #443
- scripts: clean up all tests VMs after test. by @klihub in #446
- cache: update tests. by @klihub in #422
- apis/resmgr: golangci-lint fixes. by @klihub in #426
- cache: golangci-lint fixes. by @klihub in #427
- cgroups: golangci-lint fixes. by @klihub in #428
- cgroupstats: golangci-lint fixes. by @klihub in #429
- config: golangci-lint fixes. by @klihub in #430
- resmgr/control: golangci-lint fixes. by @klihub in #431
- cpuallocator: golangci-lint fixes. by @klihub in #432
- healthz: golangci-lint fixes. by @klihub in #433
- http: golangci-lint fixes. by @klihub in #434
- instrumentation: golangci-lint fixes. by @klihub in #435
- libmem: golangci-lint fixes. by @klihub in #436
- log: golangci-lint fixes. by @klihub in #437
- metrics: golangci-lint fixes. by @klihub in #438
- agent: golangci-lint fixes. by @klihub in #425
- pidfile: golangci-lint fixes. by @klihub in #439
- resmgr/policy: golangci-lint fixes. by @klihub in #440
- resmgr: golangci-lint fixes. by @klihub in #441
- balloons: golangci-lint fixes. by @klihub in #447
- topology-aware: golangci-lint fixes. by @klihub in #448
- sysfs: golangci-lint fixes. by @klihub in #442
- .github: add golangci-lint, split PR verification into multiple jobs. by @klihub in #450
Full Changelog
Full Changelog: v0.7.1...v0.8.0