Skip to content

Commit

Permalink
Merge branch 'kubernetes:master' into add-strict-mode-to-force-retrie…
Browse files Browse the repository at this point in the history
…s-during-prometheus-history-loading
  • Loading branch information
Evedel authored Oct 28, 2023
2 parents 7d3747a + 4d8e654 commit c837778
Show file tree
Hide file tree
Showing 1,650 changed files with 333,311 additions and 70,071 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ jobs:
- name: Set up Go
uses: actions/setup-go@v2
with:
go-version: '>=1.20.0'
go-version: '1.21.3'

- uses: actions/checkout@v2
with:
Expand Down
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -28,3 +28,5 @@
Session.vim
.netrwhist

# Binary files
bin/
2 changes: 2 additions & 0 deletions addon-resizer/OWNERS
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
approvers:
- kwiesmueller
- jbartosik
reviewers:
- kwiesmueller
- jbartosik
emeritus_approvers:
- bskiba # 2022-09-30
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
# KEP-5546: Automatic reload of nanny configuration when updated

<!-- toc -->
- [Summary](#summary)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Notes](#notes)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
<!-- /toc -->

Sure, here's the enhancement proposal in the requested format:

## Summary
- **Goals:** The goal of this enhancement is to improve the user experience for applying nanny configuration changes in the addon-resizer 1.8 when used with the metrics server. The proposed solution involves automatically reloading the nanny configuration whenever changes occur, eliminating the need for manual intervention and sidecar containers.
- **Non-Goals:** This proposal does not aim to update the functional behavior of the addon-resizer.

## Proposal
The proposed solution involves updating the addon-resizer with the following steps:
- Create a file system watcher using `fsnotify` under `utils/fswatcher` to watch nanny configurations' changes. It should run as a goroutine in the background.
- Detect changes of the nanny configurations' file using the created `fswatcher` trigger the reloading process when configuration changes are detected. Events should be sent in a channel.
- Re-execute the method responsible for building the NannyConfiguration `loadNannyConfiguration` to apply the updated configuration to the addon-resizer.
- Proper error handling should be implemented to manage scenarios where the configuration file is temporarily inaccessible or if there are parsing errors in the configuration file.

### Risks and Mitigations
- There is a potential risk of filesystem-related issues causing the file watcher to malfunction. Proper testing and error handling should be implemented to handle such scenarios gracefully.
- Errors in the configuration file could lead to unexpected behavior or crashes. The addon-resizer should handle parsing errors and fall back to the previous working configuration if necessary.

## Design Details
- Create a new package for the `fswatcher` under `utils/fswatcher`. It would contain the `fswatcher` struct and methods and unit-tests.
- `FsWatcher` struct would look similar to this:
```go
type FsWatcher struct {
*fsnotify.Watcher

Events chan struct{}
ratelimit time.Duration
names []string
paths map[string]struct{}
}
```
- Implement the following functions:
- `CreateFsWatcher`: Instantiates a new `FsWatcher` and start watching on file system.
- `initWatcher`: Initializes the `fsnotify` watcher and initialize the `paths` that would be watched.
- `add`: Adds a new file to watch.
- `reset`: Re-initializes the `FsWatcher`.
- `watch`: watches for the configured files.
- In the main function, we create a new `FsWatcher` and then we wait in an infinite loop to receive events indicating
filesystem changes. Based on these changes, we re-execute `loadNannyConfiguration` function.

> **Note:** The expected configuration file format is YAML. It has the same structure as the NannyConfiguration CRD.

### Test Plan
To ensure the proper functioning of the enhanced addon-resizer, the following test plan should be executed:
1. **Unit Tests:** Write unit tests to validate the file watcher's functionality and ensure it triggers events when the configuration file changes.
2. **Manual e2e Tests:** Deploy the addon-resizer with `BaseMemory` of `300Mi` and then we change the `BaseMemory` to `100Mi`. We should observer changes in the behavior of watched pod.
[fsnotify]: https://github.com/fsnotify/fsnotify
2 changes: 1 addition & 1 deletion builder/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

FROM golang:1.20.4
FROM golang:1.21.3
LABEL maintainer="Marcin Wielgus <[email protected]>"

ENV GOPATH /gopath/
Expand Down
2 changes: 1 addition & 1 deletion charts/cluster-autoscaler/Chart.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -11,4 +11,4 @@ name: cluster-autoscaler
sources:
- https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
type: application
version: 9.29.1
version: 9.29.4
1 change: 1 addition & 0 deletions charts/cluster-autoscaler/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -411,6 +411,7 @@ vpa:
| rbac.serviceAccount.name | string | `""` | The name of the ServiceAccount to use. If not set and create is `true`, a name is generated using the fullname template. |
| replicaCount | int | `1` | Desired number of pods |
| resources | object | `{}` | Pod resource requests and limits. |
| secretKeyRefNameOverride | string | `""` | Overrides the name of the Secret to use when loading the secretKeyRef for AWS and Azure env variables |
| securityContext | object | `{}` | [Security context for pod](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/) |
| service.annotations | object | `{}` | Annotations to add to service |
| service.create | bool | `true` | If `true`, a Service will be created. |
Expand Down
11 changes: 9 additions & 2 deletions charts/cluster-autoscaler/templates/clusterrole.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -151,15 +151,22 @@ rules:
- cluster.x-k8s.io
resources:
- machinedeployments
- machinedeployments/scale
- machinepools
- machinepools/scale
- machines
- machinesets
verbs:
- get
- list
- update
- watch
- apiGroups:
- cluster.x-k8s.io
resources:
- machinedeployments/scale
- machinepools/scale
verbs:
- get
- patch
- update
{{- end }}
{{- end -}}
22 changes: 11 additions & 11 deletions charts/cluster-autoscaler/templates/deployment.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,7 @@ spec:
- --node-group-auto-discovery=mig:namePrefix={{ .name }},min={{ .minSize }},max={{ .maxSize }}
{{- end }}
{{- end }}
{{- if eq .Values.cloudProvider "oci-oke" }}
{{- if eq .Values.cloudProvider "oci" }}
{{- if .Values.cloudConfigPath }}
- --nodes={{ .minSize }}:{{ .maxSize }}:{{ .name }}
- --balance-similar-node-groups
Expand Down Expand Up @@ -132,36 +132,36 @@ spec:
valueFrom:
secretKeyRef:
key: AwsAccessKeyId
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
{{- end }}
{{- if .Values.awsSecretAccessKey }}
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
key: AwsSecretAccessKey
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
{{- end }}
{{- else if eq .Values.cloudProvider "azure" }}
- name: ARM_SUBSCRIPTION_ID
valueFrom:
secretKeyRef:
key: SubscriptionID
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
- name: ARM_RESOURCE_GROUP
valueFrom:
secretKeyRef:
key: ResourceGroup
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
- name: ARM_VM_TYPE
valueFrom:
secretKeyRef:
key: VMType
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
- name: AZURE_CLUSTER_NAME
valueFrom:
secretKeyRef:
key: ClusterName
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
{{- if .Values.azureUseWorkloadIdentityExtension }}
- name: ARM_USE_WORKLOAD_IDENTITY_EXTENSION
value: "true"
Expand All @@ -173,22 +173,22 @@ spec:
valueFrom:
secretKeyRef:
key: TenantID
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
- name: ARM_CLIENT_ID
valueFrom:
secretKeyRef:
key: ClientID
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
- name: ARM_CLIENT_SECRET
valueFrom:
secretKeyRef:
key: ClientSecret
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
- name: AZURE_NODE_RESOURCE_GROUP
valueFrom:
secretKeyRef:
key: NodeResourceGroup
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
{{- end }}
{{- end }}
{{- range $key, $value := .Values.extraEnv }}
Expand Down
11 changes: 9 additions & 2 deletions charts/cluster-autoscaler/templates/role.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -49,16 +49,23 @@ rules:
- cluster.x-k8s.io
resources:
- machinedeployments
- machinedeployments/scale
- machinepools
- machinepools/scale
- machines
- machinesets
verbs:
- get
- list
- update
- watch
- apiGroups:
- cluster.x-k8s.io
resources:
- machinedeployments/scale
- machinepools/scale
verbs:
- get
- patch
- update
{{- end }}
{{- if ( not .Values.rbac.clusterScoped ) }}
- apiGroups:
Expand Down
12 changes: 9 additions & 3 deletions charts/cluster-autoscaler/templates/secret.yaml
Original file line number Diff line number Diff line change
@@ -1,11 +1,16 @@
{{- if or (eq .Values.cloudProvider "azure") (and (eq .Values.cloudProvider "aws") (not (has "" (list .Values.awsAccessKeyID .Values.awsSecretAccessKey)))) }}
{{- if not .Values.secretKeyRefNameOverride }}
{{- $isAzure := eq .Values.cloudProvider "azure" }}
{{- $isAws := eq .Values.cloudProvider "aws" }}
{{- $awsCredentialsProvided := and .Values.awsAccessKeyID .Values.awsSecretAccessKey }}

{{- if or $isAzure (and $isAws $awsCredentialsProvided) }}
apiVersion: v1
kind: Secret
metadata:
name: {{ template "cluster-autoscaler.fullname" . }}
namespace: {{ .Release.Namespace }}
data:
{{- if eq .Values.cloudProvider "azure" }}
{{- if $isAzure }}
ClientID: "{{ .Values.azureClientID | b64enc }}"
ClientSecret: "{{ .Values.azureClientSecret | b64enc }}"
ResourceGroup: "{{ .Values.azureResourceGroup | b64enc }}"
Expand All @@ -14,8 +19,9 @@ data:
VMType: "{{ .Values.azureVMType | b64enc }}"
ClusterName: "{{ .Values.azureClusterName | b64enc }}"
NodeResourceGroup: "{{ .Values.azureNodeResourceGroup | b64enc }}"
{{- else if eq .Values.cloudProvider "aws" }}
{{- else if $isAws }}
AwsAccessKeyId: "{{ .Values.awsAccessKeyID | b64enc }}"
AwsSecretAccessKey: "{{ .Values.awsSecretAccessKey | b64enc }}"
{{- end }}
{{- end }}
{{- end }}
5 changes: 4 additions & 1 deletion charts/cluster-autoscaler/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ affinity: {}
additionalLabels: {}

autoDiscovery:
# cloudProviders `aws`, `gce`, `azure`, `magnum` and `clusterapi` `oci-oke` are supported by auto-discovery at this time
# cloudProviders `aws`, `gce`, `azure`, `magnum`, `clusterapi` and `oci` are supported by auto-discovery at this time
# AWS: Set tags as described in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#auto-discovery-setup

# autoDiscovery.clusterName -- Enable autodiscovery for `cloudProvider=aws`, for groups matching `autoDiscovery.tags`.
Expand Down Expand Up @@ -396,3 +396,6 @@ vpa:
updateMode: "Auto"
# vpa.containerPolicy -- [ContainerResourcePolicy](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler/v0.13.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L159). The containerName is always et to the deployment's container name. This value is required if VPA is enabled.
containerPolicy: {}

# secretKeyRefNameOverride -- Overrides the name of the Secret to use when loading the secretKeyRef for AWS and Azure env variables
secretKeyRefNameOverride: ""
1 change: 1 addition & 0 deletions cluster-autoscaler/.gitignore
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
cluster-autoscaler
cluster-autoscaler-amd64
cluster-autoscaler-arm64
cluster-autoscaler-s390x
cluster_autoscaler
.cover

Expand Down
51 changes: 49 additions & 2 deletions cluster-autoscaler/FAQ.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,10 +25,12 @@ this document:
* [Is Cluster Autoscaler compatible with CPU-usage-based node autoscalers?](#is-cluster-autoscaler-compatible-with-cpu-usage-based-node-autoscalers)
* [How does Cluster Autoscaler work with Pod Priority and Preemption?](#how-does-cluster-autoscaler-work-with-pod-priority-and-preemption)
* [How does Cluster Autoscaler remove nodes?](#how-does-cluster-autoscaler-remove-nodes)
* [How does Cluster Autoscaler treat nodes with status/startup/ignore taints?](#how-does-cluster-autoscaler-treat-nodes-with-taints)
* [How to?](#how-to)
* [I'm running cluster with nodes in multiple zones for HA purposes. Is that supported by Cluster Autoscaler?](#im-running-cluster-with-nodes-in-multiple-zones-for-ha-purposes-is-that-supported-by-cluster-autoscaler)
* [How can I monitor Cluster Autoscaler?](#how-can-i-monitor-cluster-autoscaler)
* [How can I increase the information that the CA is logging?](#how-can-i-increase-the-information-that-the-ca-is-logging)
* [How can I change the log format that the CA outputs?](#how-can-i-change-the-log-format-that-the-ca-outputs)
* [How can I see all the events from Cluster Autoscaler?](#how-can-i-see-all-events-from-cluster-autoscaler)
* [How can I scale my cluster to just 1 node?](#how-can-i-scale-my-cluster-to-just-1-node)
* [How can I scale a node group to 0?](#how-can-i-scale-a-node-group-to-0)
Expand Down Expand Up @@ -125,8 +127,8 @@ Since version 1.0.0 we consider CA as GA. It means that:
* We have enough confidence that it does what it is expected to do. Each commit goes through a big suite of unit tests
with more than 75% coverage (on average). We have a series of e2e tests that validate that CA works well on
[GCE](https://k8s-testgrid.appspot.com/sig-autoscaling#gce-autoscaling)
and [GKE](https://k8s-testgrid.appspot.com/sig-autoscaling#gke-autoscaling).
[GCE](https://testgrid.k8s.io/sig-autoscaling#gce-autoscaling)
and [GKE](https://testgrid.k8s.io/sig-autoscaling#gke-autoscaling).
Due to the missing testing infrastructure, AWS (or any other cloud provider) compatibility
tests are not the part of the standard development or release procedure.
However there is a number of AWS users who run CA in their production environment and submit new code, patches and bug reports.
Expand Down Expand Up @@ -248,7 +250,37 @@ Cluster Autoscaler terminates the underlying instance in a cloud-provider-depend
It does _not_ delete the [Node object](https://kubernetes.io/docs/concepts/architecture/nodes/#api-object) from Kubernetes. Cleaning up Node objects corresponding to terminated instances is the responsibility of the [cloud node controller](https://kubernetes.io/docs/concepts/architecture/cloud-controller/#node-controller), which can run as part of [kube-controller-manager](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/) or [cloud-controller-manager](https://kubernetes.io/docs/concepts/architecture/cloud-controller/).
### How does Cluster Autoscaler treat nodes with status/startup/ignore taints?
### Startup taints
Startup taints are meant to be used when there is an operation that has to complete before any pods can run on the node, e.g. drivers installation.
Cluster Autoscaler treats nodes tainted with `startup taints` as unready, but taken into account during scale up logic, assuming they will become ready shortly.
**However, if the substantial number of nodes are tainted with `startup taints` (and therefore unready) for an extended period of time the Cluster Autoscaler
might stop working as it might assume the cluster is broken and should not be scaled (creating new nodes doesn't help as they don't become ready).**
Startup taints are defined as:
- all taints with the prefix `startup-taint.cluster-autoscaler.kubernetes.io/`,
- all taints defined using `--startup-taint` flag.
### Status taints
Status taints are meant to be used when a given node should not be used to run pods for the time being.
Cluster Autoscaler internally treats nodes tainted with `status taints` as ready, but filtered out during scale up logic.
This means that even though the node is ready, no pods should run there as long as the node is tainted and if necessary a scale-up should occur.
Status taints are defined as:
- all taints with the prefix `status-taint.cluster-autoscaler.kubernetes.io/`,
- all taints defined using `--status-taint` flag.
### Ignore taints
Ignore taints are now deprecated and treated as startup taints.
Ignore taints are defined as:
- all taints with the prefix `ignore-taint.cluster-autoscaler.kubernetes.io/`,
- all taints defined using `--ignore-taint` flag.
****************
# How to?
Expand Down Expand Up @@ -788,6 +820,7 @@ The following startup parameters are supported for cluster autoscaler:
| `cordon-node-before-terminating` | Should CA cordon nodes before terminating during downscale process | false
| `record-duplicated-events` | Enable the autoscaler to print duplicated events within a 5 minute window. | false
| `debugging-snapshot-enabled` | Whether the debugging snapshot of cluster autoscaler feature is enabled. | false
| `node-delete-delay-after-taint` | How long to wait before deleting a node after tainting it. | 5 seconds

# Troubleshooting:

Expand Down Expand Up @@ -923,6 +956,20 @@ or infrastructure endpoints, then setting a value of `--v=9` will show all the i
HTTP calls made. Be aware that using verbosity levels higher than `--v=1` will generate
an increased amount of logs, prepare your deployments and storage accordingly.

### How Can I change the log format that the CA outputs?

There are 2 log format options, `text` and `json`. By default (`text`), the Cluster Autoscaler will output
logs in the [klog native format](https://kubernetes.io/docs/concepts/cluster-administration/system-logs/#klog-output).
```
I0823 17:15:11.472183 29944 main.go:569] Cluster Autoscaler 1.28.0-beta.0
```

Alternatively, adding the flag `--logging-format=json` changes the
[log output to json](https://kubernetes.io/docs/concepts/cluster-administration/system-logs/#klog-output).
```
{"ts":1692825334994.433,"caller":"cluster-autoscaler/main.go:569","msg":"Cluster Autoscaler 1.28.0-beta.0\n","v":1}
```

### What events are emitted by CA?

Whenever Cluster Autoscaler adds or removes nodes it will create events
Expand Down
Loading

0 comments on commit c837778

Please sign in to comment.