All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
4.83.0 - 2024-12-17
- Replace the silence link on Slack with silence repository link
4.82.0 - 2024-12-17
- Update Alertmanager notification template
- Update alert and query URLs for Mimir to point to the Active notification rather than the rule page
- Move link section (runbook, dashboard, explore) before instance to avoid them being lost due to OpsGenie max description being reached
- Move the warnings for missing runbook and dashboard up into the link section
- Replace Alertmanager and silence link with silence repository
- Get rid of useless `prometheus-agent` inhibitions after the migration to the new `monitoring-agent` inhibitions.
4.81.0 - 2024-10-30
- Create new `monitoring-agent` inhibitions based on the `prometheus-agent` inhibitions to be agnostic to the tool used to monitor targets.
4.80.0 - 2024-10-21
- Add `customer` label to OpsGenie alerts.
4.79.0 - 2024-09-17
- Renamed Team Tinkerers to Team Tenet
4.78.2 - 2024-09-16
- Remove unused `#alert` and `#alert-test-installation` slack integration.
4.78.1 - 2024-07-04
- Fix AlertManager configuration generation on Vintage.
4.78.0 - 2024-07-03
- Add new node not ready inhibition configurations.
- Added support to be able to disable alertmanager on mimir installations.
- Clean up unused configured inhibitions.
4.77.3 - 2024-06-19
- Pass over mimir enabled to managementcluster `remotewriteingressauth` and `remotewriteingress` resources.
4.77.2 - 2024-06-19
- Remove line-breaks in alerting links which suppress links in notifications.
4.77.1 - 2024-06-19
- Reverse ingress removal condition to remove the ingress when mimir is enabled.
4.77.0 - 2024-06-19
- Remove AlertManager link in opsgenie and slack templating when mimir is enabled.
- Remove unused scrape_timeout inhibition.
- Some improvements towards Mimir:
- Internal rework to remove the use of the generic resource to ease the migration to Mimir.
- Update generic resource so we can delete resources when mimir is enabled.
- Remove legacy prometheus resources when Mimir is enabled.
- Remove alertmanager ingress when Mimir is enabled.
- Ignore the prometheus-to-grafana-cloud prometheus in the remote write controller.
- Change Alert link to point to Mimir alerting UI when Mimir is enabled.
- Rename Prometheus link to Source
4.76.0 - 2024-06-03
- Delete per cluster heartbeats when Mimir is enabled.
- Delete per cluster heartbeat alertmanager wiring when Mimir is enabled.
4.75.1 - 2024-05-23
- Remove prometheus remote write agent configuration when mimir is enabled.
- Remove unnecessary prometheus control-plane affinity.
4.75.0 - 2024-05-13
- Add `cluster_control_plane_unhealthy` inhibition.
- Allow Prometheus Agent Sharding strategy to be overridden per cluster.
- Removed `apiserver_down` inhibition.
- Use `kubernetes.io/tls` type for TLS secrets.
4.74.0 - 2024-05-02
- Expose prometheus agent sharding strategies as prometheus-meta-operator configuration parameters so we can experiment with the scaling strategies.
4.73.1 - 2024-05-01
- Ensure proxy url is set when needed within slack_configs
4.73.0 - 2024-04-30
- To ensure that customers can define their own AlertmanagerConfig CRs, we need to remove the default alertmanager matcher injection (cf. upstream prometheus-operator/prometheus-operator#4033)
- Add `SlackApiToken` configuration directive.
4.72.0 - 2024-04-03
- Add a receiver and a route for the mimir heartbeat. They need to live here until we use mimir's alertmanager.
4.71.0 - 2024-03-19
- Add team `honeybadger` slack router and receiver.
- Remove the `azure` provider.
4.70.3 - 2024-03-18
- Fix the labelling schema by adding missing labels, to avoid issues when aggregating data coming from prometheus agents that set some extra labels compared to the existing prometheus scrape config.
4.70.2 - 2024-03-13
- Fix missing prometheus link in notification template.
4.70.1 - 2024-03-12
- Fix noop resource creation code.
4.70.0 - 2024-03-12
- Remove alerting from Prometheus if mimir is enabled.
4.69.0 - 2024-03-12
- Disable rule evaluation in Prometheus when Mimir is enabled.
- Remove `prometheus` and `prometheus_replica` external labels when Mimir is enabled.
4.68.4 - 2024-03-06
- Remove `falco-exporter` from static scrapeconfig as it is now monitored via servicemonitors.
4.68.3 - 2024-02-19
- Fix alertmanager ciliumnetworkpolicy to allow access to coredns.
4.68.2 - 2024-02-19
- Add missing ciliumnetworkpolicy for alertmanager.
4.68.1 - 2024-02-15
- Add proxy port to CiliumNetworkPolicy if needed.
4.68.0 - 2024-02-14
- Add CNP for prometheus-meta-operator to be able to talk to the api-server in locked-down clusters.
4.67.3 - 2024-02-13
- Add `update` method to cilium netpol resource.
4.67.2 - 2024-02-13
- Add grafana-cloud squid proxy port to prometheus CNP.
4.67.1 - 2024-02-13
- Fix error for already existing `ciliumNetworkPolicy`.
4.67.0 - 2024-02-12
- Add `ciliumNetworkPolicy` for all Prometheus instances on the MC.
4.66.1 - 2024-02-07
- Fix VPA to support latest Prometheus-operator version (based on observability-bundle 1.2.0) as the latest version of the Prometheus CR now supports the `scale` subresource which causes issues with VPA.
4.66.0 - 2024-02-06
- Support multi-provider Management clusters.
- Fix how we enable `remote-write-receiver` to avoid deprecation warnings.
- Fix test generation to split capi and vintage tests generated files.
- Free retention duration property of its 2 weeks limitation if the free storage allows it.
4.65.0 - 2024-01-29
- Add `CiliumNetworkPolicy` for all created Prometheuses.
- Always set `shards` to 1 for all created Prometheuses.
4.64.0 - 2024-01-19
- Improved the blackhole routing for `stable-testing` MCs to silence more alerts related to test WCs
4.63.1 - 2023-12-12
- Fix Tinkerers slack receiver repeat_interval config.
4.63.0 - 2023-12-06
- Configure `gsoci.azurecr.io` as the default container image registry.
- Change Tinkerers slack receiver repeat_interval to 2 weeks.
4.62.0 - 2023-11-27
- Group alerts by teams.
4.61.0 - 2023-11-22
- Upgrade to go 1.21
- Upgrade internal dependencies.
- Increased `group_wait` in alertmanager config from 1 to 5m.
4.60.0 - 2023-11-02
- Silence `ManagementClusterAppFailed` for WCs of `stable-testing` MCs.
4.59.0 - 2023-10-11
- Upgrade Prometheus to 2.47.1 and configure keepDroppedTargets to 5.
- Alert template: fix newlines / whitespace trimming if opsrecipe is not specified or a dashboard is specified.
4.58.0 - 2023-10-09
- Remove custom SLO handling in alertmanager config.
4.57.0 - 2023-10-04
- Fix Prometheus PSP by adding seccomp profile to RuntimeDefault.
- Handle `remoteTimeout` in RemoteWrite secret and set it to 60s (hardcoded to 30s with `prometheus-agent < 0.6.4`).
- Remove the temporary code in pmo to avoid RemoteWriteSecret update on anteater/deu01 and anteater/seu01.
4.56.0 - 2023-10-03
- Set Prometheus seccomp profile to RuntimeDefault.
4.55.0 - 2023-10-02
- Add condition for PSP installation in helm chart
4.54.1 - 2023-09-28
- Routing rule for `ClusterUnhealthyPhase` and test clusters on stable-testing MCs to route to blackhole
4.54.0 - 2023-09-28
- Add cert-manager clusterIssuer configuration option for Ingresses.
- Add support for EKS as a provider.
4.53.0 - 2023-09-27
- Temporarily avoid RemoteWriteSecret update on anteater/deu01 and anteater/seu01.
- Remove KVM related things that are not used anymore.
- Revert `prometheus-agent` max shards to 10 to prevent incessant paging.
4.52.0 - 2023-09-26
- Support absolute Grafana dashboard URLs.
- Increase `prometheus-agent` max shards to 50 to improve agent stability.
4.51.0 - 2023-09-25
- Ignore `PrometheusMetaOperatorReconcileErrors` alerts on `stable-testing`.
- Increase `group_wait` in the AlertManager config to give inhibition alerts more time to take effect.
4.50.0 - 2023-09-25
- Only send silenced page-level SLOTH alerts to `phoenix`'s slack alert channel, rather than all alerts.
4.49.2 - 2023-09-22
- Reverted support absolute Grafana dashboard URLs.
4.49.1 - 2023-09-21
- Ignore kube-proxy target on EKS or clusters with observability bundle >= 0.8.3 (where the kube-proxy service monitor is enabled).
4.49.0 - 2023-09-21
- Adapt scrape targets to EKS clusters.
- computation of number of shards: rely on max number of series over the last 6h.
- Support absolute Grafana dashboard URLs.
- Fix api server url in case the CAPI provider sets https prefix in the CAPI CR status.
4.48.0 - 2023-09-19
- Support flux-managed clusters.
4.47.0 - 2023-09-14
- Enable Opsgenie alerts for Shield.
- Change source for the organization label.
4.46.0 - 2023-08-21
- Add team `tinkerers` slack router and receiver.
- Apply Kyverno policy exception to the PMO replicaset as well.
- Fix null receiver name.
- Remove `aws-load-balancer-controller` from the list of ignored targets.
4.45.1 - 2023-07-18
- Skip alerts named `WorkloadClusterApp*` in `stable-testing` installations.
4.45.0 - 2023-07-18
- When the cluster pipeline is set to stable-testing, only route management cluster alerts to opsgenie.
- Clean up unused targets (moved to service monitors).
- Remove #harbor-implementation slack integration.
4.44.0 - 2023-07-04
- Clean up some of the vintage targets.
4.43.2 - 2023-07-03
- Set number of shards to existing value if Prometheus is not reachable to avoid a race condition on cluster creation.
4.43.1 - 2023-06-29
- Fix some security concerns.
4.43.0 - 2023-06-29
- Change shard computation to be based on number of head series.
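A shard computation driven by the number of head series could be sketched roughly as below. This is an illustration only, not the operator's actual code: the function name `computeShards`, the `seriesPerShard` threshold and the `maxShards` cap are hypothetical, and the real values are not stated in this changelog.

```go
package main

import "fmt"

// computeShards returns ceil(headSeries / seriesPerShard), clamped to the
// range [1, maxShards]. Both seriesPerShard and maxShards are hypothetical
// tuning parameters chosen for this sketch.
func computeShards(headSeries, seriesPerShard, maxShards int64) int64 {
	if seriesPerShard <= 0 || maxShards < 1 {
		return 1 // fall back to a single shard on invalid settings
	}
	// Integer ceiling division.
	shards := (headSeries + seriesPerShard - 1) / seriesPerShard
	if shards < 1 {
		shards = 1
	}
	if shards > maxShards {
		shards = maxShards
	}
	return shards
}

func main() {
	// 2.5M head series with an assumed 1M series per shard.
	fmt.Println(computeShards(2_500_000, 1_000_000, 10))
}
```

Clamping to at least one shard keeps a freshly created cluster, which has no series yet, from being scaled down to zero.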
4.42.0 - 2023-06-26
- Reroute clippy alerts to `phoenix slack` until all team labels are changed
4.41.0 - 2023-06-22
- Added scrape for `vault-etcd-backups-exporter` towards legacy vault VMs.
- Add Kyverno Policy Exceptions.
4.40.0 - 2023-06-19
- Add back Prometheus CPU limits.
- Add alert routing for `team-turtles`
- Move `gcp`, `capa` and `capz` alerts to team phoenix.
- Update dropped labels in KSM metrics to avoid duplicate samples.
- Drop unused greedy KSM metrics.
- Remove imagePullSecrets towards https://github.com/giantswarm/giantswarm/issues/27267.
4.39.0 - 2023-06-07
- Prometheus CPU limits
4.38.2 - 2023-06-02
- Add static node-exporter target to the list of ignored targets because this target is needed for releases still using node-exporter < 1.14 (< aws 18.2.0).
4.38.1 - 2023-05-31
- Add alert routing for team bigmac
4.38.0 - 2023-05-22
- Dynamically compute agent shards number according to cluster size.
4.37.0 - 2023-05-10
- Add sharding capabilities to the Prometheus Agent.
- Create new remote-write-secret Secret and remote-write-config ConfigMap per cluster to not have a bad workaround in the observability bundle.
- Fix prometheus control plane node toleration.
- Stop pushing to `openstack-app-collection`.
4.36.4 - 2023-05-02
- Fix forgotten kube-state-metrics down source labels.
4.36.3 - 2023-05-02
- Add missing node label to kubelet.
4.36.2 - 2023-05-01
- Increased heartbeat delay before alert from 25min to 60min
- Updated alertmanager heartbeat config for faster response
4.36.1 - 2023-04-28
- Keep accepting 'master' as `role` label value for etcd scraping.
4.36.0 - 2023-04-27
- Deprecate `role=master` in favor of `role=control-plane`.
4.35.4 - 2023-04-25
- Add Inhibition rule when prometheus-agent is down.
4.35.3 - 2023-04-18
- Change Atlas slack alert router to only route alerts with page and/or notify severity matcher.
- Fix list of ignored targets for Vintage WCs.
4.35.2 - 2023-04-13
- Allow PMO to patch secrets so it can remove finalizers.
4.35.1 - 2023-04-12
- Add finalizer to remote-write-config to block cluster deletion until PMO deleted the secret.
4.35.0 - 2023-04-11
- Handle prometheus scrape target removal based on the observability bundle version.
4.34.0 - 2023-04-06
- Add `loki` namespace in cAdvisor scrape config for MC.
- Fix proxy configuration as no_proxy was not respected.
4.33.0 - 2023-04-04
- Add more flexibility in the configuration so prometheus image, pvc size and so on can be overwritten by configuration.
4.32.0 - 2023-03-30
- Drop node-exporter metrics (`node_filesystem_files`, `node_filesystem_readonly`, `node_nfs_requests_total`, `node_network_carrier`, `node_network_transmit_colls_total`, `node_network_carrier_changes_total`, `node_network_transmit_packets_total`, `node_network_carrier_down_changes_total`, `node_network_carrier_up_changes_total`, `node_network_iface_id`, `node_xfs_.+`, `node_ethtool_.+`)
- Drop kong metrics (`kong_latency_count`, `kong_latency_sum`)
- Drop kube-state-metrics metrics (`kube_.+_metadata_resource_version`)
- Drop nginx-ingress-controller metrics (`nginx_ingress_controller_bytes_sent_sum`, `nginx_ingress_controller_request_size_count`, `nginx_ingress_controller_response_size_count`, `nginx_ingress_controller_response_duration_seconds_sum`, `nginx_ingress_controller_response_duration_seconds_count`, `nginx_ingress_controller_ingress_upstream_latency_seconds`, `nginx_ingress_controller_ingress_upstream_latency_seconds_sum`, `nginx_ingress_controller_ingress_upstream_latency_seconds_count`)
4.31.1 - 2023-03-28
- Prometheus-agent tuning: revert maxSamplesPerSend to 150000
4.31.0 - 2023-03-28
- Drop `rest_client_rate_limiter_duration_seconds_bucket`, `rest_client_request_size_bytes_bucket` and `rest_client_response_size_bytes_bucket` from Kubernetes component metrics.
- Drop `coredns_dns_response_size_bytes_bucket` and `coredns_dns_request_size_bytes_bucket` from coredns metrics.
- Drop `nginx_ingress_controller_connect_duration_seconds_bucket`, `nginx_ingress_controller_header_duration_seconds_bucket`, `nginx_ingress_controller_bytes_sent_count` and `nginx_ingress_controller_request_duration_seconds_sum` from nginx-ingress-controller metrics.
- Drop `kong_upstream_target_health` and `kong_latency_bucket` Kong metrics.
- Drop `kube_pod_tolerations`, `kube_pod_status_scheduled`, `kube_replicaset_metadata_generation`, `kube_replicaset_status_observed_generation`, `kube_replicaset_annotations` and `kube_replicaset_status_fully_labeled_replicas` kube-state-metrics metrics.
- Drop `promtail_request_duration_seconds_bucket` and `loki_request_duration_seconds_bucket` metrics from promtail and loki.
4.30.0 - 2023-03-28
- Remove immutable secret deletion not needed after 4.27.0.
- Remove alertmanager ownership job.
4.29.2 - 2023-03-28
- VPA settings: set memory limit to 80% node size
- Drop `awscni_assigned_ip_per_cidr` metric from aws cni.
- Drop `uid` label from kubelet.
- Drop `image_id` label from kube-state-metrics.
4.29.1 - 2023-03-27
- Prometheus remotewrite endpoints for agents: increase max body size from 10m to 50m
- Removed pod_id relabelling as it's not needed anymore.
4.29.0 - 2023-03-27
- Bump Prometheus default image to `v2.43.0`
- Prometheus-agent tuning: increase maxSamplesPerSend from 150000 to 300000
- Drop some unused metrics from cAdvisor.
- Remove draughtsman references.
4.28.0 - 2023-03-23
- Drop `id` and `name` labels from cAdvisor metrics.
4.27.0 - 2023-03-22
- Allow changes in the remote write api endpoint secret.
- The region as external label for capa, gcp and capz
- Drop `uid` label from kube-state-metrics metrics.
- Drop `container_id` label from kube-state-metrics metrics.
4.26.0 - 2023-03-20
- Prometheus resources: set requests=limits. Still allowing prometheus up to 90% of node capacity.
- Prometheus TSDB size: reduce it to 85% of disk space, to keep space for WAL before alerts fire.
- Prometheus-agent tuning: increase maxSamplesPerSend from 50000 to 150000
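The TSDB size cap described above (85% of disk space, leaving headroom for the WAL) amounts to simple arithmetic, sketched below. The helper `retentionSizeBytes` is a hypothetical name used for illustration and is not taken from the operator's code.

```go
package main

import "fmt"

// retentionSizeBytes returns the share of the volume that the TSDB may use.
// It divides before multiplying to stay clear of int64 overflow on very
// large volumes, at the cost of rounding down slightly when volumeBytes is
// not a multiple of 100.
func retentionSizeBytes(volumeBytes, percent int64) int64 {
	return volumeBytes / 100 * percent
}

func main() {
	volume := int64(100) << 30 // an assumed 100 GiB volume
	fmt.Println(retentionSizeBytes(volume, 85))
}
```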
4.25.3 - 2023-03-15
- Fix ownership job
4.25.2 - 2023-03-15
- Updated `RetentionSize` property in Prometheus CR according to Volume Storage Size (90%)
- Allow ownership job patch `alertmanagerConfigSelector` to fail in case the label has been already removed.
4.25.1 - 2023-03-14
- Followup on Alertmanager resource to Helm
- Set Alertmanager enabled by default.
- remove the label managed-by: pmo from alertmanagerConfigSelector.
4.25.0 - 2023-03-14
- Move Alertmanager resource to Helm
- Delete `controller resource alerting/alertmanager`.
- Create alertmanager template in helm.
- Delete the obsolete static scraping configs for alertmanager.
- Add a hook job that changes the ownership labels for alertmanager resource.
4.24.1 - 2023-03-09
- VPA settings: changes in 4.24.0 were wrong, resulting in too low limits.
- Previous logic (4.23.0) was right, and limits were 90% node size.
- Comments have been updated for better understanding
- limit has been reverted to 90% node size
- code for CPU limits has been updated to do the same kind of calculations
- tests have been updated for more meaningful results
4.24.0 - 2023-03-02
- Un-drop `nginx_ingress_controller_request_duration_seconds_bucket` for workload clusters
- Add additional annotations on all `ingress` objects to support DNS record creation via `external-dns`
- VPA settings: changed max memory requests from 90% to 80% of node RAM, so that memory limit is 96% node RAM (avoids crashing node with big prometheis)
- VPA settings: remove useless config for `prometheus-config-reloader` and `rules-configmap-reloader`: now it's only 1 container called `config-reloader`, and default config scales it down just nice!
4.23.0 - 2023-02-28
- Look up PrometheusRules CR in the whole MC only labelled with `application.giantswarm.io/team`
4.22.0 - 2023-02-27
- Removed Prometheus readinessProbe 5mn delay, since there is already a 15mn startupProbe
- Increased ScrapeInterval and EvaluationInterval from 30s to 60s
- pros: half the CPU usage, less disk usage
- cons: up to 30s more delay in alerts, and very short usage peaks get smoothed over 1 minute
- Added `kyverno` namespace to WC & MC default scrape config for cadvisor metrics
4.21.0 - 2023-02-23
- Use resource.Quantity.AsApproximateFloat64() instead of AsInt64(), in order to avoid conversion issues when multiplying CPU, e.g. 3880m
- Use label selector to select only worker nodes for vpa resource to get maxCPU and maxMemory
- List nodes from API only once in VPA resource
- Improve VPA maxAllowedCPU, use 70% of the node allocatable CPU.
- Prevent 'notify' severity alerts from being sent to '#alert-phoenix' Slack channel (too noisy).
- Update getMaxCPU to use 50% of the node allocatable CPU.
- Send SLO (sloth based) notify level alerts to '#alert-phoenix' Slack channel.
4.20.6 - 2023-02-13
- Fix list of targets to scrape or ignore.
4.20.5 - 2023-02-09
- Manage etcd certificates differently between CAPI/Vintage. On Vintage, etcd certificates are bound via a volume. On CAPI, certificates are bound via a secret.
- Pass the missing Provider property to `etcdcertificates.Config`
- Add `.provider.flavor` property in Helm chart.
4.20.4 - 2023-02-07
- Fix certificates created as directories rather than files
4.20.3 - 2023-02-02
- Fix heartbeat update mechanism to prevent leftover labels in OpsGenie.
4.20.2 - 2023-01-18
- Remove proxy support to remote write endpoint consumers.
- Add alertmanagerservicemonitor resource, to scrape metrics from alertmanager.
- Added target and source matchers for stack_failed label.
4.20.1 - 2023-01-17
- Enable `remote-write-receiver` via `EnableFeatures` field added in `CommonPrometheusFields` (schema 0.62.0)
4.20.0 - 2023-01-17
- Upgrade `prometheus` from 2.39.1 to 2.41.0 and `alertmanager` from 0.23.0 to 0.25.0.
4.19.2 - 2023-01-12
- Fix getDefaultAppVersion org namespace
- Bump alpine from 3.17.0 to 3.17.1
4.19.1 - 2023-01-11
- Add proxy support to remote write endpoint consumers.
- Fix node-exporter target discovery
4.19.0 - 2023-01-02
- remotewrite ingress allows bigger requests
- prometheus-agent: increase max samples per send.
⚠️ Warning: updates an immutable secret, will require manual actions at deployment.
4.18.0 - 2022-12-19
- Allow remote write over insecure endpoint certificate.
- Ignore remotewrite feature in kube-system namespace.
4.17.0 - 2022-12-14
- Deploy needed resources for the agent to run on Vintage MCs.
- opsgenie alert templating: list of firing instances
- slack alert templating: list of firing instances
- fix dashboard url
4.16.0 - 2022-12-07
- Change HasPrometheusAgent function to ignore prometheus-agent scraping targets on CAPA and CAPVCD.
- Do not reconcile service monitors in kube-system for CAPA and CAPVCD MCs.
- Change label selector used to discover `PodMonitors` and `ServiceMonitors` to avoid a duplicate scrape introduced in giantswarm/observability-bundle#18.
- README: how to generate test
4.15.0 - 2022-12-05
- Send fewer alerts into Atlas alert slack channels (filtering out heartbeats and inhibitions)
- Opsgenie messages: revert to markdown
- Add capz provider
4.14.0 - 2022-11-30
- Change HasPrometheusAgent function to ignore prometheus-agent scraping targets on gcp.
- Do not reconcile service monitors in kube-system for gcp MCs.
4.13.0 - 2022-11-30
- Improve HasPrometheusAgent function to ignore prometheus-agent scraping targets.
- Bump alpine from 3.16.3 to 3.17.0
- Do not reconcile service monitors in CAPO MCs.
- Remove option to disable PVC creations only used on KVM installations.
- Remove deprecated ingress v1beta1 (only used on kvm).
4.12.0 - 2022-11-25
- Improve opsgenie notification template.
4.11.2 - 2022-11-24
- Remove `vault` targets for CAPI clusters.
4.11.1 - 2022-11-22
- Fix reconciliation issues on vintage MCs.
4.11.0 - 2022-11-18
- Remove the `CLUSTER-prometheus/app-exporter-CLUSTER/0` job in favor of Service Monitor provided by the app.
- Ensure the remote write endpoint configuration is enabled for MCs
- Add Inhibition rule for prometheus-agent to ignore clusters that don't deploy the agent.
- Send non-paging alert to Atlas slack channels.
- Fix a reconciliation bug on CAPI MCs that were reconciled twice.
4.10.0 - 2022-11-15
- Fix scraping of controller-manager and kube-scheduler for vintage MCs.
- Old remotewrite secret
- Removed targets for clusters using the prometheus agent.
4.9.2 - 2022-11-03
- prometheus PSP: allow "projected" volumes
4.9.1 - 2022-10-31
- Change `remotewritesecret` to always delete the secret, as it's not needed anymore in favor of `RemoteWrite.spec.secrets`.
4.9.0 - 2022-10-28
- Added cadvisor scraping for `flux-*` namespaces.
4.8.1 - 2022-10-26
- Reduce label selector to find Prometheus PVC
4.8.0 - 2022-10-26
4.7.1 - 2022-10-21
- Fix alertmanager psp (add projected and downwardAPI)
4.7.0 - 2022-10-20
- Add scraping of Cilium on Management Clusters.
- Add externalLabels for the remote write endpoint configuration.
- Configure working queue config for the remote write receiver (reducing max number of shards but increasing sample capacity).
- Fix remote write endpoint ingress buffer size to avoid the use of temporary buffer file in the ingress controller.
4.6.4 - 2022-10-17
- Move shield route so alerts for shield don't go to opsgenie at all, only to their slack.
4.6.3 - 2022-10-17
- Customize Prometheus volume size via the `monitoring.giantswarm.io/prometheus-volume-size` annotation
- Change remotewrite endpoint secrets namespace to clusterID ns.
- Add `.svc` suffix to the alertmanager target to make PMO work behind a corporate proxy.
- Upgrade to go 1.19
- Bump prometheus-operator to v0.54.0
- Enable remote write receiver.
- Generate prometheus remote write agent secret and config.
- Configure prometheus remote write agent ingress.
- Add Slack channel for Team Shield.
4.6.2 - 2022-09-13
- Fix controller manager port to use default or a value from annotation.
- Fix scheduler port to use default or a value from annotation.
- Bump github.com/labstack/echo to v4.9.0 to fix sonatype-2022-5436 CVE.
4.6.1 - 2022-09-12
- Drop original `label_topology_kubernetes_io_region` & `label_topology_kubernetes_io_zone` labels.
4.6.0 - 2022-09-12
- Relabeling for labels `label_topology_kubernetes_io_region` & `label_topology_kubernetes_io_zone` to `region` & `zone`.
4.5.1 - 2022-08-24
- Fix CAPI MCs being seen as workload cluster.
4.5.0 - 2022-08-24
- Change CAPI version from v1alpha3 to v1beta1.
4.4.1 - 2022-08-19
- Fix Team hydra config.
4.4.0 - 2022-08-17
- Add service priority as a tag in opsgenie alerts.
- Add Team Hydra receiver and route.
- Upgrade go-kit/kit to fix CVE-2022-24450 and CVE-2022-29946.
- Upgrade getsentry/sentry-go to fix CVE-2021-23772, CVE-2021-42576, CVE-2020-26892, and CVE-2021-3127.
4.3.0 - 2022-08-02
- Fix psp names for prometheus and alertmanager.
4.2.0 - 2022-07-28
- Set node-exporter namespace to `kube-system` for CAPI MCs and all WC, and to `monitoring` for vintage MCs.
- Set cert-exporter namespace to `kube-system` for CAPI MCs and all WC, and to `monitoring` for vintage MCs.
- Added `pod_name` as a label to distinguish between multiple etcd pods when running in-cluster (e.g. CAPI).
- Push to `gcp-app-collection`.
- Bump alpine from 3.16.0 to 3.16.1
4.1.0 - 2022-07-20
- Upgrade operatorkit from v7.0.1 to v7.1.0.
- Upgrade github.com/sirupsen/logrus from 1.8.1 to 1.9.0.
- errors_total metric for each controller (comes with operatorkit upgrade).
- Cleanup of RemoteWrite Status (configuredPrometheuses, syncedSecrets) in case a cluster gets deleted.
4.0.1 - 2022-07-14
- Fix creation of new prometheus instance once a cluster has been created
4.0.0 - 2022-07-13
- Implement remotewrite CR logic, in order to configure Prometheus remotewrite config.
- Add HTTP_PROXY in remotewrite config
- Add unit tests for remotewrite resource
- Add Secrets field in the RemoteWrite CR
- Implement sync RemoteWrite Secrets logic
- Adding RemoteWrite.status field to ensure cleanup
- Add psp and service account for prometheus and alertmanager
- Rename `vcd` to `cloud-director`
- Monitor using a podmonitor.
- Fix API server discovery.
- Remove duplicate scrape config targets.
- Fix API server discovery.
- Add `patch` verb for `remoteWrite` resources.
3.8.0 - 2022-06-30
- Add Secrets field in the RemoteWrite CR
3.7.0 - 2022-06-20
This release was created on the release-v3.5.x branch to fix release 3.6.0 (see PR#992).
- Change remote write name to grafana-cloud.
3.6.0 - 2022-06-08
- Add remotewrite controller.
- Deployment of remoteWrite CRD in Helm chart
- Ignore remotewrite field when updating prometheus CR.
- Add `PodMonitor` support for workload cluster Prometheus.
- dependencies updates
- fix build by ignoring CVEs we can't fix for the moment
- Upgrade docker image from Alpine 3.15.1 to Alpine 3.16.0
- remoteWrite CustomResourceDefinition
3.5.0 - 2022-05-17
- Add Cluster Service Priority label.
- Add customer and organization label to metrics.
- Add VCD provider.
3.4.3 - 2022-05-10
- Add 5mn initial delay before performing readiness checks.
3.4.2 - 2022-05-09
- Use 'ip' node label as target to scrape etcd on MCs.
3.4.1 - 2022-05-05
- Fix CAPI cluster detection for legacy Management Clusters.
3.4.0 - 2022-05-04
- Add `PodMonitor` support on management clusters.
3.3.0 - 2022-05-04
- Add `nodepool` label to `kube-state-metrics` metrics.
- Improve CAPI cluster detection.
3.2.0 - 2022-04-13
- Change how MC managed with CAPI are reconciled in PMO (using the cluster CR instead of the Kubernetes Service)
- Fix etcd service discovery for CAPI clusters.
- Remove skip resource
3.1.0 - 2022-04-08
- Add support for etcd-certificates on OpenStack.
- Add context to generic resources.
- Add skip resource, to fix MC duplicated handling.
3.0.0 - 2022-03-28
- Add alertmanager ingress.
- Configure alertmanager and wire prometheus to both legacy and new alertmanagers.
- Remove deprecated matcher types from alertmanager config.
- Changed scrape_interval to 180s and scrape_timeout to 60s for azure-collector.
- Remove old teams from alertmanager template.
- Remove code to manage legacy alertmanager.
2.4.0 - 2022-03-16
- Migrate to rbac/v1 from rbac/v1beta1.
- Change additional scraping config to keep cadvisor metrics for `kong.*` named namespaces
- Do not trail right whitespaces in config.
2.3.0 - 2022-03-04
- Support ingress v1 by default.
- Scrape node-exporter through apiserver proxy.
- Old references to Firecracker and Celestial replaced with Phoenix
2.2.1 - 2022-02-24
- Fix failing `aggregation:prometheus:memory_percentage` due to duplicated series from node exporter.
2.2.0 - 2022-01-20
- Allow overriding the scraping protocol
- Set ingress class name in ingress spec instead of annotation to prepare supporting ingress v1.
2.1.1 - 2022-01-12
- Prevent panic when encountering a different user in the CAPI kubeconfig.
2.1.0 - 2022-01-10
- Added support for OpenStack provider
2.0.0 - 2022-01-03
- Disable cluster-api controller on KVM installations.
- Disable legacy controller on AWS and Azure installations.
- Upgrade to Go 1.17
- Upgrade github.com/giantswarm/microkit v0.2.2 to v1.0.0
- Upgrade github.com/giantswarm/versionbundle v0.2.0 to v1.0.0
- Upgrade github.com/giantswarm/microendpoint v0.2.0 to v1.0.0
- Upgrade github.com/giantswarm/microerror v0.3.0 to v0.4.0
- Upgrade github.com/giantswarm/micrologger v0.5.0 to v0.6.0
- Upgrade github.com/spf13/viper v1.9.0 to v1.10.0
- Upgrade github.com/giantswarm/k8sclient v5.12.0 to v7.0.1
- Upgrade k8s.io/api v0.19.4 to v0.21.4
- Upgrade k8s.io/apiextensions-apiserver v0.19.4 to v0.21.4
- Upgrade sigs.k8s.io/controller-runtime v0.6.4 to v0.8.3
- Upgrade k8s.io/client-go v0.19.4 to v0.21.4
- Upgrade github.com/giantswarm/operatorkit v4.3.1 to v7.0.0
- Upgrade sigs.k8s.io/cluster-api v0.3.19 to v0.4.5
- Upgrade sigs.k8s.io/controller-runtime v0.8.3 to v0.9.7
- Upgrade github.com/prometheus-operator v0.50.0 to v0.52.1
- Remove k8sclient.G8sClient
- Remove versionbundle.Changelog
- Remove github.com/giantswarm/cluster-api v0.3.13-gs
1.53.0 - 2021-12-17
- Renamed `cancel_if_has_no_workers` inhibition to `cancel_if_cluster_has_no_workers` to make it explicit it's about clusters and not node pools.
1.52.1 - 2021-12-14
- Fix relabeling for `__meta_kubernetes_service_annotation_giantswarm_io_monitoring_app_label`
1.52.0 - 2021-12-13
- Add new inhibition for clusters without workers.
- Add relabeling for `__meta_kubernetes_service_annotation_giantswarm_io_monitoring_app_label`
- Upgrade alertmanager to v0.23.0
- Upgrade prometheus-operator v0.49.0 to v0.50.0
- Avoid defaulting of `role` label (containing the role of the k8s node). If data is missing we can't reliably default it.
1.51.2 - 2021-10-28
- Fix finding certificates in organization namespaces.
- Remove cloud limit alerts from customer channel.
1.51.1 - 2021-09-10
- Re-introduce `v1alpha2` scheme.
1.51.0 - 2021-09-09
- Drop `v1alpha2` scheme.
- Reconcile `v1alpha3` cluster.
- Do not create the legacy controller on new installations.
1.50.0 - 2021-08-16
- Upgrade prometheus-operator to v0.49.0
- Fix an issue where prometheus config is empty, due to missing serviceMonitorSelector.
1.49.0 - 2021-08-11
- Add `additionalScrapeConfigs` flag which accepts a string which will be appended to the management cluster scrape config template for installation specific configuration.
1.48.0 - 2021-08-09
- Add receiver and route for `#noise-falco` Slack channel.
1.47.0 - 2021-08-05
- Add the service label in the alert templates for the `ServiceLevelBurnRateTooHigh` alert.
- Update Prometheus to 2.28.1.
- Allow the use of Prometheus Operator Service Monitor for management clusters.
1.46.0 - 2021-07-14
- Use `giantswarm/config` to generate managed configuration.
1.45.0 - 2021-06-28
- Use Grafana Cloud remote-write URL from config instead of hardcoding it, to allow overriding the URL in installations which can't access Grafana Cloud directly.
1.44.2 - 2021-06-24
1.44.1 - 2021-06-24
1.44.0 - 2021-06-23
- Migrate existing rules to https://github.com/giantswarm/prometheus-rules.
1.43.0 - 2021-06-22
- Removed `ServiceLevelBurnRateTicket` alert.
1.42.0 - 2021-06-22
- Removed `NodeExporterDown` alert and use SLO framework to monitor node-exporters.
- Change `ServiceLevelBurnRateTooHigh` and `ServiceLevelBurnRateTooHighTicket` to opt-out for services.
1.41.2 - 2021-06-22
- Fix typo in `AzureClusterCreationFailed` and `AzureClusterUpgradeFailed`
1.41.1 - 2021-06-22
- Add term to not count api-server errors for clusters in transitioning state.
- Business-hours alert for azure clusters not updating in time.
- Increase `ManagementClusterWebhookDurationExceedsTimeout` duration from 5m to 15m.
- Fix CoreDNSMaxHPAReplicasReached alert to not fire in case max and min are equal.
- Business-hours alert for azure clusters not creating in time.
- Remove AlertManager ingress to avoid conflicts with the existing one, until the new AlertManager is ready to replace the one from g8s-prometheus
1.41.0 - 2021-06-17
- Add AppPendingUpdate alert.
- Add scrapeconfig for falco-exporter on management clusters.
- Add Alertmanager managed by Prometheus Operator.
- Add Alertmanager ingress.
- Add WorkloadClusterDeploymentNotSatisfiedLudacris to monitor metrics-server in workload clusters.
- Add CoreDNSMaxHPAReplicasReached business hours alert for when CoreDNS has been scaled to its maximum for too long.
- Lower Prometheus disk space alert from 10% to 5%.
- Change severity of ChartOperatorDown alert to notify.
- Merge all provider certificate.management-cluster.rules into one prometheus rule.
- Fix service name in ingress.
1.40.0 - 2021-06-14
- Lower kubelet SLO from 99.9% to 99%.
1.39.0 - 2021-06-11
- Add ServiceLevelBurnRateTicket alert.
- Add the prometheus log level option
- Add high and low burn rates as recording rules.
- Move managed apps SLO alerts to the service-level format.
- Set HighNumberOfAllocatedSockets to notify not page
- Extract kubelet and api-server SLO targets to their own recording rules.
- Extract kubelet and api-server alerting thresholds to their own recording rules.
- Change ServiceLevelBurnRateTooHigh to use newly created values.
- Fixed the way VPA maxAllowed parameter for memory is calculated so that we avoid going over node memory capacity with the memory limit (maxAllowed is used for request and limit is that multiplied by 1.2).
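The arithmetic behind the VPA maxAllowed fix can be sketched as follows. This is one plausible reading of the entry, not the operator's actual code: the function name is hypothetical, and the only fact taken from the changelog is that the limit is the request multiplied by 1.2.

```python
def max_allowed_memory(node_memory_bytes: int) -> int:
    """Hypothetical helper: largest memory request whose limit fits the node.

    The limit is request * 1.2 (from the changelog entry), so the request
    must not exceed node memory / 1.2. Integer arithmetic (x * 10 // 12)
    avoids floating point rounding.
    """
    return node_memory_bytes * 10 // 12
```

For a node with 12 GiB of memory this gives a request ceiling of 10 GiB, so the derived limit lands exactly on the node capacity instead of above it.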
1.38.0 - 2021-05-28
- Increased alert duration of PrometheusCantCommunicateWithKubernetesAPI.
- Refactor resources to namespace monitoring and alerting code.
- Add cluster-autoscaler to WorkloadClusterContainerIsRestartingTooFrequentlyFirecracker
- Remove tlscleanup and volumeresizehack resources as they are not needed anymore.
1.37.0 - 2021-05-26
- Add HTTP proxy support to Prometheus Remote Write.
1.36.0 - 2021-05-25
- Added alert HighNumberOfAllocatedSockets for High number of allocated sockets
- Added alert HighNumberOfOrphanedSockets for High number of orphaned sockets
- Added alert HighNumberOfTimeWaitSockets for High number of time wait sockets
- Added alert AWSWorkloadClusterNodeTooManyAutoTermination for terminate unhealthy feature.
- Preserve and merge global HTTP client config when generating heartbeat receivers in AlertManager config; this allows it to be used in environments where internet access is only allowed through a proxy.
- Include cluster-api-core-unique-webhook into DeploymentNotSatisfiedFirecracker and DeploymentNotSatisfiedChinaFirecracker.
- Increased duration for PrometheusPersistentVolumeSpaceTooLow alert
- Increased duration for WorkloadClusterEtcdDBSizeTooLarge alert.
- Increased duration for WorkloadClusterEtcdHasNoLeader alert.
- Silence OperatorkitErrorRateTooHighCelestial and OperatorkitCRNotDeletedCelestial outside working hours.
- Update Prometheus to 2.27.1
- Add atlas, and installation tag onto Heartbeats.
- Fix PrometheusFailsToCommunicateWithRemoteStorageAPI alert not firing on china clusters.
1.35.0 - 2021-05-12
- Add alert alertmanager-dashboard not satisfied.
1.34.1 - 2021-05-10
- Inhibit KubeStateMetricsDown and KubeStateMetricsMissing
1.34.0 - 2021-05-06
- Lower the severity to notify for managed app's error budget alerts
- Fix ManagedApp alert
- Fix InhibitionKubeStateMetricsDown not firing long enough
1.33.0 - 2021-04-27
- Raise prometheus cpu limit to 150%.
- Remove PodLimitAlmostReachedAWS and EBSVolumeMountErrors alerts as they were not used.
1.32.1 - 2021-04-22
- Adjust container restarting too often firecracker.
1.32.0 - 2021-04-19
- Add alert for kube-state-metrics missing.
- Tune remote write configuration to avoid loss of data.
- Only fire KubeStateMetricsDown if kube-state-metrics is down.
1.31.0 - 2021-04-16
- Page firecracker for failed cluster transitions.
- Page Firecracker in working hours for restarting containers.
- Add recording rules for kube-mixins
- MatchingNumberOfPrometheusAndCluster now has a runbook, link added to alert.
- Keep the container_network.* metrics as they are needed for the kubernetes mixins dashboards
1.30.0 - 2021-04-12
- Remove Gatekeeper alerts and targets.
1.29.1 - 2021-04-09
- Fix inhibition for MatchingNumberOfPrometheusAndCluster alert by matching it with source from Management Cluster instead of the cluster the alert is firing for.
1.29.0 - 2021-04-09
- Add PrometheusCantCommunicateWithRemoteStorageAPI to alert when Prometheus fails to send samples to Cortex.
- Add workload type and name labels for ManagedAppBasicError* alerts
- Add alert for master node in HA setup down for too long.
- Add aggregation for docker actions.
- Fix prometheus storage alert
- Removed unnecessary whitespace in additional scrape configs.
1.28.0 - 2021-04-01
- Add support to calculate maximum CPU.
- Include cadvisor metrics from the pod in draughtsman namespace.
- Add PrometheusPersistentVolumeSpaceTooLow alert for prometheus storage going over 90 percent.
- Split ManagementClusterCertificateWillExpireInLessThanTwoWeeks alert per provider.
- Increased duration time for flapping WorkloadClusterWebhookDurationExceedsTimeout alert
- Changed prometheus volume space alert ownership to atlas: PersistentVolumeSpaceTooLow -> PrometheusPersistentVolumeSpaceTooLow
- Do not monitor docker for CAPI clusters
- Remove promxy resource.
1.27.4 - 2021-03-26
- Add recording rules for dex activity, creating the metrics
aggregation:dex_requests_status_ok
aggregation:dex_requests_status_4xx
aggregation:dex_requests_status_5xx
1.27.3 - 2021-03-25
- Fix prometheus/common secret token in imported code.
1.27.2 - 2021-03-25
- Fix alertmanager secretToken in imported alertmanager code.
1.27.1 - 2021-03-25
- Remove follow_redirects from alertmanager config
- Update prometheus/[email protected]
- Update prometheus/[email protected]
1.27.0 - 2021-03-24
- Update architect to 2.4.2
- Removed memory-intensive notify only systemd alerts.
1.26.0 - 2021-03-24
- Push to shared-app-collection
- Rename EtcdWorkloadClusterDown to WorkloadClusterEtcdDown
- Increased memory limits by 1.2 factor
- Support vmware for WorkloadClusterEtcdDown
- Add vmware to the list of valid providers
1.25.2 - 2021-03-23
- Disable follow redirect for alertmanager
1.25.1 - 2021-03-22
- Set prometheus minimum CPU to 100m
1.25.0 - 2021-03-22
- Add support for monitoring vmware clusters
- Add support to get the API Server URL for both legacy and CAPI clusters
- Upgrade ingress version to networking.k8s.io/v1beta1
- Fix typo in MatchingNumberOfPrometheusAndCluster alert
- Fix scrapeconfig to use secured ports for kubernetes control plane components for CAPI clusters
- Fix scrapeconfig to proxy all calls through the API Server for CAPI clusters
1.24.8 - 2021-03-18
- Avoid alerting for MatchingNumberOfPrometheusAndCluster when a cluster is being deleted.
1.24.7 - 2021-03-18
- Add support to copy CAPI cluster's certificates
- Add aggregation aggregation:giantswarm:api_auth_giantswarm_successful_attempts_total.
1.24.6 - 2021-03-02
- Fix equality check on the VPA CR to prevent it being overridden and losing its status information on every prometheus-meta-operator deployment.
- Inhibit MatchingNumberOfPrometheusAndCluster when kube-state-metrics is down to prevent bogus pages when kube_pod_container_status_running metric isn't available
1.24.5 - 2021-03-02
- Set the prometheus UI Web page title.
- Add 'app' label to metrics pushed from app-exporter to cortex
1.24.4 - 2021-02-26
- Avoid alerting for ETCD backups outside business hours.
1.24.3 - 2021-02-24
- Use resident_memory when calculating docker memory usage.
1.24.2 - 2021-02-24
- Add 'catalog' label to metrics pushed from app-exporter to cortex
1.24.1 - 2021-02-23
- Fixed syntax error in expressions of ManagementClusterPodPending* alerts
1.24.0 - 2021-02-23
- Add Alert for missing prometheus for a workload cluster
- Add ManagementClusterPodStuckFirecracker and WorkloadClusterPodStuckFirecracker alerts for Firecracker.
- Add ManagementClusterPodStuckCelestial alert for Celestial.
- Send samples per second to cortex
- Move Cluster Autoscaler app installation/upgrade related alerts to team Batman.
1.23.1 - 2021-02-22
- Add TestClusterTooOld for testing installations
- Added Mayu as a scrape target as well as puma's pods
- Apply prometheus rule group (which includes
- Discover ETCD targets through the LoadBalancer using the giantswarm.io/etcd-domain annotation
- Remove PersistentVolumeSpaceTooLow from Workload Clusters.
1.23.0 - 2021-02-17
- Add the sig-customer alerts:
WorkloadClusterCertificateWillExpireInLessThanAMonth
WorkloadClusterCertificateWillExpireMetricMissing
- Add the ludacris alerts:
CadvisorDown
CalicoRestartRateTooHigh
CertOperatorVaultTokenAlmostExpiredMissing
CertOperatorVaultTokenAlmostExpired
ClusterServiceVaultTokenAlmostExpiredMissing
ClusterServiceVaultTokenAlmostExpired
CollidingOperatorsLudacris
CoreDNSCPUUsageTooHigh
CoreDNSDeploymentNotSatisfied
CoreDNSLatencyTooHigh
DeploymentNotSatisfiedLudacris and assign it to rocket DeploymentNotSatisfiedRocket
DockerMemoryUsageTooHigh for both Ludacris and Biscuit
DockerVolumeSpaceTooLow for both Ludacris and Biscuit
EtcdVolumeSpaceTooLow for both Ludacris and Biscuit
JobFailed renamed to ManagementClusterJobFailed
KubeConfigMapCreatedMetricMissing
KubeDaemonSetCreatedMetricMissing
KubeDeploymentCreatedMetricMissing
KubeEndpointCreatedMetricMissing
KubeNamespaceCreatedMetricMissing
KubeNodeCreatedMetricMissing
KubePodCreatedMetricMissing
KubeReplicaSetCreatedMetricMissing
KubeSecretCreatedMetricMissing
KubeServiceCreatedMetricMissing
KubeStateMetricsDown
KubeletConditionBad
KubeletDockerOperationsErrorsTooHigh
KubeletDockerOperationsLatencyTooHigh
KubeletPLEGLatencyTooHigh
KubeletVolumeSpaceTooLow for both Ludacris and Biscuit
LogVolumeSpaceTooLow for both Ludacris and Biscuit
MachineAllocatedFileDescriptorsTooHigh
MachineEntropyTooLow
MachineLoadTooHigh and moved it to biscuit
MachineMemoryUsageTooHigh and moved it to biscuit
ManagementClusterAPIServerAdmissionWebhookErrors
ManagementClusterAPIServerLatencyTooHigh
ManagementClusterContainerIsRestartingTooFrequently
ManagementClusterCriticalSystemdUnitFailed
ManagementClusterDaemonSetNotSatisfiedLudacris
ManagementClusterDaemonSetNotSatisfiedLudacris
ManagementClusterDisabledSystemdUnitActive
ManagementClusterHighNumberSystemdUnits
ManagementClusterNetExporterCPUUsageTooHigh
ManagementClusterSystemdUnitFailed
ManagementClusterWebhookDurationExceedsTimeout
Network95thPercentileLatencyTooHigh
NetworkCheckErrorRateTooHigh
NodeConnTrackAlmostExhausted
NodeExporterCollectorFailed
NodeExporterDeviceError
NodeExporterDown
NodeExporterMissing
NodeHasConstantOOMKills
NodeStateFlappingUnderLoad
OperatorNotReconcilingLudacris
OperatorkitErrorRateTooHighLudacris
PersistentVolumeSpaceTooLow for both Ludacris and Biscuit
ReleaseNotReady
RootVolumeSpaceTooLow for both Ludacris and Biscuit
SYNRetransmissionRateTooHigh
ServiceLevelBurnRateTooHigh
WorkloadClusterAPIServerAdmissionWebhookErrors
WorkloadClusterAPIServerLatencyTooHigh
WorkloadClusterCriticalSystemdUnitFailed
WorkloadClusterDaemonSetNotSatisfiedLudacris
WorkloadClusterDisabledSystemdUnitActive
WorkloadClusterHighNumberSystemdUnits
WorkloadClusterNetExporterCPUUsageTooHigh
WorkloadClusterSystemdUnitFailed
WorkloadClusterWebhookDurationExceedsTimeout
- Migrate and rename EBSVolumeMountErrors to ManagementClusterEBSVolumeMountErrors and WorkloadClusterEBSVolumeMountErrors
- Removing legacy finalizers resource used to remove old custom resource finalizers
1.22.0 - 2021-02-16
- Improved inhibition alert InhibitionClusterStatusUpdating to inhibit alerts 10 minutes after the update has finished to avoid unnecessary pages.
1.21.0 - 2021-02-16
- Split ManagementClusterAppFailed per team
- Add the solution engineer alerts:
AzureQuotaUsageApproachingLimit
NATGatewaysPerVPCApproachingLimit
ServiceUsageApproachingLimit
1.20.0 - 2021-02-16
- Add the rocket alerts:
BackendServerUP
ClockOutOfSyncKVM
CollidingOperatorsRocket
DNSCheckErrorRateTooHighKVM
DNSErrorRateTooHighKVM
EtcdWorkloadClusterDownKVM
IngressExporterDown
KVMManagementClusterDeploymentScaledDownToZero
KVMNetworkErrorRateTooHigh
ManagementClusterCriticalPodMetricMissingKVM
ManagementClusterCriticalPodNotRunningKVM
ManagementClusterMasterNodeMissingRocket
ManagementClusterPodLimitAlmostReachedKVM
ManagementClusterPodPendingFor15Min
MayuSystemdUnitIsNotRunning
NetworkInterfaceLeftoverWithoutCluster
OnpremManagementClusterMissingNodes
OperatorNotReconcilingRocket
OperatorkitCRNotDeletedRocket
OperatorkitErrorRateTooHighRocket
WorkloadClusterCriticalPodMetricMissingKVM
WorkloadClusterCriticalPodNotRunningKVM
WorkloadClusterEndpointIPDown
WorkloadClusterEtcdCommitDurationTooHighKVM
WorkloadClusterEtcdDBSizeTooLargeKVM
WorkloadClusterEtcdHasNoLeaderKVM
WorkloadClusterEtcdNumberOfLeaderChangesTooHighKVM
WorkloadClusterMasterNodeMissingRocket
WorkloadClusterPodLimitAlmostReachedKVM
- Added the firecracker rules to PMO:
AWSClusterCreationFailed
AWSClusterUpdateFailed
AWSManagementClusterDeploymentScaledDownToZero
AWSManagementClusterMissingNodes
AWSNetworkErrorRateTooHigh
ClockOutOfSyncAWS
CloudFormationStackFailed
CloudFormationStackRollback
ClusterAutoscalerAppFailedAWS
ClusterAutoscalerAppNotInstalledAWS
ClusterAutoscalerAppPendingInstallAWS
ClusterAutoscalerAppPendingUpgradeAWS
CollidingOperatorsFirecracker
ContainerIsRestartingTooFrequentlyFirecracker
CredentialdCantReachKubernetes
DNSCheckErrorRateTooHighAWS
DNSErrorRateTooHighAWS
DefaultCredentialsMissing
DeploymentNotSatisfiedChinaFirecracker
DeploymentNotSatisfiedFirecracker
ELBHostsOutOfService
EtcdWorkloadClusterDownAWS
FluentdMemoryHighUtilization
JobHasNotBeenScheduledForTooLong
KiamMetadataFindRoleErrors
ManagementClusterDaemonSetNotSatisfiedChinaFirecracker
ManagementClusterDaemonSetNotSatisfiedFirecracker
OperatorNotReconcilingFirecracker
OperatorkitCRNotDeletedFirecracker
OperatorkitErrorRateTooHighFirecracker
TooManyCredentialsForOrganization
TrustedAdvisorErroring
WorkloadClusterCriticalPodNotRunningAWS
WorkloadClusterCriticalPodMetricMissingAWS
WorkloadClusterDaemonSetNotSatisfiedFirecracker
WorkloadClusterEtcdCommitDurationTooHighAWS
WorkloadClusterEtcdDBSizeTooLargeAWS
WorkloadClusterEtcdHasNoLeaderAWS
WorkloadClusterEtcdNumberOfLeaderChangesTooHighAWS
WorkloadClusterMasterNodeMissingFirecracker
WorkloadClusterPodLimitAlmostReachedAWS
- Splitting NodeIsUnschedulable per team
- Split ContainerIsRestartingTooFrequentlyFirecracker into WorkloadClusterContainerIsRestartingTooFrequentlyFirecracker and ManagementClusterContainerIsRestartingTooFrequentlyFirecracker
- Add the following biscuit alerts to split alerts between workload and management cluster:
ManagementClusterCriticalPodNotRunning
ManagementClusterCriticalPodMetricMissing
ManagementClusterPodLimitAlmostReached
- Move AzureManagementClusterMissingNodes and AWSManagementClusterMissingNodes to team biscuit ManagementClusterMissingNodes
- Move ManagementClusterPodStuckAzure and ManagementClusterPodStuckAWS to team biscuit ManagementClusterPodPendingFor15Min
- Renamed the following alerts:
AzureClusterAutoscalerIsRestartingFrequently -> WorkloadClusterAutoscalerIsRestartingFrequentlyAzure
CriticalPodNotRunningAzure -> WorkloadClusterCriticalPodNotRunningAzure
CriticalPodMetricMissingAzure -> WorkloadClusterCriticalPodMetricMissingAzure
MasterNodeMissingCelestial -> WorkloadClusterMasterNodeMissingCelestial
NodeUnexpectedTaintNodeWithImpairedVolumes -> WorkloadClusterNodeUnexpectedTaintNodeWithImpairedVolumes
PodLimitAlmostReachedAzure -> WorkloadClusterPodLimitAlmostReachedAzure
- Do not page biscuit for a failing prometheus
1.19.2 - 2021-02-12
- Fix incorrect prometheus memory usage recording rules after we migrated to the new monitoring setup
- Use azure-collector instead of azure-operator in AzureClusterCreationFailed alert
- Removing service monitor resource used to clean up unused service monitor CR
1.19.1 - 2021-02-10
- Fix empty prometheus rules in helm template issues for aws and kvm installations
1.19.0 - 2021-02-10
- Added the celestial rules to PMO:
AzureClusterAutoscalerIsRestartingFrequently
AzureClusterCreationFailed
AzureDeploymentIsRunningForTooLong
AzureDeploymentStatusFailed
AzureManagementClusterDeploymentScaledDownToZero
AzureManagementClusterMissingNodes
AzureNetworkErrorRateTooHigh
AzureServicePrincipalExpirationDateUnknown
AzureServicePrincipalExpiresInOneMonth
AzureServicePrincipalExpiresInOneWeek
AzureVMSSRateLimit30MinutesAlmostReached
AzureVMSSRateLimit30MinutesReached
AzureVMSSRateLimit3MinutesAlmostReached
AzureVMSSRateLimit3MinutesReached
ClockOutOfSyncAzure
ClusterAutoscalerAppFailedAzure
ClusterAutoscalerAppNotInstalledAzure
ClusterAutoscalerAppPendingInstallAzure
ClusterAutoscalerAppPendingUpgradeAzure
ClusterWithNoResourceGroup
CollidingOperatorsCelestial
CriticalPodMetricMissingAzure
CriticalPodNotRunningAzure
DNSCheckErrorRateTooHighAzure
DNSErrorRateTooHighAzure
DeploymentNotSatisfiedCelestial
EtcdWorkloadClusterDownAzure
LatestETCDBackup1DayOld
LatestETCDBackup2DaysOld
ManagementClusterNotBackedUp24h
MasterNodeMissingCelestial
OperatorNotReconcilingCelestial
OperatorkitCRNotDeletedCelestial
OperatorkitErrorRateTooHighCelestial
PodLimitAlmostReachedAzure
ManagementClusterPodStuckAzure (renamed from PodStuckAzure)
ReadsRateLimitAlmostReached
VPNConnectionProvisioningStateBad
VPNConnectionStatusBad
WorkloadClusterEtcdCommitDurationTooHighAzure
WorkloadClusterEtcdDBSizeTooLargeAzure
WorkloadClusterEtcdHasNoLeaderAzure
WorkloadClusterEtcdNumberOfLeaderChangesTooHighAzure
WritesRateLimitAlmostReached
ETCDBackupJobFailedOrStuck (renamed from BackupJobFailedOrStuck)
- Added node role label to kubelet metrics as it's needed by MasterNodeMissingCelestial alert
- Removed axolotl from Chinese rules as the installation has been decommissioned
1.18.0 - 2021-02-08
- Added the batman alerts to PMO:
AppExporterDown
AppOperatorNotReady
AppWithoutTeamLabel
CertManagerPodHighMemoryUsage
CertificateSecretWillExpireInLessThanTwoWeeks
ChartOperatorDown
ChartOrphanConfigMap
ChartOrphanSecret
CollidingOperatorsBatman
CordonedAppExpired
DeploymentNotSatisfiedBatman
DeploymentNotSatisfiedChinaBatman
ElasticsearchClusterHealthStatusRed
ElasticsearchClusterHealthStatusYellow
ElasticsearchDataVolumeSpaceTooLow
ElasticsearchHeapUsageWarning
ElasticsearchPendingTasksTooHigh
ExternalDNSCantAccessRegistry
ExternalDNSCantAccessSource
HelmHistorySecretCountTooHigh
IngressControllerDeploymentNotSatisfied
IngressControllerMemoryUsageTooHigh
IngressControllerReplicaSetNumberTooHigh
IngressControllerSSLCertificateWillExpireSoon
IngressControllerServiceHasNoEndpoints
ManagedAppBasicErrorBudgetBurnRateAboveSafeLevel
ManagedAppBasicErrorBudgetBurnRateInLast10mTooHigh
ManagedAppBasicErrorBudgetEstimationWarning
ManagedLoggingElasticsearchClusterDown
ManagedLoggingElasticsearchDataNodesNotSatisfied
ManagementClusterAppFailed
OperatorNotReconcilingBatman
OperatorkitErrorRateTooHighBatman
RepeatedHelmOperation
TillerHistoryConfigMapCountTooHigh
TillerRunningPods
TillerUnreachable
WorkloadClusterAppFailed
WorkloadClusterDeploymentNotSatisfied
WorkloadClusterDeploymentScaledDownToZero
WorkloadClusterManagedDeploymentNotSatisfied
1.17.2 - 2021-02-04
- (internal) Rely on Ingress for OAuth2 proxy to configure TLS for Prometheus domain, as it also configures management of the certificates, instead of creating copies which could break access in case they became out of date.
- Fix incorrect prometheus memory usage recording rule
1.17.1 - 2021-02-02
- Fixed incorrect label in GatekeeperDown alert.
1.17.0 - 2021-02-02
- Added the NoHealthyJumphost alert
- Added the biscuit alerts to PMO:
AppCollectionDeploymentFailed
CalicoNodeMemoryHighUtilization
CrsyncDeploymentNotSatisfied
CrsyncTooManyTagsMissing
DeploymentNotSatisfiedBiscuit
DeploymentNotSatisfiedChinaBiscuit
DraughtsmanRateLimitAlmostReached
EtcdDown
GatekeeperDown
GatekeeperWebhookMissing
KeyPairStorageAlmostFull
ManagementClusterHasLessThanThreeNodes
ManagementClusterCriticalSystemdUnitFailed
ManagementClusterDisabledSystemdUnitActive
ManagementClusterEtcdCommitDurationTooHigh
ManagementClusterEtcdDBSizeTooLarge
ManagementClusterEtcdHasNoLeader
ManagementClusterEtcdNumberOfLeaderChangesTooHigh
ManagementClusterHighNumberSystemdUnits
ManagementClusterPodPending
ManagementClusterSystemdUnitFailed
VaultIsDown
VaultIsSealed
- Renamed control plane and tenant cluster respectively to management cluster and workload cluster. Renamed some alerts:
- ControlPlaneCertificateWillExpireInLessThanTwoWeeks > ManagementClusterCertificateWillExpireInLessThanTwoWeeks
- ControlPlaneDaemonSetNotSatisfiedAtlas > ManagementClusterDaemonSetNotSatisfiedAtlas
- ControlPlaneDaemonSetNotSatisfiedChinaAtlas > ManagementClusterDaemonSetNotSatisfiedChinaAtlas
- PrometheusCantCommunicateWithTenantAPI > PrometheusCantCommunicateWithKubernetesAPI
- Rename ETCDDown alert to ManagementClusterEtcdDown
- Enable alerts only on the corresponding providers
- Fix missing app label on kube-apiserver target
- Fix missing app label on nginx-ingress-controller target
1.16.1 - 2021-01-28
- Fix recording rules to apply them to all prometheuses
1.16.0 - 2021-01-28
- Reenable Remote Write to Cortex
- Trigger final heartbeat before deleting the cluster to clean up opened heartbeat alerts
- Remove webhook from AlertManagerNotificationsFailing alert.
1.15.0 - 2021-01-22
- Use giantswarm/prometheus image
- Fix recording rules creation
- Fix prometheus container image tag to not use latest
- Fix prometheus minimal memory in VPA
1.14.0 - 2021-01-13
- Add inhibition rules.
- Set Prometheus pod max memory usage (via vpa) to 90% of lowest node allocatable memory
- Prometheus monitors itself
- Ignore missing unhealthy prometheus instances in promxy to avoid it from crash looping
- Added the biscuit alerts to PMO:
ControlPlaneCertificateWillExpireInLessThanTwoWeeks
- Add topologySpreadConstraint to evenly spread prometheus pods
- Ignore Slack in AlertManagerNotificationsFailing alert.
- Set heartbeat alert to up for 10mn
- Removed g8s-prometheus target
- Removed alert resource
1.13.0 - 2021-01-05
- Add priority class prometheus and use it for all managed Prometheus pods in order to allow scheduler to evict other pods with lower priority to make space for Prometheus
1.12.0 - 2020-12-02
- Change PrometheusCantCommunicateWithTenantAPI to ignore promxy
- Set prometheus default resources to 100m of CPU and 1Gi of memory
- Reduced number of metrics ingested from nginx-ingress-controller in order to reduce memory requirements of Prometheus.
1.11.0 - 2020-12-01
- Create VerticalPodAutoscaler resource for each Prometheus configuring the VPA to manage Prometheus pod requests and limits to allow dynamic scaling but prevent scheduling and OOM issues.
- Change prometheus affinity from "Prefer" to "Required".
1.10.3 - 2020-11-25
- Fix initial heartbeat ping so that it only triggers on creation.
1.10.2 - 2020-11-25
- Set prometheus cpu requests and limits to 0.25 CPU.
1.10.1 - 2020-11-24
- Set prometheus cpu requests and limits to 1 CPU.
- Set prometheus memory requests and limits to 5Gi.
1.10.0 - 2020-11-20
- Add team atlas alerts (in helm chart).
- Set heartbeat client log level to fatal to avoid polluting our logs.
- Set prometheus to select rules from monitoring namespace.
- Set alert resource to delete PrometheusRules in cluster namespace.
- Fix prometheus targets.
- Fix duplicated scraping of nginx-ingress-controller.
1.9.0 - 2020-11-11
- Add support for Remote Write to Cortex
- Added recording rules
- Add node affinity to prefer not scheduling on master nodes
- Added pipeline tag to Heartbeat alert to be able to see if it affects a stable or testing installation at first glance
- Increase memory request from 100Mi to 5Gi
- Fix kube-state-metrics scraping port on Control Planes.
- Fixed creating of alerts, it was failing due to a typo in template path
1.8.0 - 2020-10-21
- Add pod, container, node and node role labels
- Allow ignoring clusters using the giantswarm.io/monitoring: false label on cluster CRs
- Add monitoring of control plane bastions
- Add heartbeat alert to prometheus
- Create heartbeat in opsgenie
- Route heartbeat alerts to corresponding opsgenie heartbeat
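The opt-out label introduced in this release could be evaluated along the lines below. This is a minimal Python sketch under stated assumptions: only the label key and the value false come from the changelog; the function name, the case-insensitive comparison and the treatment of clusters without the label as monitored are hypothetical.

```python
MONITORING_LABEL = "giantswarm.io/monitoring"  # label key from the changelog entry


def is_monitored(labels: dict) -> bool:
    """Hypothetical check: skip a cluster only when the label is "false".

    Assumption: a cluster CR without the label is still monitored.
    """
    return labels.get(MONITORING_LABEL, "true").lower() != "false"
```

A caller would pass the labels of a cluster CR and skip reconciliation when the function returns False.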
1.7.0 - 2020-10-14
- Add alertmanager config
- Fix a bug where promxy configmap kept growing and led to OutOfMemory issues.
- Fix an issue where prometheus fails to be created due to resource order.
1.6.0 - 2020-10-12
- Set retention size to 90Gi and duration to 2w
- Increased storage to 100Gi
1.5.1 - 2020-10-07
- Fix promxy config marshaling
- Fix promxy config not being updated
1.5.0 - 2020-10-07
- Support for managing Promxy configuration
- Old namespace deleter resource
1.4.0 - 2020-09-25
- Add oauth ingress
- Add tls certificate for ingress
- Add ingress for individual prometheuses
1.3.0 - 2020-09-24
- Scraping of tenant cluster prometheus
- Scraping of control plane prometheus
- Add installation label
- Add labelling schema alert
- Set honor labels to true
- Change control plane namespace to reflect the installation name instead of 'kubernetes'
1.2.0 - 2020-09-03
- Add monitoring label
- Add etcd target for control planes
- Add vault target
- Add gatekeeper target
- Add managed-app target
- Add cert-operator target
- Add bridge-operator target
- Add flannel-operator target
- Add ingress-exporter target
- Add coreDNS target
- Add azure-collector target
- frontend, ingress, and service resources.
- prevented data loss in Cluster resources by always using the correct version of the type as configured in CRDs storage version (#101)
- avoids trying to read dependant objects from the cluster when processing deletion, as they may be gone already and errors here were disrupting cleanup and preventing the finalizer from being removed (#115)
1.1.0 - 2020-08-27
- Scraping of the control plane operators
- aws-operator
- azure-operator
- kvm-operator
- app-operator
- chart-operator
- cluster-operator
- etcd-backup-operator
- node-operator
- release-operator
- organization-operator
- prometheus-meta-operator
- rbac-operator
- draughtsman
- Scraping of the monitoring targets
- app-exporter
- cert-exporter
- vault-exporter
- node-exporter
- net-exporter
- kube-state-metrics
- alertmanager
- grafana
- prometheus
- prometheus-config-controller
- fluentbit
- Scraping of the control plane apis
- tokend
- companyd
- userd
- api
- kubernetesd
- credentiald
- cluster-service
- New control-plane controller, reconciling kubernetes api service (#92)
1.0.1 - 2020-08-25
- Rename controller name and finalizers
1.0.0 - 2020-08-20
- Scraping of kube-proxy (#88)
- Scraping of kube-scheduler (#87)
- Scraping of kube-controller-manager (#85)
- Scraping of etcd (#81)
- Scraping of kubelet (#82)
- Scraping of legacy docker, calico-node, cluster-autoscaler, aws-node and cadvisor (#78)
- Moved prometheus storage from emptyDir to a persistentVolumeClaim
- Remove tenant cluster prometheus limits
- Updated backward incompatible Kubernetes dependencies to v1.18.5.
0.3.2 - 2020-07-24
- Set TC prometheus memory limit to 1Gi (#73)
0.3.1 - 2020-07-17
- Set TC prometheus memory limit to 200Mi
0.3.0 - 2020-07-15
- Scale prometheus-meta-operator replicas back to one.
- Set prometheus request/limits (cpu: 100m, memory: 100Mi)
0.2.1 - 2020-07-01
- Fixed release process
0.2.0 - 2020-06-29
- Add service monitor for nginx-ingress-controller
- Reconcile CAPI (Cluster) and legacy cluster CRs (AWSConfig, AzureConfig, KVMConfig)
- Reduced prometheus server replicas to one (#45)
- Reduced default prometheus-meta-operator replicas to zero as having both this and previous (g8s-prometheus) solutions on at the same time is overloading some control planes
- Removed cortex frontend as it's an optimisation that's not currently needed
- Removed service and ingress resources as they are no longer needed (they were used for the cortex frontend)
- Fix an error during alert update: metadata.resourceVersion: Invalid value
0.1.1 - 2020-05-27
- Change chart namespace from giantswarm to monitoring
0.1.0 - 2020-05-27
- First release.