Atlas general alerting review and migration to mimir #3318

Closed
5 tasks done
Tracked by #3312
QuentinBisson opened this issue Mar 11, 2024 · 9 comments

Comments

@QuentinBisson

QuentinBisson commented Mar 11, 2024

Towards #3312

Atlas is planning to migrate our monitoring setup to Mimir, targeting CAPI only.
This will result in all data being in a single database, instead of the current one-prometheus-per-cluster setup.
Current alerts have to be updated as queries will see all data for all clusters, MC and WC alike, instead of data for one specific cluster at a time.

We have already done a lot of work towards this on the current alerts (removed a lot of deprecated alerts and providers, fixed alerts that clearly were not working, and so on).

While doing so, we discovered a few things about Mimir itself, but also that a chunk of our alerts currently do not work on CAPI (e.g. they are based on vintage-only components, deprecated or missing metrics, and so on).

To ensure proper monitoring in CAPI and with Mimir, Atlas needs your help!

We would kindly ask all teams to help us out with the following use cases, ordered by priority in case they can't all be performed at once.

1. Test and fix your team's alerts and dashboards on CAPI clusters.

A lot of the alerts we have do not work on CAPI (e.g. cluster-autoscaler, ebs-csi and external-dns) simply because they are flagged behind the "aws" provider only, or because they rely on metrics of vintage components (cluster_created|upgraded inhibitions).
The specific alert issues that were identified will be added to the team issues.

  • Atlas finished reviewing their alerts in prometheus-rules and sloth-rules on CAPI - please report it in the umbrella issue

2. Test and fix your team's alerts and dashboards on Mimir.

We currently have Mimir deployed on Golem for testing alerts; it is accessible as a datasource in Grafana.

The current knowns and unknowns about Mimir are being written down here by @giantswarm/team-atlas, but feel free to add what you find.

We request a second round of testing for Mimir because Mimir is inherently different from our vintage monitoring setup.
First, all metrics will be stored in one central place (we are not enabling multi-tenancy yet). This means that:

  • No two alerts should have the same name
  • All aggregations (sum, count, ...) must at the very least have cluster_id, provider and pipeline in the by clause
  • All joins must at the very least have cluster_id, provider and pipeline in the on clause
  • The PromQL absent function should be used carefully: it only returns a result when its argument matches no series at all, and with all clusters in one database, a metric being missing for every cluster in an MC at the same time seems nearly impossible. If you target one cluster in particular, this could work (cluster_type="management_cluster" for example), but we think it's best to rely on other mechanisms.
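
To illustrate the aggregation and join rules above, here is a minimal sketch of two separate queries. The job="example-app" selector is a hypothetical placeholder, and the join assumes the standard kube-state-metrics series kube_pod_status_ready and kube_pod_info carry the cluster_id, provider and pipeline labels; adapt the names to your own rules.

```promql
# Aggregation: keep the per-cluster labels in the by clause so that
# series from different clusters are never counted together.
count by (cluster_id, provider, pipeline) (
  up{job="example-app"} == 0
) > 0

# Join: match on the per-cluster labels in addition to the join keys,
# otherwise pods with the same namespace/name in two clusters collide.
kube_pod_status_ready{condition="true"}
  * on (cluster_id, provider, pipeline, namespace, pod) group_left (node)
  kube_pod_info
```

Without the cluster labels in the by clause, the first query would return a single number for the whole installation, and the alert could no longer say which cluster is affected.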

Second, for Grafana Cloud, we rely a lot on external labels (labels added by Prometheus when metrics leave the cluster, like installation, provider and so on), but data sent from Mimir to Grafana Cloud will not have those external labels anymore, so recording rule aggregations and joins must contain all external labels in the on and by clauses (that was mostly done by Atlas, but please review).

Third, we know that the alerting link (Prometheus query) in Opsgenie and Slack will not work directly because Mimir does not have a UI per se (hint: it's Grafana). The only way to get this source link back is to migrate to Mimir's Alertmanager, but that's a whole other beast that we cannot tackle right now, so we advise you, for each alert, to try to find a dashboard that can be linked to the alert to help with on-call.

  • Atlas finished reviewing their alerts in prometheus-rules and sloth-rules and dashboards on Mimir - please report it in the umbrella issue

3. Move away from the old slo framework towards sloth

We deprecated the old SLO dashboard a while ago in favor of Sloth, but teams are not really using it. We would love it if you could replace the old SLO alerts with Sloth-based ones.

4. Test Grafana Cloud dashboards with golem data

As Mimir data will be sent to Grafana Cloud by a single Prometheus with no external labels, we would like you to ensure that the Grafana Cloud dashboards your team owns work on Golem.

This is currently blocked by #3159

  • Atlas fixed their grafana cloud dashboards - please report it in the umbrella issue

5. Move all apps (latest versions) to service monitors

Towards closing this https://github.com/giantswarm/giantswarm/issues/27145

There are still some leftovers (although not a lot) that still need to use a service monitor. Without this, we will not be able to tear down our Prometheus stack.

This is not that much of a priority but the effort should be rather small and easy to finish so feel free to pick this up

To easily find out what is not monitored via service monitors, you can connect to a MC and WC prometheus using opsctl open -i -a prometheus --workload-cluster=<cluster_id> and check out the targets page.
If they are there (be careful to also check out the workload section), they need a servicemonitor :)

  • Atlas added their missing service monitors - please report it in the umbrella issue

We will of course be here to help you for the migration :)

Further info:

To help you, you can always add alert tests in prometheus-rules, those are great :)

@marieroque

marieroque commented Mar 20, 2024

I'm working on reviewing all rules on CAPI and Mimir installations ONLY.
I'm comparing rules evaluated on grizzly between Mimir datasource and Prometheus datasource.

RECORDING RULES:

  • Managed Prometheus
  • Managed Alertmanager

ALERTING RULES:

  • AlertmanagerNotifyNotificationsFailing
  • AlertmanagerPageNotificationsFailing
  • WorkloadClusterWebhookDurationExceedsTimeoutAtlas
  • AppWithoutTeamAnnotation
  • DeploymentNotSatisfiedAtlas
  • DataDiskPersistentVolumeSpaceTooLow
  • ElasticsearchClusterHealthStatusYellow // no longer supported
  • ElasticsearchClusterHealthStatusRed // no longer supported
  • ElasticsearchDataVolumeSpaceTooLow // no longer supported
  • ElasticsearchPendingTasksTooHigh // no longer supported
  • ElasticsearchHeapUsageWarning // no longer supported
  • FluentbitTooManyErrors
  • FluentbitDropRatio
  • FluentbitDown
  • FluentbitDaemonSetNotSatisfied
  • GrafanaDown
  • GrafanaFolderPermissionsDown
  • GrafanaFolderPermissionsCronjobFails
  • GrafanaPermissionJobHasNotBeenScheduledForTooLong
  • InhibitionClusterIsNotRunningPrometheusAgent // only on Vintage
  • KedaDown
  • KedaScaledObjectErrors
  • KedaWebhookScaledObjectValidationErrors
  • KedaScalerErrors
  • KubeStateMetricsDown
  • KubeStateMetricsSlow
  • KubeStateMetricsNotRetrievingMetrics
  • KubeConfigMapCreatedMetricMissing
  • KubeDaemonSetCreatedMetricMissing
  • KubeDeploymentCreatedMetricMissing
  • KubeEndpointCreatedMetricMissing
  • KubeNamespaceCreatedMetricMissing
  • KubeNodeCreatedMetricMissing
  • KubePodCreatedMetricMissing
  • KubeReplicaSetCreatedMetricMissing
  • KubeSecretCreatedMetricMissing
  • KubeServiceCreatedMetricMissing
  • LokiRequestErrors
  • LokiRequestPanics
  • LokiRingUnhealthy
  • ManagedLoggingElasticsearchDataNodesNotSatisfied // no longer supported
  • ManagedLoggingElasticsearchClusterDown // no longer supported
  • CollidingOperatorsAtlas
  • MimirComponentDown
  • GrafanaAgentForPrometheusRulesDown
  • MimirRulerEventsFailed
  • PrometheusAgentFailing
  • PrometheusAgentFailingInhibition
  • PrometheusAgentShardsMissing
  • PrometheusAgentShardsMissingInhibition
  • Heartbeat
  • MatchingNumberOfPrometheusAndCluster
  • PrometheusMetaOperatorReconcileErrors
  • PrometheusOperatorDown
  • PrometheusOperatorListErrors
  • PrometheusOperatorWatchErrors
  • PrometheusOperatorSyncFailed
  • PrometheusOperatorReconcileErrors
  • PrometheusOperatorNodeLookupErrors
  • PrometheusOperatorNotReady
  • PrometheusOperatorRejectedResources
  • PrometheusCantCommunicateWithKubernetesAPI
  • PrometheusMissingGrafanaCloud
  • PrometheusFailsToCommunicateWithRemoteStorageAPI
  • PrometheusRuleFailures
  • PrometheusJobScrapingFailure
  • PrometheusCriticalJobScrapingFailure
  • PromtailDown
  • PromtailRequestsErrors
  • SilenceOperatorReconcileErrors
  • SilenceOperatorSyncJobHasNotBeenScheduledForTooLong
  • SlothDown
  • ServiceLevelBurnRateTooHigh

SLOTH:

  • AtlasOperatorsReconciliationError

@QuentinBisson
Author

You're missing the ServiceLevelBurnRateTooHigh alert there.

@marieroque

marieroque commented Mar 21, 2024

Using mimirtool on grizzly:

  • Extract metrics used in Mimir ruler:
kubectl port-forward -n mimir svc/mimir-gateway 8080
mimirtool analyze ruler --address=http://localhost:8080 --id=anonymous

and got this file:
metrics-in-ruler.json

  • Analyze those rules on mimir:
[~]$ mimirtool analyze prometheus --address=http://localhost:8080/prometheus --id=anonymous --ruler-metrics-file=metrics-in-ruler.json
INFO[0002] 155801 active series are being used in dashboards 
INFO[0002] Found 3749 metric names                      
INFO[0017] 331303 active series are NOT being used in dashboards 
INFO[0017] 271 in use active series metric count        
INFO[0017] 3473 not in use active series metric count   

and got this file:
prometheus-metrics.json

  • Analyze those rules on prometheus-grizzly:
[~]$ kubectl port-forward -n grizzly-prometheus prometheus-grizzly-0 9090
[~]$ mimirtool analyze prometheus --address=http://localhost:9090/grizzly --ruler-metrics-file=metrics-in-ruler.json --output=prom-grizzly-output.json
INFO[0004] 149485 active series are being used in dashboards 
INFO[0004] Found 3271 metric names                      
INFO[0017] 287311 active series are NOT being used in dashboards 
INFO[0017] 249 in use active series metric count        
INFO[0017] 3021 not in use active series metric count   

and got this file:
prom-grizzly-output.json

@marieroque

I sorted the two outputs above so that I could compare them:

jq -S . prometheus-metrics.json > prometheus-metrics-sort.json
jq -S . prom-grizzly-output.json > prom-grizzly-output-sort.json

prometheus-metrics-sort.json
prom-grizzly-output-sort.json

@QuentinBisson
Author

## Old SLO framework

The old slo framework is used here:

@marieroque

Even if those files give information about the metrics used in rules, they are not useful for checking whether the rules behave the same on Mimir and Prometheus.

For now, I haven't found a way to do it "smartly".
I need to edit the rules to get results that I can compare.
And it's different for each rule. Sometimes I have to remove the > 0 comparison, for instance.
Sometimes I have to remove some label matchers, like status_code=~"5..|failed".
If I don't, the rule evaluation returns no data and I can't compare anything.
😞
I think I'll have to check them one by one manually...
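
As a sketch of the kind of edit described above: the metric example_requests_total and the 5m window are hypothetical placeholders, not taken from an actual rule. The idea is to run the relaxed variant against both the Mimir and the Prometheus datasource and compare the returned series.

```promql
# Alert expression as it might appear in a rule: it returns nothing
# while everything is healthy, so there is nothing to compare.
sum by (cluster_id, installation, provider, pipeline) (
  rate(example_requests_total{status_code=~"5..|failed"}[5m])
) > 0

# Relaxed variant for comparison only: the threshold and the error
# matcher are removed so that both datasources return data.
sum by (cluster_id, installation, provider, pipeline) (
  rate(example_requests_total[5m])
)
```

The relaxed query only shows that both backends see the same series and labels; the original threshold and matchers still need to be checked separately.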

@marieroque

marieroque commented Mar 27, 2024

  • DeploymentNotSatisfiedAtlas is CAPI/Mimir compatible but it's not up to date with all our components
  • FluentbitDown need to add labels installation, provider and pipeline
  • KedaDown need to add labels cluster_id, installation, provider and pipeline
  • LokiRequestPanics need to add labels installation, provider and pipeline
  • LokiRingUnhealthy need to add labels provider and pipeline
  • LokiRequestErrors need to add labels installation, provider and pipeline. Be careful with the rate interval set in that query (1m). For now, it's not a problem because the Loki ServiceMonitor sets the scrape interval to 15s. Scraping every 15s may be more frequent than Loki metrics need, and if we change that value to 1m, we will need to increase the rate interval in that rule to more than 1m.
  • CollidingOperatorsAtlas need to add labels installation, provider, pipeline
  • MimirComponentDown need to add labels installation, provider, pipeline
  • GrafanaAgentForPrometheusRulesDown need to add labels installation, provider, pipeline
  • GrafanaFolderPermissionsDown need to add labels cluster_id, installation, provider, pipeline
  • GrafanaFolderPermissionsCronjobFails need to add labels cluster_id, installation, provider, pipeline
  • GrafanaPermissionJobHasNotBeenScheduledForTooLong need to add labels cluster_id, installation, provider, pipeline
  • PrometheusOperatorListErrors need to add labels installation, provider and pipeline
  • PrometheusOperatorWatchErrors need to add labels installation, provider and pipeline
  • PrometheusOperatorReconcileErrors need to add labels installation, provider and pipeline
  • PrometheusOperatorNotReady need to add labels installation, provider and pipeline
  • PrometheusMissingGrafanaCloud need to add label cluster_type=management_cluster
  • PrometheusJobScrapingFailure need to add labels provider and pipeline
  • PrometheusCriticalJobScrapingFailure need to add labels provider and pipeline
  • PromtailDown need to add labels installation, provider and pipeline
  • PromtailRequestsErrors need to add labels installation, provider and pipeline + rule invalid because we set a rate interval of 1m whereas the scrape interval in the ServiceMonitor is set to 60s. We need to increase the rate interval to 2m at least.
  • PrometheusAgentFailingInhibition not sure about the for property. Currently set to 1m, but scrape interval of Prometheus-agent ServiceMonitor is 60s also. Maybe the inhibition is not properly set and we should set a for=2m instead.
  • PrometheusAgentShardsMissing need to add labels installation, provider and pipeline
  • PrometheusAgentShardsMissingInhibition need to add labels installation, provider and pipeline. Same comment as PrometheusAgentFailingInhibition regarding for
  • SilenceOperatorSyncJobHasNotBeenScheduledForTooLong need to add labels installation, provider and pipeline
  • SlothDown need to add labels installation, provider and pipeline
  • KubeStateMetricsSlow need to add labels installation, provider and pipeline
  • KubeStateMetricsNotRetrievingMetrics need to add labels installation, provider and pipeline
  • KubeStateMetricsDown rewrite rules for Mimir case by removing absent function /!\ add a condition on Mimir enabled:
count by (cluster_id, installation, provider, pipeline) (
  label_replace(up{app="kube-state-metrics", instance=~".*:8080"}, "ip", "$1.$2.$3.$4", "node", "ip-(\\d+)-(\\d+)-(\\d+)-(\\d+).*")
) == 0
or (
  label_replace(
    capi_cluster_status_condition{type="ControlPlaneReady", status="True"},
    "cluster_id",
    "$1",
    "name",
    "(.*)"
  ) == 1
) unless on (cluster_id, customer, installation, pipeline, provider, region) (
  count(up{app="kube-state-metrics", instance=~".*:8080"} == 1) by (cluster_id, customer, installation, pipeline, provider, region)
)
  • MatchingNumberOfPrometheusAndCluster need to add labels installation, provider, pipeline + rewrite rules to fix it because not working
(
  sum by (cluster_id) (
    {__name__=~"cluster_service_cluster_info|cluster_operator_cluster_status", status!="Deleting"}
  ) unless sum by (cluster_id) (
    label_replace(
      kube_pod_container_status_running{container="prometheus", namespace!="{{ .Values.managementCluster.name }}-prometheus", namespace=~".*-prometheus"},
      "cluster_id", "$2", "pod", "(prometheus-)(.+)(-.+)"
    )
  )
) or (
  sum by (name) (
    capi_cluster_status_phase{phase!="Deleting"}
  ) unless sum by (name) (
    label_replace(
      kube_pod_container_status_running{container="prometheus", namespace=~".*-prometheus"},
      "name", "$2", "pod", "(prometheus-)(.+)(-.+)"
    )
  )
)
> 0
  • Managed Prometheus rewrite recording rules:
(up{app="prometheus-operator-app-prometheus",container="prometheus"}*0)+1
(up{app="prometheus-operator-app-prometheus",container="prometheus"}*-1)+1 == 1

=>

(up{app=~"kube-prometheus-stack-prometheus-operator|prometheus-operator-app-prometheus",container=~"kube-prometheus-stack|prometheus"}*0)+1
(up{app=~"kube-prometheus-stack-prometheus-operator|prometheus-operator-app-prometheus",container=~"kube-prometheus-stack|prometheus"}*-1)+1 == 1
  • Managed AlertManager rewrite recording rules:
(up{app="prometheus-operator-app-alertmanager", container="alertmanager"}*0)+1
(up{app="prometheus-operator-app-alertmanager",container="alertmanager"}*-1)+1 == 1

=>

(up{app=~"alertmanager|prometheus-operator-app-alertmanager",container="alertmanager"}*0)+1
(up{app=~"alertmanager|prometheus-operator-app-alertmanager",container="alertmanager"}*-1)+1 == 1
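
The rate interval remarks for PromtailRequestsErrors and LokiRequestErrors above can be sketched as follows. The metric name promtail_request_duration_seconds_count is an assumption about what the rule queries; the status_code matcher and the label set come from this thread.

```promql
# Window equal to the scrape interval (60s): most evaluations find at
# most one sample in the range, and rate() needs at least two samples,
# so the expression usually returns nothing.
rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[1m])

# Window of two scrape intervals, aggregated with the labels that
# must be preserved on Mimir.
sum by (cluster_id, installation, provider, pipeline) (
  rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[2m])
)
```

2m is the minimum suggested above; a wider window gives more tolerance when a single scrape is missed.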

@marieroque

marieroque commented Apr 9, 2024

Status:

@QuentinBisson
Author

All dashboards have been tested on both CAPI and Mimir and they all work.
The angular plugins were mostly removed.

The only remaining issues are:

As we have separate issues for those, I'm closing here

@github-project-automation github-project-automation bot moved this from Inbox 📥 to Done ✅ in Roadmap May 30, 2024