
Alertmanager setup as a gossip cluster sends duplicate alert notifications during retries #4108

gmhegde86 opened this issue Nov 5, 2024 · 0 comments

What did you do?
Configured two alertmanagers (independent services) as an HA gossip cluster and used a webhook receiver to receive notifications. The webhook had an issue for a brief period: it returned 500-series (retryable) errors to the alertmanagers for fired alerts, and both alertmanagers of the HA cluster kept retrying to send the notifications.
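
For reference, a minimal sketch (not from the original report; the path and failure window are assumptions) of a webhook endpoint that reproduces this situation: it answers with a retryable 500 for a short window after startup and with 200 afterwards, so both alertmanager replicas keep retrying until it "recovers".

// Mock webhook: fails with 500 for the first minute, then accepts with 200.
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	start := time.Now()
	failFor := 60 * time.Second // assumed outage window

	http.HandleFunc("/monitor/alerts", func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		if time.Since(start) < failFor {
			// Retryable error: both Alertmanager replicas keep retrying this group.
			log.Printf("returning 500 for payload: %s", body)
			http.Error(w, `{"errors": [{"reason": "Service unavailable"}]}`, http.StatusInternalServerError)
			return
		}
		log.Printf("accepted payload: %s", body)
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}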
What did you expect to see?
When the webhook issue was fixed and it was operational again, the expectation was to receive a single notification for all the alerts that were in the firing state.
What did you see instead? Under which circumstances?
Each alertmanager of the HA cluster sent its own notification, so duplicate alerts were received.
Environment
Kubernetes

  • System information:

    NA

  • Alertmanager version:

    0.27.0

  • Prometheus version:

    NA

  • Alertmanager configuration file:

    global:
      resolve_timeout: 12h
    route:
      group_by: [alert_id, cluster_id]
      receiver: dbaas-alerting-webhook
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
    receivers:
    - name: dbaas-alerting-webhook
      webhook_configs:
      - url: https://aclp-alerting.xxxx.com/monitor/alerts
        send_resolved: true
        http_config:
          tls_config:
            ca_file: /etc/vm/secrets/cloud-observability-ca/ca.crt
            cert_file: /etc/vm/secrets/vmalertmanager-tls/tls.crt
            key_file: /etc/vm/secrets/vmalertmanager-tls/tls.key

  • Prometheus configuration file:
    NA

  • Logs:

Alertmanager 1 of HA
ts=2024-11-04T15:44:41.278Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert="High CPU Usage - Plan Dedicated[8c67d04][active]"
ts=2024-11-04T15:44:41.615Z caller=notify.go:848 level=warn component=dispatcher receiver=dbaas-alerting-webhook integration=webhook[0] aggrGroup="{}:{alert_id=\"103\", cluster_id=\"188020\"}" msg="Notify attempt failed, will retry later" attempts=1 err="unexpected status code 500: https://aclp-alerting.iad3.us.prod.linode.com/monitor/alerts: {\"errors\": [{\"reason\": \"Service unavailable [1.10]\"}]}"
ts=2024-11-04T15:45:24.610Z caller=notify.go:860 level=info component=dispatcher receiver=dbaas-alerting-webhook integration=webhook[0] aggrGroup="{}:{alert_id=\"103\", cluster_id=\"188020\"}" msg="Notify success" attempts=10 duration=1.182929464s

Alertmanager 2 of HA
ts=2024-11-04T15:44:41.281Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert="High CPU Usage - Plan Dedicated[8c67d04][active]"
ts=2024-11-04T15:45:10.364Z caller=cluster.go:341 level=debug component=cluster memberlist="2024/11/04 15:45:10 [DEBUG] memberlist: Stream connection from=10.2.0.43:58784\n"
ts=2024-11-04T15:45:24.867Z caller=nflog.go:533 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alert_id=\\\"103\\\", cluster_id=\\\"188020\\\"}\" receiver:<group_name:\"dbaas-alerting-webhook\" integration:\"webhook\" > timestamp:<seconds:1730735124 nanos:610174464 > firing_alerts:11267836725140231328 > expires_at:<seconds:1730821524 nanos:610174464 > "
ts=2024-11-04T15:45:28.172Z caller=notify.go:860 level=info component=dispatcher receiver=dbaas-alerting-webhook integration=webhook[0] aggrGroup="{}:{alert_id=\"103\", cluster_id=\"188020\"}" msg="Notify success" attempts=12 duration=956.231897ms
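
For completeness, a receiver-side sketch (not part of the original report; field names follow the documented Alertmanager webhook payload, unused fields omitted) that counts deliveries per groupKey. With the logs above, the group {alert_id="103", cluster_id="188020"} is delivered once by each replica, so its counter reaches 2 and the duplicate becomes visible on the receiver side.

// Receiver that decodes the webhook payload and flags repeated groupKey deliveries.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"
)

// Subset of the Alertmanager webhook payload.
type webhookMessage struct {
	Version  string `json:"version"`
	GroupKey string `json:"groupKey"`
	Status   string `json:"status"`
	Receiver string `json:"receiver"`
	Alerts   []struct {
		Status      string            `json:"status"`
		Labels      map[string]string `json:"labels"`
		Fingerprint string            `json:"fingerprint"`
	} `json:"alerts"`
}

func main() {
	var mu sync.Mutex
	seen := map[string]int{} // groupKey -> delivery count

	http.HandleFunc("/monitor/alerts", func(w http.ResponseWriter, r *http.Request) {
		var msg webhookMessage
		if err := json.NewDecoder(r.Body).Decode(&msg); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		mu.Lock()
		seen[msg.GroupKey]++
		n := seen[msg.GroupKey]
		mu.Unlock()
		if n > 1 {
			log.Printf("duplicate notification %d for group %q", n, msg.GroupKey)
		}
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}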