
Alertmanager setup as a gossip cluster sends duplicate alert notifications during retries #4108

gmhegde86 opened this issue Nov 5, 2024 · 0 comments

What did you do?
Configured two alertmanagers (independent services) as an HA gossip cluster and used a webhook receiver to receive notifications. The webhook had an issue for a brief period: it returned 500-series (retryable) errors to the alertmanagers for fired alerts, and both alertmanagers of the HA cluster kept retrying to send the notifications.
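
For reference, a minimal sketch (not from the original report; the path and failure window are assumptions) of a webhook endpoint that reproduces this situation: it answers with a retryable 500 for a short window after startup and with 200 afterwards, so both alertmanager replicas keep retrying until it "recovers".

// Mock webhook: fails with 500 for the first minute, then accepts with 200.
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	start := time.Now()
	failFor := 60 * time.Second // assumed outage window

	http.HandleFunc("/monitor/alerts", func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		if time.Since(start) < failFor {
			// Retryable error: both Alertmanager replicas keep retrying this group.
			log.Printf("returning 500 for payload: %s", body)
			http.Error(w, `{"errors": [{"reason": "Service unavailable"}]}`, http.StatusInternalServerError)
			return
		}
		log.Printf("accepted payload: %s", body)
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}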
What did you expect to see?
When the webhook issue was fixed and it was operational again, the expectation was to receive a single notification for all the alerts that were in the firing state.
What did you see instead? Under which circumstances?
Each alertmanager of the HA cluster sent its own notification, so duplicate alerts were received.
Environment
Kubernetes

  • System information:

    NA

  • Alertmanager version:

    0.27.0

  • Prometheus version:

    NA

  • Alertmanager configuration file:

    global:
      resolve_timeout: 12h
    route:
      group_by: [alert_id, cluster_id]
      receiver: dbaas-alerting-webhook
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
    receivers:
    - name: dbaas-alerting-webhook
      webhook_configs:
      - url: https://aclp-alerting.xxxx.com/monitor/alerts
        send_resolved: true
        http_config:
          tls_config:
            ca_file: /etc/vm/secrets/cloud-observability-ca/ca.crt
            cert_file: /etc/vm/secrets/vmalertmanager-tls/tls.crt
            key_file: /etc/vm/secrets/vmalertmanager-tls/tls.key

  • Prometheus configuration file:
    NA

  • Logs:

Alertmanager 1 of HA
ts=2024-11-04T15:44:41.278Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert="High CPU Usage - Plan Dedicated[8c67d04][active]"
ts=2024-11-04T15:44:41.615Z caller=notify.go:848 level=warn component=dispatcher receiver=dbaas-alerting-webhook integration=webhook[0] aggrGroup="{}:{alert_id=\"103\", cluster_id=\"188020\"}" msg="Notify attempt failed, will retry later" attempts=1 err="unexpected status code 500: https://aclp-alerting.iad3.us.prod.linode.com/monitor/alerts: {\"errors\": [{\"reason\": \"Service unavailable [1.10]\"}]}"
ts=2024-11-04T15:45:24.610Z caller=notify.go:860 level=info component=dispatcher receiver=dbaas-alerting-webhook integration=webhook[0] aggrGroup="{}:{alert_id=\"103\", cluster_id=\"188020\"}" msg="Notify success" attempts=10 duration=1.182929464s

Alertmanager 2 of HA
ts=2024-11-04T15:44:41.281Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert="High CPU Usage - Plan Dedicated[8c67d04][active]"
ts=2024-11-04T15:45:10.364Z caller=cluster.go:341 level=debug component=cluster memberlist="2024/11/04 15:45:10 [DEBUG] memberlist: Stream connection from=10.2.0.43:58784\n"
ts=2024-11-04T15:45:24.867Z caller=nflog.go:533 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alert_id=\\\"103\\\", cluster_id=\\\"188020\\\"}\" receiver:<group_name:\"dbaas-alerting-webhook\" integration:\"webhook\" > timestamp:<seconds:1730735124 nanos:610174464 > firing_alerts:11267836725140231328 > expires_at:<seconds:1730821524 nanos:610174464 > "
ts=2024-11-04T15:45:28.172Z caller=notify.go:860 level=info component=dispatcher receiver=dbaas-alerting-webhook integration=webhook[0] aggrGroup="{}:{alert_id=\"103\", cluster_id=\"188020\"}" msg="Notify success" attempts=12 duration=956.231897ms
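
For completeness, a receiver-side sketch (not part of the original report; field names follow the documented Alertmanager webhook payload, unused fields omitted) that counts deliveries per groupKey. With the logs above, the group {alert_id="103", cluster_id="188020"} is delivered once by each replica, so its counter reaches 2 and the duplicate becomes visible on the receiver side.

// Receiver that decodes the webhook payload and flags repeated groupKey deliveries.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"
)

// Subset of the Alertmanager webhook payload.
type webhookMessage struct {
	Version  string `json:"version"`
	GroupKey string `json:"groupKey"`
	Status   string `json:"status"`
	Receiver string `json:"receiver"`
	Alerts   []struct {
		Status      string            `json:"status"`
		Labels      map[string]string `json:"labels"`
		Fingerprint string            `json:"fingerprint"`
	} `json:"alerts"`
}

func main() {
	var mu sync.Mutex
	seen := map[string]int{} // groupKey -> delivery count

	http.HandleFunc("/monitor/alerts", func(w http.ResponseWriter, r *http.Request) {
		var msg webhookMessage
		if err := json.NewDecoder(r.Body).Decode(&msg); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		mu.Lock()
		seen[msg.GroupKey]++
		n := seen[msg.GroupKey]
		mu.Unlock()
		if n > 1 {
			log.Printf("duplicate notification %d for group %q", n, msg.GroupKey)
		}
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}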