alertmanager reliably crashes on every boot #4130

calestyo · 2024-11-19T22:55:09Z

What did you do?

On every boot, alertmanager exits with an error (but it works when started later).

Environment

System information:

Debian bookworm, Linux 6.1.0-27-amd64 x86_64

Alertmanager version:

alertmanager, version 0.25.0 (branch: debian/sid, revision: 0.25.0-1+b4)
  build user:       [email protected]
  build date:       20230409-09:50:43
  go version:       go1.19.8
  platform:         linux/amd64

Prometheus version:

prometheus, version 2.42.0+ds (branch: debian/sid, revision: 2.42.0+ds-5+b5)
  build user:       [email protected]
  build date:       20230518-08:49:35
  go version:       go1.19.8
  platform:         linux/amd64

Logs:

Nov 19 23:43:43 lcg-lrz-monitor systemd[1]: Started prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:43 lcg-lrz-monitor prometheus-alertmanager[616]: ts=2024-11-19T22:43:43.564Z caller=cluster.go:178 level=warn component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Nov 19 23:43:43 lcg-lrz-monitor prometheus-alertmanager[616]: ts=2024-11-19T22:43:43.569Z caller=main.go:273 level=error msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided"
Nov 19 23:43:43 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Nov 19 23:43:43 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Failed with result 'exit-code'.
Nov 19 23:43:43 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Scheduled restart job, restart counter is at 1.
Nov 19 23:43:43 lcg-lrz-monitor systemd[1]: Stopped prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:43 lcg-lrz-monitor systemd[1]: Started prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:43 lcg-lrz-monitor prometheus-alertmanager[1141]: ts=2024-11-19T22:43:43.845Z caller=cluster.go:178 level=warn component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Nov 19 23:43:43 lcg-lrz-monitor prometheus-alertmanager[1141]: ts=2024-11-19T22:43:43.846Z caller=main.go:273 level=error msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided"
Nov 19 23:43:43 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Nov 19 23:43:43 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Failed with result 'exit-code'.
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Scheduled restart job, restart counter is at 2.
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: Stopped prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: Started prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:44 lcg-lrz-monitor prometheus-alertmanager[1393]: ts=2024-11-19T22:43:44.249Z caller=cluster.go:178 level=warn component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Nov 19 23:43:44 lcg-lrz-monitor prometheus-alertmanager[1393]: ts=2024-11-19T22:43:44.250Z caller=main.go:273 level=error msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided"
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Failed with result 'exit-code'.
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Scheduled restart job, restart counter is at 3.
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: Stopped prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: Started prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:44 lcg-lrz-monitor prometheus-alertmanager[1914]: ts=2024-11-19T22:43:44.665Z caller=cluster.go:178 level=warn component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Nov 19 23:43:44 lcg-lrz-monitor prometheus-alertmanager[1914]: ts=2024-11-19T22:43:44.666Z caller=main.go:273 level=error msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided"
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Failed with result 'exit-code'.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Scheduled restart job, restart counter is at 4.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: Stopped prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: Started prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:45 lcg-lrz-monitor prometheus-alertmanager[2017]: ts=2024-11-19T22:43:45.093Z caller=cluster.go:178 level=warn component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Nov 19 23:43:45 lcg-lrz-monitor prometheus-alertmanager[2017]: ts=2024-11-19T22:43:45.094Z caller=main.go:273 level=error msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided"
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Failed with result 'exit-code'.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Scheduled restart job, restart counter is at 5.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: Stopped prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Start request repeated too quickly.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Failed with result 'exit-code'.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: Failed to start prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Triggering OnFailure= dependencies.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: Starting [email protected] - send systemd unit status via email to `root`...
Nov 19 23:43:47 lcg-lrz-monitor systemd[1]: [email protected]: Deactivated successfully.
Nov 19 23:43:47 lcg-lrz-monitor systemd[1]: Finished [email protected] - send systemd unit status via email to `root`.
Nov 19 23:44:11 lcg-lrz-monitor prometheus[621]: ts=2024-11-19T22:44:11.659Z caller=notifier.go:532 level=error component=notifier alertmanager=http://localhost:9093/api/v2/alerts count=2 msg="Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused"
Nov 19 23:44:11 lcg-lrz-monitor prometheus[621]: ts=2024-11-19T22:44:11.660Z caller=notifier.go:532 level=error component=notifier alertmanager=http://localhost:9093/api/v2/alerts count=2 msg="Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused"
Nov 19 23:44:53 lcg-lrz-monitor prometheus[621]: ts=2024-11-19T22:44:53.930Z caller=notifier.go:532 level=error component=notifier alertmanager=http://localhost:9093/api/v2/alerts count=9 msg="Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused"
Nov 19 23:45:21 lcg-lrz-monitor prometheus[621]: ts=2024-11-19T22:45:21.648Z caller=notifier.go:532 level=error component=notifier alertmanager=http://localhost:9093/api/v2/alerts count=2 msg="Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused"
Nov 19 23:45:21 lcg-lrz-monitor prometheus[621]: ts=2024-11-19T22:45:21.650Z caller=notifier.go:532 level=error component=notifier alertmanager=http://localhost:9093/api/v2/alerts count=2 msg="Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused"
Nov 19 23:46:31 lcg-lrz-monitor prometheus[621]: ts=2024-11-19T22:46:31.648Z caller=notifier.go:532 level=error component=notifier alertmanager=http://localhost:9093/api/v2/alerts count=2 msg="Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused"
Nov 19 23:46:31 lcg-lrz-monitor prometheus[621]: ts=2024-11-19T22:46:31.650Z caller=notifier.go:532 level=error component=notifier alertmanager=http://localhost:9093/api/v2/alerts count=2 msg="Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused"
Nov 19 23:46:53 lcg-lrz-monitor prometheus[621]: ts=2024-11-19T22:46:53.930Z caller=notifier.go:532 level=error component=notifier alertmanager=http://localhost:9093/api/v2/alerts count=9 msg="Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused"

So it seems there are a number of errors involved here:

couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided
create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided

I'm not really sure what it means by "private IP" (or why it should need one). A normal UNIX daemon typically binds to the wildcard address if no specific bind addresses are given.

Also, the service is pulled in by multi-user.target, and by that time all networking (including the statically configured global IPs) has long been up.

Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused
These are also a bit strange, IMO, at least if they'd cause the daemon to exit. I mean, it should be clear that prometheus may not be running yet.

Anyway, if I start the daemon a bit later, it works just fine.

Cheers,
Chris,


grobinson-grafana · 2024-11-20T09:47:37Z

Hi!

Alertmanager is crashing because it cannot get the information it needs to initialize the cluster for high availability mode. The error means it cannot find a private IP address for the system, which it advertises to other Alertmanagers in the same cluster.
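
To illustrate what "private IP" means here, the following is a rough Python sketch, not Alertmanager's actual code (Alertmanager is written in Go and delegates address detection to a library). It assumes "private" means an RFC 1918 IPv4 address; the real check may accept other ranges as well. The function names and sample addresses are hypothetical and chosen only for illustration.

```python
import ipaddress

# RFC 1918 private IPv4 ranges. Assumption: the real detection logic
# may consider additional ranges; this list is only an approximation.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(addr):
    """Return True if addr is an IPv4 address inside an RFC 1918 range."""
    ip = ipaddress.ip_address(addr)
    return ip.version == 4 and any(ip in net for net in RFC1918_NETWORKS)


def first_private(addrs):
    """Return the first RFC 1918 address in addrs, or None if there is none."""
    for addr in addrs:
        if is_rfc1918(addr):
            return addr
    return None


# A host with only global addresses (documentation ranges used as stand-ins)
# has nothing to advertise under this rule.
print(first_private(["203.0.113.7", "2001:db8::1"]))  # None
print(first_private(["203.0.113.7", "192.168.1.5"]))  # 192.168.1.5
```

Under this assumption, a host that only has statically configured global addresses would yield no candidate, which would match the "no private IP found" message in the logs.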

If your system does not have a private IP address, and/or you do not need high availability mode, you can disable it with the following argument:

--cluster.listen-address=""

> Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused

Your Prometheus can't send alerts to Alertmanager because Alertmanager is crash looping, so nothing is listening on port 9093.

calestyo · 2024-12-09T00:10:00Z

But:

Why does it even try HA mode if I haven't configured any other instances? Or does it try to auto-detect them?
More importantly, and as I've said before: by the time prometheus-alertmanager is started during boot, all network interfaces have already been up for quite a while, and it does work when I start it manually later (by which time no further interfaces or addresses have been added).

grobinson-grafana · 2024-12-09T10:54:28Z

> But:
>
> Why does it even try HA mode if I haven't configured any other instances? Or does it try to auto-detect them?

That's just the default behavior. I'm not sure it makes sense either, but that's how it has been for as long as I can remember. Someone else might know if there is a reason for this.

> More importantly, and as I've said before: by the time prometheus-alertmanager is started during boot, all network interfaces have already been up for quite a while, and it does work when I start it manually later (by which time no further interfaces or addresses have been added).

Does the interface have an IP address at this time? Is it possible it takes a while for an IP address to be assigned via DHCP, meaning the host has an "up" network interface that takes a while to become connected/established? That could explain why it doesn't work immediately on startup but later when you try it manually. Does it also work with systemd if you start it some time later after startup?
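
One way to check this would be to record which addresses are assigned at the moment the service starts. The sketch below is only a diagnostic idea, not something Alertmanager provides: it parses the JSON output of `ip -j addr show` (iproute2) and lists the addresses with global scope. The function names and the sample data are hypothetical.

```python
import json
import subprocess


def global_addresses(ip_json):
    """Return (ifname, address) pairs with scope "global".

    ip_json is the text printed by `ip -j addr show`.
    """
    found = []
    for link in json.loads(ip_json):
        for info in link.get("addr_info", []):
            if info.get("scope") == "global":
                found.append((link.get("ifname"), info.get("local")))
    return found


def current_global_addresses():
    """Run `ip -j addr show` on this host and parse its output."""
    result = subprocess.run(
        ["ip", "-j", "addr", "show"],
        capture_output=True, text=True, check=True,
    )
    return global_addresses(result.stdout)


# Made-up sample in the shape `ip -j addr show` produces.
sample = json.dumps([
    {"ifname": "lo", "addr_info": [
        {"family": "inet", "local": "127.0.0.1", "scope": "host"}]},
    {"ifname": "eth0", "addr_info": [
        {"family": "inet", "local": "203.0.113.7", "scope": "global"}]},
])
print(global_addresses(sample))  # [('eth0', '203.0.113.7')]
```

Calling `current_global_addresses()` early in boot and again later, then comparing the two results, would show whether the set of assigned addresses really is the same at both times.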
