alertmanager reliably crashes on every boot #4130
Comments
Hi! Alertmanager is crashing because it cannot get the information it needs to initialize the cluster for high availability mode. The error means it cannot find a private IP address for the system, which it advertises to other Alertmanagers in the same cluster. If your system does not have a private IP address, and/or you do not need high availability mode, you can disable it with the following argument:
Your Prometheus can't send alerts to Alertmanager because Alertmanager is crash looping.
But:
That's just the default behavior. I'm not sure it makes sense either, but that's how it has been for as long as I can remember it. Someone else might know if there is a reason for this.
Does the interface have an IP address at this time? Is it possible that DHCP takes a while to assign one, so that the host has an "up" network interface that is not yet connected? That could explain why it fails immediately on startup but works later when you start it manually. Does it also work under systemd if you start it some time after boot?
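One way to test the timing theory is to poll for a condition during boot and record how long it takes to become true. The following is only a rough sketch: `wait_until` is a hypothetical helper, not part of Alertmanager or systemd, and the `check` callable is whatever probe you choose (for example, "does the interface have an address yet").

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 30.0,
               interval: float = 0.5) -> bool:
    """Poll check() until it returns True or the timeout expires.

    Hypothetical diagnostic helper; returns True if the condition was
    met in time and False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

If the probe only succeeds some seconds after the service is started, that would support the idea that the address is not yet available at that point.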
What did you do?
On every boot, alertmanager errors out (but works when started later).
Environment
Debian bookworm, Linux 6.1.0-27-amd64 x86_64
So it seems there are a number of errors involved here:
couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided
create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided
I'm not really sure what it means by "private IP" (or why it should need one). Any normal UNIX daemon typically binds to the wildcard address if no specific bind addresses are given.
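For reference, "private IP" in messages like these usually means an address from the RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). The sketch below only illustrates that notion. It is an assumption that this matches what the clustering code checks, and the function names are made up for the example.

```python
import ipaddress

# RFC 1918 private IPv4 ranges. Assumption: the real selection logic in
# the clustering library may accept a different or wider set of ranges.
RFC1918_NETWORKS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(addr: str) -> bool:
    """Return True if addr is an IPv4 address inside an RFC 1918 range."""
    ip = ipaddress.ip_address(addr)
    return ip.version == 4 and any(ip in net for net in RFC1918_NETWORKS)


def first_private(addrs):
    """Return the first RFC 1918 address in addrs, or None if there is none."""
    for addr in addrs:
        if is_rfc1918(addr):
            return addr
    return None
```

Under this definition, a host whose interfaces carry only global addresses would yield `None`, which is the situation the "no private IP found" message describes.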
Also, the service is pulled in by multi-user.target, and at that time all networking (including the statically configured global IPs) has long been up.
Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused
These are also a bit strange, IMO, at least if they'd cause the daemon to exit. I mean, it should be clear that prometheus may not yet be running.
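A "connection refused" error like the one above simply means nothing is accepting TCP connections on that port at that moment. A minimal way to check this by hand, assuming the default port 9093 from the log line (the helper name is hypothetical):

```python
import socket


def accepts_connections(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and resolution failures.
        return False
```

For example, `accepts_connections("localhost", 9093)` would be expected to return `False` while the daemon is down and `True` once it is listening.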
Anyway, if I start the daemon a bit later, it works just fine.
Cheers,
Chris