Icinga DB crashes when it can't reach the PostgreSQL database #620
Comments
There's an issue in our database client driver: lib/pq#478. It would be interesting to know which error is hiding in there, so this will need further debugging. I've also found an interesting comment in another project that claims (I haven't verified this myself, but it sounds plausible) that the PostgreSQL wire protocol allows multiple error messages to be sent: cockroachdb/cockroach#24149 (comment). So it seems quite possible that this is an issue with the client, not the database cluster.
According to the PostgreSQL documentation:
The Icinga DB daemon will always write to the database, which obviously can't work on a hot standby server. Could it be that connections are routed to the wrong server? Or could all servers be running as hot standbys for a short time during a failover operation?
I have tested what happens during a switch-over:
So I'm not sure what to test/debug further. But I think the Icinga DB daemon should not crash when it can't connect to the database, but rather cache its queries, like the IDO feature does/did.
For the "cannot use serializable mode in a hot standby" part of the issue, we could treat this error as retryable (similar to how we already do for other "server is starting/shutting down" errors). However, from a quick online search, it looks like error code 0A000 (feature_not_supported) is used for that, which is quite generic, so we would have to compare error messages (always ugly, they had better never change).
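A minimal sketch of such a check, not taken from the Icinga DB code. The `pgError` type is a hypothetical stand-in for the driver's error type (with lib/pq this would be `*pq.Error` and its `Code` field), and `isRetryable` is an assumed helper name. Because 0A000 alone is too generic, the sketch also compares the message text, which is exactly the fragile part:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// pgError is a hypothetical stand-in for the driver's error type.
// With lib/pq the equivalent would be *pq.Error and its Code field.
type pgError struct {
	Code    string // five-character SQLSTATE
	Message string
}

func (e *pgError) Error() string { return "pq: " + e.Message }

// isRetryable reports whether err is the hot-standby error.
// SQLSTATE 0A000 (feature_not_supported) is shared by many unrelated
// errors, so the message text is compared as well.
func isRetryable(err error) bool {
	var pe *pgError
	if !errors.As(err, &pe) {
		return false
	}
	return pe.Code == "0A000" &&
		strings.Contains(pe.Message, "cannot use serializable mode in a hot standby")
}

func main() {
	standby := &pgError{Code: "0A000", Message: "cannot use serializable mode in a hot standby"}
	other := &pgError{Code: "0A000", Message: "some other unsupported feature"}

	// The wrapped hot-standby error is still recognised via errors.As.
	fmt.Println(isRetryable(fmt.Errorf("can't begin transaction: %w", standby)))
	// Same SQLSTATE, different message: not retried.
	fmt.Println(isRetryable(other))
}
```

Matching on the code first keeps the string comparison limited to the one SQLSTATE where it is unavoidable.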
Follow up to this: Due to the changes mentioned in the linked lib/pq issue the error message is now more specific:
We have now "tuned" our haproxy config so that it switches faster after detecting a non-functioning connection.
Colleagues, we could retry 25006 (read_only_sql_transaction) if we wish, once lib/pq#1136 has been merged. Shall we? @log1-c Please could you add that code to https://github.com/Icinga/icingadb/blob/v1.1.1/pkg/retry/retry.go#L161-L175 locally and report what happens?
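For illustration only, a sketch of what retrying on 25006 could look like. This is not the code in `pkg/retry/retry.go`: `pgError`, `retryableCodes` and `retry` are hypothetical names, and the fixed delay is an assumption chosen to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pgError is a hypothetical stand-in for the driver's error type
// (with lib/pq: *pq.Error), carrying the SQLSTATE in Code.
type pgError struct {
	Code    string
	Message string
}

func (e *pgError) Error() string { return "pq: " + e.Message }

// retryableCodes is an assumed set of SQLSTATEs worth retrying.
var retryableCodes = map[string]bool{
	"25006": true, // read_only_sql_transaction
}

// retry calls op up to attempts times (attempts must be >= 1), sleeping
// delay between tries. It stops early on success or on any error whose
// SQLSTATE is not in retryableCodes.
func retry(attempts int, delay time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		var pe *pgError
		if !errors.As(err, &pe) || !retryableCodes[pe.Code] {
			return err // not retryable: give up immediately
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return err
}

func main() {
	calls := 0
	err := retry(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			// Simulates writing to a node that is still a read-only standby.
			return &pgError{Code: "25006", Message: "cannot execute INSERT in a read-only transaction"}
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

A real implementation would more likely use a backoff and an overall timeout rather than a fixed attempt count.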
We must not rely on lib/pq#1136; we should fix this properly.
Unfortunately, "pq: %s" errors (at least "unexpected message %q; expected ReadyForQuery") are created in pq.errorf via fmt.Errorf and are of type *errors.errorString. But #698 is a great idea...
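To make the consequence concrete, a small sketch: an error built with `fmt.Errorf` without `%w` carries no SQLSTATE, so `errors.As` cannot extract a typed error from it and only the message text is left to match on. `pgError`, `sqlState` and `isUnexpectedMessage` are hypothetical names for this example; `pgError` stands in for the driver's typed error.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// pgError is a hypothetical stand-in for the driver's typed error
// (with lib/pq: *pq.Error).
type pgError struct {
	Code    string
	Message string
}

func (e *pgError) Error() string { return "pq: " + e.Message }

// sqlState returns the SQLSTATE carried by err, if it has one.
func sqlState(err error) (string, bool) {
	var pe *pgError
	if errors.As(err, &pe) {
		return pe.Code, true
	}
	return "", false
}

// isUnexpectedMessage falls back to comparing the message text, which is
// all that is available for errors created with plain fmt.Errorf.
func isUnexpectedMessage(err error) bool {
	return err != nil && strings.Contains(err.Error(), "expected ReadyForQuery")
}

func main() {
	// Built without %w, so the result is an opaque string error.
	plain := fmt.Errorf("pq: unexpected message %q; expected ReadyForQuery", byte('E'))
	typed := &pgError{Code: "25006", Message: "cannot execute INSERT in a read-only transaction"}

	_, ok := sqlState(plain)
	fmt.Println(ok, isUnexpectedMessage(plain)) // no code available, text match only

	code, ok := sqlState(typed)
	fmt.Println(code, ok) // typed error exposes its SQLSTATE
}
```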
The particular error is crafted here, https://github.com/lib/pq/blob/3d613208bca2e74f2a20e04126ed30bcb5c4cc27/conn.go#L1814, in the The called
The errors are all recovered within the
We are running Postgres as a backend for Icinga DB. Postgres is running as a Patroni cluster with 3 nodes. On the Icinga masters we use pgbouncer and haproxy for the connection to the database. pgbouncer listens for the connections from Icinga, and haproxy is set up to present one of the servers on localhost to pgbouncer.
We experienced some crashes of the daemon with error messages similar to #577
After some testing we noticed the following:
If a config deployment (or presumably any action that triggers a reload of the icinga2 service) happens at the time of a switch-over of the leading node, we get the "insert into" error:
Without a config deploy at switch-over time the error message changes to:
Looks like the icingadb daemon doesn't like it if the db cluster isn't available for even a short time.