Icinga DB crashes when it can't reach the PostgreSQL database #620

log1-c · 2023-07-24T12:11:56Z

We are running Postgres as a backend for IcingaDB. Postgres is running as a Patroni cluster with 3 nodes. On the Icinga Masters we use pgbouncer and haproxy for the connection to the database.
pgbouncer listens for the connections from icinga and haproxy is setup to present on of the servers on localhost to pgbouncer.

We experienced some crashes of the daemon with error messages similar to #577
After some testing we noticed the following:

If a config deployment (or presumably any action that triggers a reload of the icinga2 service) happens at the time of a switch-over of the leading node, we get the "insert into" error:

Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597b190: icingadb/[email protected]:6432 closing because: server conn crashed? (age=25138s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659739a0: icingadb/icingadb@[::1]:56192 closing because: server conn crashed? (age=25138s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659739a0: icingadb/icingadb@[::1]:56192 pooler error: server conn crashed?
Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597acd0: icingadb/[email protected]:6432 closing because: server conn crashed? (age=1499s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659734e0: icingadb/icingadb@[::1]:45092 closing because: server conn crashed? (age=1499s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659734e0: icingadb/icingadb@[::1]:45092 pooler error: server conn crashed?
Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/[email protected]:6432 closing because: server conn crashed? (age=4173s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:35834 closing because: server conn crashed? (age=4173s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:35834 pooler error: server conn crashed?
Jul 17 15:42:50 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:34438 login attempt: db=icingadb user=icingadb tls=no
Jul 17 15:42:50 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/[email protected]:6432 new connection to server (from 127.0.0.1:43752)
Jul 17 15:42:52 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/[email protected]:6432 closing because: server conn crashed? (age=2s)
Jul 17 15:42:52 ma02 pgbouncer[875]: stats: 0 xacts/s, 1 queries/s, in 556 B/s, out 75 B/s, xact 37939 us, query 9738 us, wait 0 us
Jul 17 15:42:53 ma02 icingadb[205409]: pq: unexpected message 'E'; expected ReadyForQuery#012can't perform "INSERT INTO \"state_history\" (\"state_type\", \"check_source\", \"id\", \"previous_soft_state\", \"max_check_attempts\", \"previous_hard_state\", \"check_attempt\", \"scheduling_source\", \"endpoint_id\", \"host_id\", \"object_type\", \"service_id\", \"event_time\", \"soft_state\", \"hard_state\", \"output\", \"long_output\", \"environment_id\") VALUES (:state_type,:check_source,:id,:previous_soft_state,:max_check_attempts,:previous_hard_state,:check_attempt,:scheduling_source,:endpoint_id,:host_id,:object_type,:service_id,:event_time,:soft_state,:hard_state,:output,:long_output,:environment_id) ON CONFLICT ON CONSTRAINT pk_state_history DO UPDATE SET \"id\" = EXCLUDED.\"id\""#012github.com/icinga/icingadb/internal.CantPerformQuery#012#011github.com/icinga/icingadb/internal/internal.go:30#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:394#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:45#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:389#012golang.org/x/sync/errgroup.(*Group).Go.func1#012#011golang.org/x/[email protected]/errgroup/errgroup.go:57#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012can't retry#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:64#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:389#012golang.org/x/sync/errgroup.(*Group).Go.func1#012#011golang.org/x/[email protected]/errgroup/errgroup.go:57#012runtime.goexit#012#011runtime/asm_amd64.s:1594
Jul 17 15:42:53 ma02 systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Jul 17 15:42:53 ma02 systemd[1]: icingadb.service: Failed with result 'exit-code'.

Without a config deploy at switch-over time the error message changes to:

Jul 17 15:48:06 ma02 icingadb[348949]: pq: cannot use serializable mode in a hot standby#012can't start transaction#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize.func1#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:253#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:45#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:245#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:211#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012can't retry#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:64#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:245#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:211#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012HA aborted#012github.com/icinga/icingadb/pkg/icingadb.(*HA).abort.func1#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:128#012sync.(*Once).doSlow#012#011sync/once.go:74#012sync.(*Once).Do#012#011sync/once.go:65#012github.com/icinga/icingadb/pkg/icingadb.(*HA).abort#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:126#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:218#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012HA exited with an error#012main.run#012#011github.com/icinga/icingadb/cmd/icingadb/main.go:335#012main.main#012#011github.com/icinga/icingadb/cmd/icingadb/main.go:37#012runtime.main#012#011runtime/proc.go:250#012runtime.goexit#012#011runtime/asm_amd64.s:1594

Looks like the icingadb daemon doesn't like it if the db cluster isn't available for even a short time.

The text was updated successfully, but these errors were encountered:

julianbrost · 2023-07-24T15:11:05Z

pq: unexpected message 'E'; expected ReadyForQuery

There's an issue in our database client driver: lib/pq#478. It would be interesting to know which error is hiding in there, so this will need further debugging.

I've also found an interesting comment in another project that claims (I haven't verified this myself but it sounds plausible) that the PostgreSQL wire protocol allows multiple error messages being sent: cockroachdb/cockroach#24149 (comment). So this sounds like it's quite possible this is an issue with the client, not the database cluster.

pq: cannot use serializable mode in a hot standby

According to the PostgreSQL documentation:

When the hot_standby parameter is set to true on a standby server, it will begin accepting connections once the recovery has brought the system to a consistent state. All such connections are strictly read-only; not even temporary tables may be written.

The Icinga DB daemon will always write to the database, which obviously can't work on a hot standby server. Could it be the case that connections are routed to the wrong server? Or is something like all servers are running as hot standby for a short time during a failover operation something that might happen?

log1-c · 2023-07-31T12:11:35Z

I have tested what happens during a switch-over:
Patroni node status:
Leader changes to "unknown" for a short time, new leader changes from "replica" to "leader".

hot_standby is necessary for patroni to work (patroni/patroni#1811 (comment)), so it can't be disabled.

So I'm not sure what to test/debug further.

But I think the IcingaDB daemon should not crash, when it can't connect to the database, but rather cache it's queries, like the IDO feature does/did.

julianbrost · 2023-07-31T12:59:46Z

For the "cannot use serializable mode in a hot standby" part of the issue, treating this error as retryable (similar to how we do for other server is starting/shutting down errors already), but from a quick online search, it looks like error code 0A000 (feature_not_supported) is used for that, so something quite generic (comparing error messages is always ugly, they better don't change ever).

log1-c · 2023-12-19T12:59:19Z

Follow up to this:

Due to the changes mentioned in the linked lib/pq issue the error message is now more specific:

Dec 17 20:18:19 msd-ic-ma02 icingadb[1674]: Starting history sync
Dec 17 20:18:20 msd-ic-ma02 icingadb[1674]: pq: cannot execute INSERT in a read-only transaction
                                            can't perform "INSERT INTO \"state_history\" 
...

We now "tuned" our haproxy config, so that it is faster in switching after detecting a non-functioning connection.
Since then the daemon did not crash on leading node switches.

Al2Klimov · 2023-12-19T15:06:21Z

Colleagues, we could re-try 25006 (read_only_sql_transaction) if we wish, once lib/pq#1136 has been merged. Shall we?

@log1-c Please could you add that code to https://github.com/Icinga/icingadb/blob/v1.1.1/pkg/retry/retry.go#L161-L175 locally and report what happens?

lippserd · 2024-08-06T09:26:59Z

We must not rely on lib/pq#1136 and fix this properly.

lippserd · 2024-08-06T09:30:21Z

[root@hostname ~]# systemctl status icingadb
● icingadb.service - Icinga DB
Loaded: loaded (/usr/lib/systemd/system/icingadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2024-08-02 09:46:12 CEST; 1min 44s ago
Process: 452833 ExecStart=/usr/sbin/icingadb --config /etc/icingadb/config.yml (code=exited, status=1/FAILURE)
Main PID: 452833 (code=exited, status=1/FAILURE)

...

Aug 02 09:45:04 msp-ic-ma01 icingadb[452833]: history-sync: Synced 10 state history items
Aug 02 09:45:11 msp-ic-ma01 icingadb[452833]: runtime-updates: Upserted 10 ServiceState items
Aug 02 09:45:24 msp-ic-ma01 icingadb[452833]: history-sync: Synced 3 state history items
Aug 02 09:45:31 msp-ic-ma01 icingadb[452833]: runtime-updates: Upserted 5 ServiceState items
Aug 02 09:45:44 msp-ic-ma01 icingadb[452833]: history-sync: Synced 9 state history items
Aug 02 09:45:51 msp-ic-ma01 icingadb[452833]: runtime-updates: Upserted 11 ServiceState items
Aug 02 09:45:57 msp-ic-ma01 icingadb[452833]: overdue-sync: Synced 1 service overdue indicators
Aug 02 09:46:04 msp-ic-ma01 icingadb[452833]: history-sync: Synced 11 state history items
Aug 02 09:46:12 msp-ic-ma01 systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Aug 02 09:46:12 msp-ic-ma01 systemd[1]: icingadb.service: Failed with result 'exit-code'.

...

[root@hostname ~]# systemctl status icingadb
● icingadb.service - Icinga DB
Loaded: loaded (/usr/lib/systemd/system/icingadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2024-08-02 09:46:07 CEST; 50s ago
Process: 4072859 ExecStart=/usr/sbin/icingadb --config /etc/icingadb/config.yml (code=exited, status=1/FAILURE)
Main PID: 4072859 (code=exited, status=1/FAILURE)

...

Aug 02 09:43:49 hostname icingadb[4072859]: history-sync: Synced 6 state history items
Aug 02 09:44:09 hostname icingadb[4072859]: history-sync: Synced 4 state history items
Aug 02 09:44:29 hostname icingadb[4072859]: history-sync: Synced 6 state history items
Aug 02 09:44:29 hostname icingadb[4072859]: high-availability: Another instance is active
Aug 02 09:44:49 hostname icingadb[4072859]: history-sync: Synced 2 state history items
Aug 02 09:45:09 hostname icingadb[4072859]: history-sync: Synced 11 state history items
Aug 02 09:45:29 hostname icingadb[4072859]: history-sync: Synced 5 state history items
Aug 02 09:45:49 hostname icingadb[4072859]: history-sync: Synced 10 state history items
Aug 02 09:46:07 hostname systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Aug 02 09:46:07 hostname systemd[1]: icingadb.service: Failed with result 'exit-code'.

...

Aug 02 09:58:20 hostname systemd[1]: Starting Icinga DB...
Aug 02 09:58:20 hostname icingadb[4102936]: Starting Icinga DB daemon (v1.2.0)
Aug 02 09:58:20 hostname icingadb[4102936]: Connecting to database at '127.0.0.1:5432'
Aug 02 09:58:20 hostname systemd[1]: Started Icinga DB.
Aug 02 09:58:20 hostname icingadb[4102936]: driver: bad connection
can't connect to database
main.run
github.com/icinga/icingadb/cmd/icingadb/main.go:70
main.main
github.com/icinga/icingadb/cmd/icingadb/main.go:37
runtime.main
runtime/proc.go:271
runtime.goexit
runtime/asm_amd64.s:1695

...

Aug 02 10:34:53 hostname icingadb[4113948]: pq: unexpected message 'E'; expected ReadyForQuery
can't perform "INSERT INTO \"history\" (\"environment_id\", \"endpoint_id\", \"object_type\", \"service_id\", \"event_type\", \"state_history_id\", \"event_time\", \"host_id\", \"id\") VALUES (:environment_id,:endpoint_id,:object_type,:service_id,:event_type,:state_history_id,:event_time,:host_id,:id) ON CONFLICT ON CONSTRAINT pk_history DO UPDATE SET \"id\" = EXCLUDED.\"id\""
github.com/icinga/icingadb/internal.CantPerformQuery
github.com/icinga/icingadb/internal/internal.go:30
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2.1
github.com/icinga/icingadb/pkg/icingadb/db.go:412
github.com/icinga/icingadb/pkg/retry.WithBackoff
github.com/icinga/icingadb/pkg/retry/retry.go:60
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
github.com/icinga/icingadb/pkg/icingadb/db.go:407
golang.org/x/sync/errgroup.(*Group).Go.func1
golang.org/x/[email protected]/errgroup/errgroup.go:78
runtime.goexit
runtime/asm_amd64.s:1695
can't retry
github.com/icinga/icingadb/pkg/retry.WithBackoff
github.com/icinga/icingadb/pkg/retry/retry.go:82
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
github.com/icinga/icingadb/pkg/icingadb/db.go:407
golang.org/x/sync/errgroup.(*Group).Go.func1
golang.org/x/[email protected]/errgroup/errgroup.go:78
runtime.goexit
runtime/asm_amd64.s:1695

julianbrost · 2024-08-08T08:52:16Z

How is this unexpected message 'E' error returned? Is it another type than *pq.Error? Otherwise, shouldn't this be covered by #698 (in particular 33e07c4)?

Al2Klimov · 2024-08-08T09:59:33Z

Unfortunately, "pq: %s" errors -at least "unexpected message %q; expected ReadyForQuery"- are made in pq.errorf by fmt.Errorf and of type *errors.errorString. But #698 is a great idea...

oxzi · 2024-08-08T12:50:08Z

The particular error is crafted here, https://github.com/lib/pq/blob/3d613208bca2e74f2a20e04126ed30bcb5c4cc27/conn.go#L1814, in the readReadyForQuery method, expecting some "raw" Z message, while getting the logged E message instead.

The called errorf function itself wraps a panic(fmt.Errorf("pq: %s", fmt.Sprintf(s, args...))) call, resulting in panic("pq: unexpected message Z; expected ReadyForQuery"). Thus, it is not only an error, but a panic, which is recovered somewhere down the line, @Al2Klimov. Btw, I have not found a promising recover in the library with a quick grep.

yhabteab · 2024-08-08T14:19:36Z

Btw, I have not found a promising recover in the library with a quick grep.

The errors are all recovered within the conn.go file (search for errRecover(&err)).

log1-c changed the title ~~IcingaDB daemon crashes when it can't reach the database.~~ IcingaDB daemon crashes when it can't reach the psql database. Jul 31, 2023

Al2Klimov self-assigned this Aug 9, 2023

Al2Klimov mentioned this issue Sep 26, 2023

conn#readReadyForQuery(): on error (E) message throw that specific error lib/pq#1136

Open

log1-c closed this as completed Dec 19, 2023

Al2Klimov removed their assignment Dec 19, 2023

lippserd added the dev-call Issues and Pull Requests to be discussed at the Dev Call. label Aug 2, 2024

lippserd changed the title ~~IcingaDB daemon crashes when it can't reach the psql database.~~ Icinga DB crashes when it can't reach the PostgreSQL database Aug 6, 2024

lippserd added this to the 1.3.0 milestone Aug 6, 2024

lippserd reopened this Aug 6, 2024

lippserd added the bug Something isn't working label Aug 6, 2024

Al2Klimov mentioned this issue Aug 8, 2024

Actually retry every database error, including "pq: ..." Icinga/icinga-go-library#59

Merged

Al2Klimov self-assigned this Aug 12, 2024

yhabteab closed this as completed in Icinga/icinga-go-library#59 Oct 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Icinga DB crashes when it can't reach the PostgreSQL database #620

Icinga DB crashes when it can't reach the PostgreSQL database #620

log1-c commented Jul 24, 2023

julianbrost commented Jul 24, 2023

log1-c commented Jul 31, 2023

julianbrost commented Jul 31, 2023

log1-c commented Dec 19, 2023

Al2Klimov commented Dec 19, 2023

lippserd commented Aug 6, 2024

lippserd commented Aug 6, 2024

julianbrost commented Aug 8, 2024

Al2Klimov commented Aug 8, 2024

oxzi commented Aug 8, 2024

yhabteab commented Aug 8, 2024

Icinga DB crashes when it can't reach the PostgreSQL database #620

Icinga DB crashes when it can't reach the PostgreSQL database #620

Comments

log1-c commented Jul 24, 2023

julianbrost commented Jul 24, 2023

log1-c commented Jul 31, 2023

julianbrost commented Jul 31, 2023

log1-c commented Dec 19, 2023

Al2Klimov commented Dec 19, 2023

lippserd commented Aug 6, 2024

lippserd commented Aug 6, 2024

julianbrost commented Aug 8, 2024

Al2Klimov commented Aug 8, 2024

oxzi commented Aug 8, 2024

yhabteab commented Aug 8, 2024