You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Background: I've been running consul-alerts for years in a cluster of 3 consul servers.
The last restart I've done was ~4 months ago.
Today external website monitoring alerted me that one of my 3 machines was down (the reason being that the underlying VPN excluded it), including 1 of the 3 consul servers. I was surprised that external monitoring told me that, and consul-alerts did not, as that is its main job.
I checked the configured consul-alerts/config/notifiers/log/path on all 3 machines, and the last outputs as recorded in those log files were from 7 days ago, 6 days ago, and 3 days ago respectively.
System logs on all 3 machines:
[root@node-1:~]# systemctl status consul-alerts | cat
● consul-alerts.service
Loaded: loaded (/nix/store/xj9xpqylbnifwzxaisllqnq4yrnzw0xg-unit-consul-alerts.service/consul-alerts.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2019-11-03 02:23:58 UTC; 4 months 26 days ago
Main PID: 1703 (consul-alerts)
Tasks: 83 (limit: 4915)
CGroup: /system.slice/consul-alerts.service
├─ 1703 /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts start --alert-addr=localhost:9000 --consul-addr=localhost:8500 --watch-checks --watch-events
├─ 1769 consul watch -http-addr localhost:8500 -datacenter dc1 -type checks /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch checks --alert-addr localhost:9000 --log-level warn
├─ 1770 consul watch -http-addr localhost:8500 -datacenter dc1 -type event /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch event --alert-addr localhost:9000 --log-level warn
└─17052 /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch checks --alert-addr localhost:9000 --log-level warn
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
[root@node-2:~]# systemctl status consul-alerts | cat
● consul-alerts.service
Loaded: loaded (/nix/store/xj9xpqylbnifwzxaisllqnq4yrnzw0xg-unit-consul-alerts.service/consul-alerts.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2020-01-17 17:50:03 UTC; 2 months 12 days ago
Main PID: 21793 (consul-alerts)
Tasks: 76 (limit: 4915)
CGroup: /system.slice/consul-alerts.service
├─21793 /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts start --alert-addr=localhost:9000 --consul-addr=localhost:8500 --watch-checks --watch-events
├─21806 consul watch -http-addr localhost:8500 -datacenter dc1 -type event /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch event --alert-addr localhost:9000 --log-level warn
└─21807 consul watch -http-addr localhost:8500 -datacenter dc1 -type checks /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch checks --alert-addr localhost:9000 --log-level warn
Mar 30 16:14:05 node-2 consul-alerts[21793]: 2020/03/30 16:14:05 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:17:12 node-2 consul-alerts[21793]: 2020/03/30 16:17:12 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:20:20 node-2 consul-alerts[21793]: 2020/03/30 16:20:20 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:23:27 node-2 consul-alerts[21793]: 2020/03/30 16:23:27 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:26:34 node-2 consul-alerts[21793]: 2020/03/30 16:26:34 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:29:41 node-2 consul-alerts[21793]: 2020/03/30 16:29:41 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:32:48 node-2 consul-alerts[21793]: 2020/03/30 16:32:48 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:35:56 node-2 consul-alerts[21793]: 2020/03/30 16:35:56 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:39:03 node-2 consul-alerts[21793]: 2020/03/30 16:39:03 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:42:10 node-2 consul-alerts[21793]: 2020/03/30 16:42:10 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
[root@node-3:~]# systemctl status consul-alerts | cat
● consul-alerts.service
Loaded: loaded (/nix/store/xj9xpqylbnifwzxaisllqnq4yrnzw0xg-unit-consul-alerts.service/consul-alerts.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2019-11-03 02:24:38 UTC; 4 months 26 days ago
Main PID: 1592 (consul-alerts)
Tasks: 76 (limit: 4915)
CGroup: /system.slice/consul-alerts.service
├─1592 /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts start --alert-addr=localhost:9000 --consul-addr=localhost:8500 --watch-checks --watch-events
├─1624 consul watch -http-addr localhost:8500 -datacenter dc1 -type event /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch event --alert-addr localhost:9000 --log-level warn
└─1625 consul watch -http-addr localhost:8500 -datacenter dc1 -type checks /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch checks --alert-addr localhost:9000 --log-level warn
Mar 19 13:27:24 node-3 consul-alerts[1592]: 2020/03/19 13:27:24 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (rpc error making call: EOF), retry in 5s
Mar 19 13:27:36 node-3 consul-alerts[1592]: 2020/03/19 13:27:36 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 20s
Mar 30 15:07:48 node-3 consul-alerts[1592]: 2020/03/30 15:07:48 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (rpc error making call: EOF), retry in 5s
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Note the Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable. is not important here -- that is what systemd prints for log rotation, but the newer (perhaps relevant) logs are still present here).
In the output we can see that node-1 logged no errors, node-2 was in a retry loop (makes sense given that it is the one that was kicked out of the VPN, so it cannot talk to the Consul server quorum), and node 3 apparently had a short error today, but retried only once because its retry did not grow beyond 5s.
If I log into the consul dashboard (localhost:8500 on either server), I can see that lots of services are marked red.
Yet, consul-alerts today sent me no notifications at all -- neither to Slack (as I have it configured), nor into the configured log file. It usually does notify my.
So the question is: Why did it fail as a watchdog today?
I would have expected it to notify me of the errors shown in the Consul UI shortly after they ocurred.
A systemctl restart consul-alerts on node-1 immediately resulted in a Slack notification, and correct entries in the log file (but only in the log file of node-3!).
So it seems that something got stuck, and only manual restarting revived it (which is of course not good for an automated high-availability watchdog).
I cannot currently tell whether the fault is in consul-alerts, or in consul itself (consul-alerts starts consul watch, so in case that one hung and didn't output anything, it would be consul's fault).
Maybe anybody knows anything?
The text was updated successfully, but these errors were encountered:
Background: I've been running consul-alerts for years in a cluster of 3 consul servers.
The last restart I've done was ~4 months ago.
Today external website monitoring alerted me that one of my 3 machines was down (the reason being that the underlying VPN excluded it), including 1 of the 3 consul servers. I was surprised that external monitoring told me that, and
consul-alerts
did not, as that is its main job.I checked the configured
consul-alerts/config/notifiers/log/path
on all 3 machines, and the last outputs as recorded in those log files were from 7 days ago, 6 days ago, and 3 days ago respectively.System logs on all 3 machines:
Note the
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
is not important here -- that is what systemd prints for log rotation, but the newer (perhaps relevant) logs are still present here).In the output we can see that
node-1
logged no errors,node-2
was in a retry loop (makes sense given that it is the one that was kicked out of the VPN, so it cannot talk to the Consul server quorum), and node 3 apparently had a short error today, but retried only once because its retry did not grow beyond5s
.If I log into the consul dashboard (
localhost:8500
on either server), I can see that lots of services are marked red.Yet, consul-alerts today sent me no notifications at all -- neither to Slack (as I have it configured), nor into the configured log file. It usually does notify my.
So the question is: Why did it fail as a watchdog today?
I would have expected it to notify me of the errors shown in the Consul UI shortly after they ocurred.
A
systemctl restart consul-alerts
onnode-1
immediately resulted in a Slack notification, and correct entries in the log file (but only in the log file ofnode-3
!).So it seems that something got stuck, and only manual restarting revived it (which is of course not good for an automated high-availability watchdog).
I cannot currently tell whether the fault is in
consul-alerts
, or inconsul
itself (consul-alerts
startsconsul watch
, so in case that one hung and didn't output anything, it would be consul's fault).Maybe anybody knows anything?
The text was updated successfully, but these errors were encountered: