Recurring incidents #1
Comments
Hello, and thanks for your feedback. It's true that on a "HARD" state the script creates the incident regardless of whether similar incidents have already been sent. However, if you have duplicate incidents, it means you have attached the event handler to the same service on multiple hosts, all mapped to a single Cachet component. As I do not collect the host name, it is hard to know which incident has to be updated or closed based only on the name and description of the service that triggers RECOVERY. That's why I assumed that the first incident I find is the right one (the handler does not keep track of open incidents). If I checked whether the incident already exists before creating it, then as soon as the service hits RECOVERY on one host, the incident would be resolved, even if other hosts are still down on that service. The only way to resolve this would be to add the host to the parameters and change the script to store this information in a state file and/or send it to Cachet, so that it can find the right incident to update afterwards. Can you confirm that you attached multiple Nagios hosts to the same Nagios service?
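The state-file idea described above could look roughly like the following. This is only a minimal sketch, not the handler's actual code: the function names, the JSON layout and the file location are all hypothetical, and it assumes the handler is given the host name in addition to the service name.

```python
import json
import os
import tempfile


def _key(host, service):
    # Encode the pair as a JSON list so that host or service names
    # containing separators such as "/" cannot collide.
    return json.dumps([host, service])


def load_state(path):
    """Return the saved mapping, or an empty one if the file does not exist."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def save_state(path, state):
    """Write to a temporary file, then rename it over the state file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def remember_incident(path, host, service, incident_id):
    """Record which incident was opened for this host+service pair."""
    state = load_state(path)
    state[_key(host, service)] = incident_id
    save_state(path, state)


def pop_incident(path, host, service):
    """Return and forget the incident for this pair, or None if unknown."""
    state = load_state(path)
    incident_id = state.pop(_key(host, service), None)
    if incident_id is not None:
        save_state(path, state)
    return incident_id
```

On a HARD problem state the handler would call `remember_incident` with the id returned by Cachet, and on RECOVERY it would call `pop_incident` to find exactly the incident to close, without exposing host names on the public status page.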
Hello, thank you for the quick response!
In that case, it means the event handler didn't fire properly on RECOVERY. With a single-host, single-service setup, the handler should set the status to "Fixed" on RECOVERY before a HARD CRIT/WARN is triggered again (logically, the service has to recover before it can trigger CRIT/WARN again). From https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/eventhandlers.html :
Or your services may be flapping from one state to another, but the event handler should still trigger while flapping. Did you see the event handler triggering in the Nagios logs? If not, please enable:
in the Nagios configuration (and restart Nagios), then show us the corresponding log lines and a screenshot of the events created in Cachet (the bug has to occur again for the logs to be meaningful). Yes, Cachet shows a timeline, but that doesn't mean you don't have to close the open incidents: if there is still one unresolved incident in the timeline, the status at the top of the page will still be red, so it should be fixed. I will try to improve the script in the coming days/weeks to include the host in the parameters, to deal with the multiple-hosts issue. Having the full host+service state will make it easier to find the right incident. Also, because I don't like showing hosts on public pages, I will implement this with a state file that stores the handled incidents.
I figured it out after playing with POST and PUT requests to Cachet and reading further through the comments on the Cachet issues. Updating the status of incident 39, as an example: When posting a new incident, the server returns the id in the response (tested in the console), 43 in that case: Hope the information helps.
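For readers who want to reproduce the POST/PUT flow described above, here is a rough Python sketch. The thread only confirms that POST returns the new incident id and that PUT updates an incident; the endpoint path `/api/v1/incidents/{id}`, the `data.id` response envelope, the `X-Cachet-Token` header and the status code 4 for "Fixed" are my understanding of the Cachet v1 API and should be checked against your Cachet version. The helper names and the example URL are hypothetical.

```python
import json
import urllib.request

# Assumption: in the Cachet v1 API, incident status 4 means "Fixed".
STATUS_FIXED = 4


def incident_id_from_response(body):
    """Extract the incident id from the JSON body returned by a POST.

    Assumes the response wraps the incident in a top-level "data" object.
    """
    return json.loads(body)["data"]["id"]


def build_update_request(base_url, token, incident_id, status):
    """Build (but do not send) a PUT request that changes an incident's status."""
    url = f"{base_url.rstrip('/')}/api/v1/incidents/{incident_id}"
    payload = json.dumps({"status": status}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "X-Cachet-Token": token,  # assumed name of the API token header
        },
    )


# Hypothetical usage: mark incident 39 as fixed.
request = build_update_request(
    "https://status.example.com", "my-api-token", 39, STATUS_FIXED
)
# urllib.request.urlopen(request) would actually send it.
```

Keeping the id returned by the POST is what makes the later PUT unambiguous, instead of searching for an incident by name.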
For people coming to this after the fact, I'd like to note that the following is no longer true in Cachet: "If there is still one unresolved incident in timeline, the status on the top of the page will still be red, so it should be fixed." You now have multiple incidents that show as a timeline, and when the final incident is marked as fixed, the status page returns to green.
Greetings,
You have done a great job here!
One issue: when an identical incident occurs more than once, the status change to "Fixed" is applied only to the first occurrence. The subsequent incidents are not updated.
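One possible way to reach the later occurrences is to select every unresolved incident with a matching name instead of stopping at the first match. The sketch below is only an illustration and not the script's actual logic: the function name is hypothetical, the incidents are assumed to be dictionaries with `id`, `name` and `status` fields, and status 4 is assumed to mean "Fixed". Note the maintainer's caveat elsewhere in this thread: when several hosts share one service, closing every match may also resolve incidents for hosts that are still down.

```python
# Assumption: status 4 means "Fixed"; any other value is treated as open.
STATUS_FIXED = 4


def open_incident_ids(incidents, name):
    """Return the ids of all unresolved incidents whose name matches."""
    return [
        incident["id"]
        for incident in incidents
        if incident["name"] == name and incident["status"] != STATUS_FIXED
    ]


# Made-up sample data: one fixed and two open "HTTP down" incidents.
incidents = [
    {"id": 39, "name": "HTTP down", "status": 4},
    {"id": 43, "name": "HTTP down", "status": 1},
    {"id": 44, "name": "DNS down", "status": 1},
    {"id": 45, "name": "HTTP down", "status": 2},
]
to_close = open_incident_ids(incidents, "HTTP down")  # [43, 45]
```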