
Incident Response

Bianca Rivera Alvelo edited this page Jul 29, 2024 · 30 revisions


VRO is a moderately complex system that integrates with other systems and is hosted on shared infrastructure. Unplanned service failures and disruptions are inevitable. The first objective of the incident response process is to restore normal service operation as quickly as possible and to minimize the incident's impact. This document describes the core steps for responding to incidents.

Definition

We define an incident as an unplanned event or occurrence that disrupts normal operations, services, or functions on the VRO platform. An incident negatively impacts the availability, performance, security, or functionality of a VRO service and requires immediate attention to mitigate its effects and restore normalcy. Incidents can vary widely in scope and severity and can be caused by factors in or out of the VRO team's control.

Examples:

  • Service Outages: Complete or partial unavailability of infrastructure services.
  • Performance Degradation: Noticeable slowdown or inefficiency in infrastructure services.
  • Security Breaches: Unauthorized access, data breaches, or vulnerabilities affecting infrastructure integrity, confidentiality, or availability.
  • Operational Failures: Failures in deployment pipelines, configuration management, or automated processes impacting normal operations.
  • Resource Exhaustion: Over-utilization or exhaustion of resources leading to degraded service.
  • Unexpected Behavior: Anomalies or unexpected behaviors in infrastructure services affecting development, testing, or deployment activities.

Root Cause

The root cause of an incident is investigated and assigned to the team responsible for the issue’s origin - not necessarily the team addressing it - to ensure accurate reporting and unbiased resolution efforts.

Root causes are documented in Incident Reports and labeled on incident tickets. During the resolution process of an incident, the root cause is not included in updates or communications with partner teams to avoid bias and ensure impartiality.

Types of Root Causes:

  1. VRO - an issue on the VRO platform that ties directly to our team's scope of responsibilities

These incidents are tracked to measure our MTTR (Mean Time to Resolve) metric.

  2. Partner Team Application or External VA - an issue with an application controlled by the partner team, or with an external VA system that is not functioning properly

These are not within the VRO team's control, but the VRO team assists in resolving the incident alongside the affected partner team.

  3. LHDI - an issue on the LHDI platform

These are not within the VRO team's control, but the VRO team reports them to LHDI and works to resolve them in partnership with LHDI.
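
As a rough illustration of how the MTTR metric could be computed from labeled incidents, the sketch below averages the time to resolve over incidents whose root cause is VRO. The record shape, the label strings, and the choice to measure from report time to resolution time are assumptions made for this example, not a description of the team's actual tooling or metric definition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record shape; field names are assumptions for illustration.
    root_cause: str        # e.g. "VRO", "Partner Team or External VA", "LHDI"
    reported_at: datetime
    resolved_at: datetime


def mean_time_to_resolve(incidents: list[Incident]) -> Optional[timedelta]:
    """Average resolution time over incidents whose root cause is VRO."""
    durations = [
        i.resolved_at - i.reported_at
        for i in incidents
        if i.root_cause == "VRO"
    ]
    if not durations:
        return None  # no VRO-caused incidents, so the metric is undefined
    return sum(durations, timedelta()) / len(durations)
```

Incidents attributed to a partner team, an external VA system, or LHDI are filtered out, which mirrors the statement that only VRO root causes count toward MTTR.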

Responsibilities

By default, incident response is the responsibility of the VRO primary on-call engineer, a role that rotates with every VRO sprint period. Throughout the process, they might personally conduct each step or delegate tasks as needed; regardless, a single individual should be identified as being in charge of the incident response. If this responsibility needs to be transferred while an incident is active, the handoff should be explicitly communicated (sample language provided in Acknowledge).

The Slack channel #benefits-vro-on-call is designated for messaging about active incidents, to foster shared situational awareness among the VRO team, partner teams, and stakeholders. During an incident, the individual leading the incident response (or a delegate) should provide status updates at regular intervals and when key developments occur.

A communication plan for non-VRO team members during an incident is on the VRO backlog (#2959). Pending materials from #2959, strive for messaging that is succinct and specific, links to pertinent reference materials (e.g., app performance dashboards, GH commits, StackOverflow threads), and avoids jargon or acronyms that non-VRO team members may not be familiar with.

Also on the VRO backlog (#3005) is the development of more guidance on handling common on-call situations.

Process

Incident Response flowchart

| Step | Description | Estimated time to complete | SLA* |
| --- | --- | --- | --- |
| Step 0: Acknowledge | Upon learning of a potential incident, acknowledge the report and start investigating. | 2 minutes | within 60 minutes of report |
| Step 1: Triage | Conduct a brief assessment. Determine an initial severity level (SEV) and affected systems. Share initial assessment and intended next steps. | 10 minutes | initial assessment within 30 minutes of Acknowledgement |
| Step 2: Contain/Stabilize | Work to prevent further damage. Share regular status updates. (at team's discretion for SEV 3 or SEV 4) | varies | SEV 1: 16 business hours; SEV 2: 24 business hours; SEV 3, SEV 4: as soon as VRO can prioritize |
| Step 3: Remediate (short-term) | Restore system functionality. Share regular status updates. Follow up with the party that reported the incident (if applicable). (at team's discretion for SEV 3 or SEV 4) | varies | SEV 1: 16 business hours; SEV 2: 24 business hours; SEV 3, SEV 4: as soon as VRO can prioritize |
| Step 4: Monitor | Verify that the incident is under control. | at minimum: 30 minutes | n/a |
| Step 5: Log the incident | Document the incident in the VRO wiki's Incident Reports. | 30 minutes | within 8 business hrs of the incident's resolution |
| Step 6: Post-incident review | Assess what happened, what went well, what did not go well, and measures to prevent a recurrence. (at team's discretion for SEV 3 or SEV 4) | 4 hrs | within 5 business days of the incident's resolution |
| Step 7: Discuss longer term remediation | Prioritize implementation of preventative measures. | varies | within 2 sprint cycles of the incident's resolution |

*A service level agreement (SLA) between VRO and partner teams is still to be defined, covering details such as on-call coverage windows (e.g., 24/7, or 9am-5pm ET on weekdays). We expect to resolve this as part of #2959.
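
The SLA targets for the first steps are simple offsets from a timestamp, so they can be sketched in a few lines. The example below is a minimal sketch: the constant and function names are hypothetical, and the Step 2 and Step 3 targets are kept as plain business-hour counts because the coverage window that defines a "business hour" has not been settled yet.

```python
from datetime import datetime, timedelta

# Wall-clock targets for Step 0 and Step 1 (names are illustrative assumptions).
ACKNOWLEDGE_WITHIN = timedelta(minutes=60)  # measured from the report
TRIAGE_WITHIN = timedelta(minutes=30)       # measured from the acknowledgement

# Business-hour targets for Steps 2 and 3.
# None stands for "as soon as VRO can prioritize".
RESOLUTION_BUSINESS_HOURS = {
    "SEV 1": 16,
    "SEV 2": 24,
    "SEV 3": None,
    "SEV 4": None,
}


def acknowledge_deadline(reported_at: datetime) -> datetime:
    """Latest time by which the report should be acknowledged."""
    return reported_at + ACKNOWLEDGE_WITHIN


def triage_deadline(acknowledged_at: datetime) -> datetime:
    """Latest time by which the initial assessment should be shared."""
    return acknowledged_at + TRIAGE_WITHIN
```

Converting the business-hour counts into actual deadlines is deliberately left out, since it depends on the on-call coverage window.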

Step 0: Acknowledge

Upon learning of a potential incident, acknowledge the report and start investigating. Timebox: 2 minutes

Why: reduce likelihood of uncoordinated troubleshooting efforts; reduce panic; establish consistent data point for calculating incident metrics

Common situations and sample acknowledgements:

| Source | Acknowledgement |
| --- | --- |
| Incident Report (slack workflow) | an emoji reaction of 👀 to the Incident Report post in #benefits-vro-support. |
| slack post other than the Incident Report | 1. an emoji reaction of 👀 to the post. 2. an Incident Report using the slack workflow. 3. an emoji reaction of 👀 on the Incident Report post in #benefits-vro-support. If the source is a person and they are not in #benefits-vro-support, direct them to the channel. |
| email message from a known partner or stakeholder | 1. an email response: "Investigating. If you have access to the DSVA slack, I will be tracking this in #benefits-vro-support." 2. an Incident Report using the slack workflow. 3. an emoji reaction of 👀 on the Incident Report post in #benefits-vro-support. |
| email message from an unknown party | Consult the VRO team and OCTO Enablement Team. |

If at any time the individual in charge needs to transfer responsibility to someone else, this should be noted in #benefits-vro-on-call. Sample message: Handing over incident response to [person taking over the incident].

Step 1: Triage

Conduct a brief assessment. Determine scope and an initial severity level. Timebox: 10 minutes.

Why: gain situational awareness; respond with appropriate urgency

Table: Severity Levels

| Severity Level | Description | Examples | Priority |
| --- | --- | --- | --- |
| SEV 1 | Core functionality is unavailable or buggy | a VRO app appears offline; a VRO app's transactions are failing; a VRO data service appears unresponsive; inaccurate data is transmitted | immediate investigation |
| SEV 2 | Core functionality is degraded | increased latency; increased retry attempts | immediate investigation |
| SEV 3 | Unexpected metrics related to core functionality, although without noticeable performance degradation | sustained increase in CPU utilization; sustained increase in open database connections | continued passive monitoring; within the next business day: investigation into what is causing the issue |
| SEV 4 | Non-core functionality is affected | gaps or increased latency in transmitting data to an analytics platform | immediate investigation, limited to identifying root cause |

Share this initial assessment and intended next steps in #benefits-vro-on-call.
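
To keep initial assessments consistent, the severity levels and their priorities can be encoded once and reused when drafting the message. The sketch below is only one possible shape: the `Severity` enum, the `PRIORITY` mapping, and the message format are assumptions for illustration and not an established VRO template.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV 1"
    SEV2 = "SEV 2"
    SEV3 = "SEV 3"
    SEV4 = "SEV 4"


# Priorities paraphrased from the Severity Levels table.
PRIORITY = {
    Severity.SEV1: "immediate investigation",
    Severity.SEV2: "immediate investigation",
    Severity.SEV3: "continued passive monitoring; investigation within the next business day",
    Severity.SEV4: "immediate investigation, limited to identifying root cause",
}


def initial_assessment(severity: Severity, affected_systems: list[str], next_steps: str) -> str:
    """Draft a short triage message (hypothetical format)."""
    systems = ", ".join(affected_systems) or "unknown"
    return (
        f"Initial assessment: {severity.value}. "
        f"Affected: {systems}. "
        f"Priority: {PRIORITY[severity]}. "
        f"Next steps: {next_steps}"
    )
```

For example, `initial_assessment(Severity.SEV2, ["example-app"], "checking recent deploys")` produces a one-line summary naming the severity, the affected system, the priority, and the next steps.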

Step 2: Contain/Stabilize

may not be required for SEV 3 or SEV 4

Work to prevent further damage. As needed, enlist assistance from VRO teammates, partner teams, and LHDI.

Why: containing the situation might provide more immediate relief than implementing a remediation

Some starter things to consider:

  • is there a configuration change to prevent requests to the buggy system?
  • would an increase in compute resources temporarily stabilize the system?

Share a status update at least 1x/hr during business hours to #benefits-vro-on-call, and before moving to the next step.

Step 3: Remediate (short-term)

may not be required for SEV 3 or SEV 4

Get the system back to a minimally acceptable operating status.

To consider:

  • should compute resources be recalibrated?
  • would a rollback or roll-forward of code/configuration be appropriate and feasible?

Share a status update at least 1x/hr during business hours to #benefits-vro-on-call, and when this step is completed (at which point the incident is considered resolved).

Follow up with the party that reported the incident (if applicable).

Step 4: Monitor

Look for data points that indicate the incident is under control. As needed, return to Steps 2 (Contain/Stabilize) and 3 (Remediate).

Why: gain confidence that the incident is under control

Share a status update to #benefits-vro-on-call when starting this step and when this step is completed, and at any point if monitoring indicates the incident is not under control.

Sample messages:

  • A remediation is in place. Starting a period of monitoring for stability.
  • Things appear close to back to normal. I'm closing this incident.

Step 5: Log the incident

Document the incident in the VRO wiki's Incident Reports. This is a snapshot of the incident rather than an in-depth analysis.

Why: build a record of incidents that can reveal patterns, inform engineering decisions, and be a general resource

Details to account for:

  • how the incident was detected, including timestamp
  • severity level
  • corrective measures, and timestamp of when system returned to operating status
  • "red herrings" that were encountered
  • follow-up tasks

Better guidance on how to log an incident is an expected outcome of #3005.
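
Until that guidance exists, the details listed above can be treated as a small record with a completeness check. This is a minimal sketch under stated assumptions: the class name, field names, and the decision to treat "red herrings" and follow-up tasks as optional are all hypothetical, not an agreed Incident Report schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class IncidentLogEntry:
    # Hypothetical field names mirroring the "Details to account for" list.
    detected_how: str = ""
    detected_at: Optional[datetime] = None
    severity: str = ""
    corrective_measures: list[str] = field(default_factory=list)
    restored_at: Optional[datetime] = None
    red_herrings: list[str] = field(default_factory=list)     # may be empty
    follow_up_tasks: list[str] = field(default_factory=list)  # may be empty

    def missing_details(self) -> list[str]:
        """Names of the required details that have not been filled in."""
        required = {
            "detected_how": self.detected_how,
            "detected_at": self.detected_at,
            "severity": self.severity,
            "corrective_measures": self.corrective_measures,
            "restored_at": self.restored_at,
        }
        return [name for name, value in required.items() if not value]
```

An entry with an empty `missing_details()` result has the minimum needed for a snapshot; deeper analysis belongs in the post-incident review.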

Step 6: Post-incident review

at team's discretion for SEV 3 and SEV 4

Assess what happened, what went well, what did not go well, and measures to prevent a recurrence. Describe troubleshooting measures, including log snippets and command line tools. Share this document with the VRO team and with partner teams.

Why: leverage the incident as a learning opportunity; surface further corrective measures

Follow principles of blameless post-mortems:

focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior.

Step 7: Discuss longer term remediation

Determine measures that would reduce the likelihood of this incident recurring and/or give the team better visibility into conditions that led to this incident. Propose these to the Product Manager for consideration.


About this document

Maintaining and refining this document is the responsibility of the interdisciplinary VRO platform team. Feedback from VRO partner teams and other stakeholders is welcome, although changes should be agreed to by the VRO platform team, to include at least two engineers serving in the on-call rotation.

The flowchart image was generated using draw.io. Source file: incidentResponse.drawio.txt (remove the .txt extension in order to use in draw.io)
