On Call Responsibilities
This document provides comprehensive guidelines on the responsibilities and expectations for team members on call. It consolidates all relevant procedures, including incident response, communication standards, monitoring requirements, and deployment responsibility.
Summary
- On-call rotations align with each two-week sprint, consisting of a primary and a secondary engineer.
- The primary on-call person is the first point of contact for issues but is not necessarily the one who resolves everything. Responsibilities include monitoring application health, responding to outages, performing deployments, addressing SecRel failures, and reviewing Dependabot changes.
- Each reported incident must be converted into a ZenHub ticket and will be prioritized based on severity.
- Given the variety of responsibilities assigned to on-call engineers, and because the primary is not expected to resolve everything alone (see the second bullet above), it is crucial for the on-call engineer to stay aware of their capacity and to ask the secondary on-call engineer, or teammates outside the on-call rotation, for help when needed.
Each two-week sprint will have a story assigned to a primary and secondary engineer responsible for the on-call responsibilities outlined below. The secondary on-call person from a sprint will typically become the primary for the following sprint, and an engineer will volunteer to become the new secondary.
The primary contact's role is to be the first point of contact for issues but not necessarily the one who resolves everything. They should enlist the help of the team as needed and be responsible for tracking the completion of all on-call tasks. The secondary contact should remain accessible to the primary for assistance as required and concentrate on addressing smaller tickets or collaborating on larger ones during the sprint.
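The typical hand-off described above (the secondary becomes the next primary and a volunteer becomes the new secondary) can be sketched as follows. This is only an illustration: `OnCallPair` and `next_rotation` are hypothetical names, the engineer names are placeholders, and the team may deviate from this pattern in any given sprint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OnCallPair:
    """Hypothetical record of who is on call for one sprint."""
    primary: str
    secondary: str


def next_rotation(current: OnCallPair, volunteer: str) -> OnCallPair:
    """Return the pair for the following sprint under the typical pattern.

    The current secondary moves up to primary and the volunteer
    becomes the new secondary.
    """
    if volunteer == current.secondary:
        # The same person cannot hold both roles in the next sprint.
        raise ValueError("volunteer is already the incoming primary")
    return OnCallPair(primary=current.secondary, secondary=volunteer)


# Placeholder names, for illustration only.
this_sprint = OnCallPair(primary="alice", secondary="bob")
print(next_rotation(this_sprint, volunteer="carol"))
```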
The VRO team defines an incident as an unplanned event or occurrence that disrupts normal operations, services, or functions on the VRO platform. On-call engineers must follow the Incident Response process. They are responsible for addressing the incident, providing updates, documenting discovery and remediation steps, and tracking metrics such as MTTR (Mean Time to Resolve) and the impact on partner team application performance.
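As a rough illustration of the MTTR metric, the sketch below averages the time between when each incident was reported and when it was resolved. It assumes incidents are available as (reported, resolved) timestamp pairs. The function name and sample timestamps are hypothetical, and the Incident Response process may define the start and end points differently.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - reported) across all incidents."""
    if not incidents:
        raise ValueError("at least one resolved incident is required")
    durations = [resolved - reported for reported, resolved in incidents]
    # sum() needs a timedelta start value because the default start is the int 0.
    return sum(durations, timedelta()) / len(durations)


# Made-up sample data: one incident resolved in 1 hour, one in 3 hours.
sample = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 10, 0)),
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 17, 0)),
]
print(mean_time_to_resolve(sample))  # 2:00:00
```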
An additional responsibility of on-call engineers is to handle software deployments. The VRO Deployment Policy document describes the scheduled release cadence, pathway to requesting ad-hoc deployments, and the procedures for performing a deployment.
Deployments are scheduled to go out on the first Wednesday of a new sprint at a time that works best for the release captain. Each release contains all VRO platform services and any partner team services requested for release. The #benefits-vro-on-call channel receives an automatic reminder 24 hours before a release is scheduled; the on-call engineer then needs to initiate the workflow that notifies partner teams of the release. During this time, partner teams should either specify the changes being released and any specific hashes to be deployed, or opt out of the release, in which case only VRO platform services will be pushed. The on-call engineer should also be available to handle ad-hoc deployment requests from partner teams to push code to lower, non-prod regions or to push emergency fixes to production.
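The scheduling rule above can be sketched as a small date calculation. This is a minimal example under stated assumptions: the sprint start date and the release time are inputs chosen by the team, a sprint that starts on a Wednesday is treated as releasing that same day, and the function names are hypothetical. The actual reminder is sent automatically and does not depend on this code.

```python
from datetime import date, datetime, time, timedelta

WEDNESDAY = 2  # date.weekday() numbers Monday as 0


def first_wednesday(sprint_start: date) -> date:
    """First Wednesday on or after the sprint start date."""
    days_ahead = (WEDNESDAY - sprint_start.weekday()) % 7
    return sprint_start + timedelta(days=days_ahead)


def reminder_time(sprint_start: date, release_at: time) -> datetime:
    """Moment 24 hours before the scheduled release."""
    release = datetime.combine(first_wednesday(sprint_start), release_at)
    return release - timedelta(hours=24)


# Example: a sprint starting Monday 2024-01-01 with a 13:00 release.
print(first_wednesday(date(2024, 1, 1)))              # 2024-01-03
print(reminder_time(date(2024, 1, 1), time(13, 0)))   # 2024-01-02 13:00:00
```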
Below are the primary Slack channels within the VA's Office of the Chief Technology Officer (OCTO) workspace, which should be monitored for incidents and alerts. A complete list of VRO on-call Slack channels can be seen here.
This channel should be monitored for alerts triggered through DataDog.
This channel should be monitored for communication to and from partner teams.
This channel should be monitored for incidents submitted through the Report a VRO incident workflow and for PagerDuty notifications.
PagerDuty
PagerDuty is a support tool for handling scheduling and alerts for on-call engineers. The tool is configured to align with the engineers assigned as primary and secondary contacts for the sprint. It alerts the team when new incidents are reported, following our custom escalation policies.
DataDog
DataDog (wiki reference) is a cloud-based application monitoring tool that supports custom alerts and visualizations generated from incoming events. These alerts, which primarily relate to service availability and performance, are sent to the #benefits-vro-alerts Slack channel and should be addressed as soon as possible when received.
SecRel
The secure release (SecRel) pipeline is a GitHub Actions workflow that supports continuous delivery outcomes by validating that all software changes have addressed security and privacy risks. SecRel is a critical part of VRO's development work. It allows continuous authority to operate (cATO), which removes the burden of going through lengthy approval processes for each release. The pipeline can be manually run against any branch, automatically run against the develop branch with each merge, and scheduled to run on weekday mornings at 7 AM ET against the develop branch.
One step in this pipeline evaluates our dependencies against a database of known vulnerabilities. If new vulnerabilities are found between one scan and the next, the SecRel pipeline will fail, and signed images will not be published to GitHub's container registry (GHCR) to be available for release. On-call engineers are responsible for reviewing SecRel build results and resolving vulnerabilities as they are discovered.
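When reviewing a failed run, it can help to think of the check as a comparison between two scans. The sketch below is not how SecRel is implemented. It only shows the idea of finding vulnerabilities that appear in the latest scan but not in the previous one, using a hypothetical function and placeholder identifiers.

```python
def new_vulnerabilities(previous_scan: set[str], current_scan: set[str]) -> set[str]:
    """Identifiers present in the current scan but absent from the previous one."""
    return current_scan - previous_scan


# Placeholder identifiers, not real advisories.
previous = {"VULN-0001", "VULN-0002"}
current = {"VULN-0001", "VULN-0002", "VULN-0003"}

introduced = new_vulnerabilities(previous, current)
if introduced:
    # In this sketch, any newly introduced finding would block the release.
    print("New findings to review:", sorted(introduced))
```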
Dependabot
Dependabot is a GitHub tool that automatically opens pull requests to update dependencies to their latest versions. On-call engineers will review the pull requests created by the bot and merge the proposed version updates into the develop branch. Because updating dependencies can introduce defects, integration and unit tests should be run against the new builds.
Report a VRO incident Slack Workflow
This workflow is the primary intake process for escalating incidents on the VRO platform and is used by both the VRO team and partner teams. To start it, click the bookmark in #benefits-vro or #benefits-vro-on-call. For more information on how to use the workflow, see Incident Response.
Aqua
Aqua scans images for operating system vulnerabilities, malware, and insecure configurations, and monitors images deployed into production, as part of the SecRel pipeline. The results from this scan, in part, help determine whether a SecRel scan passes. Results are shown in the SecRel run's summary. They may include new vulnerabilities discovered between runs or as the result of version upgrades, and they should be remediated through LHDI's Aqua webpage. Remediation usually involves requesting exceptions or upgrading to recommended versions of software.
For all AWS-related vulnerabilities, please contact the Lighthouse cATO Technical Application Assessor (currently Andrew Palopoli) to have them addressed.
Snyk
Snyk scans developer source code and third-party libraries, or dependencies, for known vulnerabilities as part of the SecRel pipeline. The results from this scan, in part, help determine whether a SecRel scan passes. Results from these scans are displayed in a SecRel run's summary and on the repository's [Security](https://github.com/department-of-veterans-affairs/abd-vro-internal/security) tab.
For vulnerabilities that involve upgrading libraries, review the logs for the failed stage in the SecRel run and open any necessary PRs (example).