
Demystifying On Call and Incident Management for your team.


Being on call, and handling the production incidents that follow, can be nerve-wracking for engineers of every experience level. I’m sharing the team handbook page I wrote for my current team to demystify on call and incident management, and (hopefully) make it seem less scary. We use PagerDuty for this at Unity, but it should be broadly applicable for any incident management tooling.

Feel free to share this if you find it useful yourself.

In future, I’ll share the post-mortem template I wrote for my engineering group to help us focus on the most pertinent details to record and explore.



On Call

This page outlines the expectations of going on call, and the typical flow for engineers responding to an incident.

It’s not a strict set of rules; we aim to follow these guidelines, but we use our best judgement in doing so.

Expectations

Incident management

Before being on call

SCHEDULED

Welcome to the rota! If the scheduled dates don’t match your availability - for example, they clash with holidays - raise it with the team and drive for alignment with other engineers in the rota to make sure we have coverage during those clashing periods.

Ensure you are reachable through PagerDuty. PagerDuty’s Notification Settings let you configure this. Please check these are accurate and use the built-in testing functionality to ensure you’re reachable.
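
If you like to script this sanity check, a minimal sketch along the following lines lists your contact methods and notification rules via the PagerDuty REST API. The token environment variable name and the user ID are placeholders; adapt them to your own setup.

```python
import os

import requests

API_TOKEN = os.environ["PAGERDUTY_API_TOKEN"]  # placeholder env var name
USER_ID = "PXXXXXX"                            # your PagerDuty user ID

HEADERS = {
    "Authorization": f"Token token={API_TOKEN}",
    "Content-Type": "application/json",
}

# List both contact methods and notification rules for a quick sanity check.
for resource in ("contact_methods", "notification_rules"):
    resp = requests.get(
        f"https://api.pagerduty.com/users/{USER_ID}/{resource}",
        headers=HEADERS,
        timeout=10,
    )
    resp.raise_for_status()
    print(resource)
    for item in resp.json()[resource]:
        print(f"  - {item.get('summary') or item.get('type')}")
```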

Being Paged

ALERTED

PagerDuty has notified you of an alert. Being paged (i.e. called upon by PagerDuty) can be scary and induce panic. Please do not rush! Grab a drink, shake yourself loose and start investigating.

When attending to an alert, you need to determine whether an incident should be raised. The main determining factor for that is impact.

As a rule of thumb:

If you’ve been called out for non-critical circumstances - for example, it’s not impacting customers broadly and could have waited until the next working day to be actioned - bring that information to the team during work hours so we can reduce false alarms and unnecessary callouts. Alert fatigue (link) creeps up on us and we should stamp it out as soon as we notice it.

Starting an incident

RAISE AN INCIDENT

The alert points to users being impacted, or we’ve received reports from users that our services are broken. To ensure this situation gets the attention and focus it deserves, an incident should be opened. If we’re unsure, err on the side of caution and open one anyway.

  1. Open an incident using Slack/incident.io/tool-of-choice, adding the necessary metadata to identify the area of impact (one way to script this is sketched below).
  2. Provide context – what are you seeing, what is the impact, what are the next steps.
  3. Pull your engineering team and your dedicated support team into the channel.

In doing so, you’ve assumed the Incident Commander role.
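
If your tool-of-choice exposes an API, the first step can also be scripted. The sketch below uses PagerDuty’s REST API purely as an illustration - the token, service ID, email address and incident details are placeholders, and your own tooling (a Slack workflow, incident.io, etc.) will have its own equivalent.

```python
import os

import requests

API_TOKEN = os.environ["PAGERDUTY_API_TOKEN"]  # placeholder env var name
SERVICE_ID = "PXXXXXX"                         # the impacted service's ID
FROM_EMAIL = "you@example.com"                 # the requester's PagerDuty email

payload = {
    "incident": {
        "type": "incident",
        "title": "Checkout API returning 5xx for ~20% of requests",
        "service": {"id": SERVICE_ID, "type": "service_reference"},
        # Step 2: context - what you're seeing, the impact, and next steps.
        "body": {
            "type": "incident_body",
            "details": (
                "Error rate spiked at 14:05 UTC; users cannot complete "
                "checkout. Next step: roll back the 14:00 deploy."
            ),
        },
    }
}

resp = requests.post(
    "https://api.pagerduty.com/incidents",
    json=payload,
    headers={
        "Authorization": f"Token token={API_TOKEN}",
        "Content-Type": "application/json",
        "From": FROM_EMAIL,
    },
    timeout=10,
)
resp.raise_for_status()
print("Incident opened:", resp.json()["incident"]["html_url"])
```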

At this point, your primary goal is to mitigate/stop the impact, not necessarily to fix the root cause. We want to restore normal service as quickly and as safely as possible.

Your first step should be to make use of the affected services’ runbooks to mitigate the issue. Fixing the issue itself isn’t all on you – ensuring the right people are involved and work is being done to get it fixed is.

If you’re unable to resolve it and the issue is:

Communication is key throughout, and especially at this stage once the wider alarm has been rung. Keep the incident channel and related metadata updated with the current state and actions that are being taken.
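
One lightweight way to keep the channel updated is a short, structured status message at regular intervals. The sketch below assumes a Slack incoming webhook configured for the incident channel; the webhook environment variable and the update contents are placeholders, and your incident tooling may provide status updates of its own.

```python
import os

import requests

WEBHOOK_URL = os.environ["INCIDENT_CHANNEL_WEBHOOK"]  # placeholder env var name

# A short, structured update: current state, actions in flight, next update.
update = (
    "*Status update 14:25 UTC*\n"
    "*Current state:* error rate down from 20% to 5% after rollback\n"
    "*Actions in flight:* verifying the message queue backlog drains\n"
    "*Next update:* 14:45 UTC"
)

resp = requests.post(WEBHOOK_URL, json={"text": update}, timeout=10)
resp.raise_for_status()
```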

Request actions of engineers responding to the incident - they are there to take direction from you and work on the most important next thing required to get the situation fixed and user impact mitigated. Coordinate the efforts of the incident responders debugging and working through the issue. Ensure outcomes are captured and communicated within the channel.

If there are others involved in the incident and you want to synchronise actions or tackle them together in sequence, consider starting a Zoom/Google Meet call to get more eyes on the immediate issue.

Being pulled into an in-progress incident

RESPONDING

You’ve been pulled into an active incident. As an incident responder, you are here to help the Incident Owner/Commander get a live issue fixed and the impact mitigated.

Announce that you’re available to help. Search out the service runbook to understand what the recommended steps are for this situation. If you have suggestions of what to look at and/or actions that could be taken to resolve the issue, raise them in the incident channel and tag the Incident Commander. Take direction from the Incident Commander.

Communication is key - provide regular updates on your actions and their outcomes. Bring data such as dashboard links to the channel to show current impact and any effects of actions taken. Avoid polluting the channel with unnecessary messages that don’t help resolve the situation. This ensures focus and attention can be dedicated to mitigation efforts - protect the signal-to-noise ratio of the channel.

While the incident is active and causing impact, the main goal is mitigating that impact. Strategic thinking and actions to achieve that will come later.

Impact mitigated

SUCCESS

The impact has been mitigated! Users are able to receive the service they expect from our products and are no longer blocked in achieving their goals. Great work!

Incident Commander: ensure the channel and any stakeholders are updated that service has resumed and impact has been mitigated. Using your incident tooling to resolve the incident will help communicate this. With this breathing room, dig into details about how the scenario came about with the other incident responders. 
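
As an illustration only, resolving the incident via PagerDuty’s REST API might look like the sketch below; the token, incident ID and email address are placeholders, and most incident tools offer an equivalent resolve action in their UI.

```python
import os

import requests

API_TOKEN = os.environ["PAGERDUTY_API_TOKEN"]  # placeholder env var name
INCIDENT_ID = "QXXXXXX"                        # the open incident's ID
FROM_EMAIL = "you@example.com"                 # the resolver's PagerDuty email

resp = requests.put(
    f"https://api.pagerduty.com/incidents/{INCIDENT_ID}",
    json={"incident": {"type": "incident_reference", "status": "resolved"}},
    headers={
        "Authorization": f"Token token={API_TOKEN}",
        "Content-Type": "application/json",
        "From": FROM_EMAIL,
    },
    timeout=10,
)
resp.raise_for_status()
print("Incident status:", resp.json()["incident"]["status"])
```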

Incident Responders: this stage should be focused on gathering data and homing in on the root cause of the issue. The goal is to build a shared understanding of the impacted area in the run-up to and including the incident window. It’s a fact-gathering exercise that might involve investigating service dependencies as well as the affected service. Again, announce what you’re doing in the channel for the Incident Commander to see and help coordinate.

If the mitigation was a short-term fix (for example, scaling up a struggling service), look for longer-term fixes that will further reduce the risk of recurrence.
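
For concreteness, a short-term mitigation like that scale-up might look like the sketch below, assuming the service runs as a Kubernetes Deployment managed with the official Python client; the deployment name and namespace are hypothetical. The longer-term fix might instead be an autoscaling policy or removing the underlying bottleneck.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
apps = client.AppsV1Api()

DEPLOYMENT = "checkout-api"   # hypothetical deployment name
NAMESPACE = "production"      # hypothetical namespace

# Read the current scale and bump replicas as a stop-gap mitigation.
scale = apps.read_namespaced_deployment_scale(DEPLOYMENT, NAMESPACE)
scale.spec.replicas += 2
apps.replace_namespaced_deployment_scale(DEPLOYMENT, NAMESPACE, scale)
print(f"{DEPLOYMENT} scaled to {scale.spec.replicas} replicas")
```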

Write up, share, learn

Incident Commander: with the incident resolved, start completing the post-mortem document - it’s often useful to begin with the timeline of events to frame the details for yourself, before defining the root cause and quantifying the impact.

Incident Responders: reach out to the Incident Commander about what can be done to help here. There may still be open questions that require investigation.

There may be strategic actions that should be written up. If the incident was due to use of outdated libraries, propose how they can be kept up to date reliably.

A powerful mindset for strategic actions is thinking about the “category” of issue, and exploring how that category of issue could be avoided entirely in future. For example, if it relates to a service being overloaded with requests, consider how rate limiting could be rolled out across services to avoid that category of problem.
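
As a sketch of what that category-level fix could build on, here is a minimal token-bucket rate limiter in plain Python. A real rollout would more likely live in an API gateway or shared middleware, and the numbers here are purely illustrative.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage: one bucket per client, e.g. 5 requests/second with bursts of 10.
limiter = TokenBucket(rate=5, capacity=10)
if not limiter.allow():
    print("429 Too Many Requests")  # reject or queue the request
```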

With the post-mortem written up, schedule a post-mortem session with team members and those involved in the incident to discuss it and the actions we’re taking away, taking on board feedback and suggestions.

SUCCESS

Good job! You helped fix a production issue impacting our customer base, and helped strengthen our systems against similar failures in future. That’s a big win.


