One of the teams I worked with would do an “engineering pain-point” survey twice a year. During one of those surveys, the main complaint was that on-calls had a hard time getting help from other teams or even engineers from their own team. I shadowed an on-call rotation for a couple of weeks and noticed this pattern:
- The on-call receives a notification of a problem, either through an alert, a graph indicating an issue, or a report from a colleague.
- The on-call needs to perform a remedial action. Usually:
  - Reverting a code change by someone else
  - Disabling a feature flag owned by a different team
  - Approaching someone from another team to get their opinion or ask them to revert something. Sometimes, this meant waking up that person.
In each of these cases, because the on-call was interrupting someone else, a discussion often followed (civil or otherwise) about whether the problem was a ‘real’ emergency. I have personally experienced a situation where I had to wake up the network on-call engineer at the company because all of the servers in a region were running and healthy, but were inaccessible to users. The response I received (understandable given that it was 3 AM on a weekend) was:
“Is this really an emergency? You have other regions.” I insisted, and eventually the network on-call found a load balancer that was accepting connections but blackholing the traffic. I can imagine a less experienced version of myself struggling to stand my ground, delaying the resolution of the problem by at least a day.
At this point, we had no consensus, either inside or outside the team, on what constitutes an emergency. The extremes were obvious: the whole service being down is bad, and an image being off by one pixel can wait for working hours. The cases in between, such as “we can’t release to prod because our end-to-end testing is broken” or “we have increased latency because a CDN wasn’t retaining our assets”, were not obvious, and every such incident necessitated a lengthy “how bad is it” discussion.
This is exhausting, especially when it happens at 3 AM and you’re not very proficient at handling crises.
I started working on a guideline for “what is an emergency?” and on getting everyone on board.
To establish a consensus on “what kind of problems can we have, and how bad are they?” I used a mostly-growing concentric circle approach:
- I started with myself, drawing on my past memories and reviewing historical documentation of emergencies we had faced.
- My SRE comrades formed the second tier. At least one of us was likely to have been involved in any given emergency, and we had more context and passion about the prioritization problem.
- All of our team’s engineers came third. Involving them had multiple advantages:
  - It reflected the work being done and demonstrated that we care about the problem.
  - It provided insight into emergencies that we might be missing, as we were sometimes blind to emergencies that we were not currently monitoring for.
  - It identified interested parties who might be willing to work on this consensus or promote it within the team.
- Non-tech stakeholders were then consulted. They were able to ratify what we considered to be emergencies and what the thresholds were (e.g. how many user disconnects should be considered an emergency).
- The external teams that we depended on (networking, hardware) were next. Having them in the loop meant that once we agreed on what constituted an emergency, pushback on our requests for help could be ended by referring to the document.
- The external teams that depended on us (e.g. video, which owned their user experience end-to-end) formed the final tier. This worked like the networking team but in reverse. We were less likely to be asked to stop our release pipeline or be woken up at night for something that we did not agree was an emergency.
I maintained a comprehensive record of the entire process in a Google document and ensured its accessibility to anyone interested in it. Additionally, I regularly revisited the document to ensure that:
- We accurately defined thresholds, such as “How many users can we lose before we need to wake up?”
- We kept the definitions simple and easy to understand, avoiding complex graphs, decision trees, or jargon. This made it possible for external teams or non-engineers to comprehend our criteria and confirm their validity.
- Every criterion was backed by a real-time metric that could notify us within a minute of an issue being detected; where no such metric existed, we created one.
After the document was approved by everybody, I worked on mirroring it in our monitoring infrastructure, making sure that we were woken up iff (if and only if) a matching emergency criterion in the document was met.
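One way to keep that “iff” honest is to periodically compare the criteria listed in the document against the alerts that are allowed to page. The following is only a minimal sketch under assumptions: the function name `audit_paging` and the criterion names are hypothetical, and it presumes both sides can be exported as sets of names.

```python
def audit_paging(documented: set[str], paging_alerts: set[str]) -> dict[str, set[str]]:
    """Compare documented emergency criteria with alerts that can page.

    Both arguments are sets of criterion names (hypothetical identifiers).
    An empty result on both sides means we page if and only if the
    document says so.
    """
    return {
        # Criteria agreed in the document that nothing would page for.
        "criteria_without_alert": documented - paging_alerts,
        # Alerts that can wake someone up without an agreed criterion.
        "alerts_without_criterion": paging_alerts - documented,
    }


# Hypothetical names, for illustration only.
report = audit_paging(
    documented={"response_time_over_3s", "memory_over_90pct"},
    paging_alerts={"response_time_over_3s", "disk_almost_full"},
)
```

In this example the report would flag `memory_over_90pct` as a documented criterion with no alert, and `disk_almost_full` as an alert that nobody agreed should wake people up.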
How the solution looked
The actual numbers and metrics aren’t as interesting as the overall structure, so the structure is what I’ll share.
Types of emergency
- “Things are on fire”: This type of emergency requires immediate attention, and it’s acceptable to wake people up or reach out to them on weekends.
- “Problem”: This type of emergency is a high priority but should only be addressed during work hours. There’s no need to wake people up, but other work should be deprioritized to resolve the problem.
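The two types differ mainly in when people may be interrupted, which can be written down as a small policy function. This is a sketch rather than what we actually ran: the work-hours window (Monday to Friday, 09:00 to 18:00) and the names `Emergency`, `is_work_hours`, and `response_for` are assumptions made for illustration.

```python
from datetime import datetime
from enum import Enum


class Emergency(Enum):
    ON_FIRE = "things are on fire"
    PROBLEM = "problem"


# Assumed work hours; the real window depends on the team.
WORK_START_HOUR = 9
WORK_END_HOUR = 18


def is_work_hours(now: datetime) -> bool:
    """True on Monday to Friday between the assumed start and end hours."""
    return now.weekday() < 5 and WORK_START_HOUR <= now.hour < WORK_END_HOUR


def response_for(kind: Emergency, now: datetime) -> str:
    """Return the expected response for an emergency type at a given time."""
    if kind is Emergency.ON_FIRE:
        # Acceptable to wake people up or reach out on weekends.
        return "page now"
    if is_work_hours(now):
        # High priority: other work is put aside to resolve it.
        return "deprioritize other work"
    # Nobody is woken up for a "problem".
    return "wait for work hours"
```

For example, a “problem” detected at 3 AM on a Saturday yields `"wait for work hours"`, while “things are on fire” at the same moment yields `"page now"`.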
We ensured that every metric was clearly defined and accompanied by a brief explanation of why it was important to measure. Our metric selection process involved three types of metrics:
- User experience proxies: Metrics such as average response time, session crash rate, and number of engagements per second were easily measurable and indicative of a bad user experience. We reached a quick consensus on these metrics as they aligned with our team’s top priority of ensuring a positive user experience.
- Pure infra: Metrics such as CPU/memory utilization, ingress/egress rate, and process restarts were hotly debated as to whether they should trigger an alert. However, our experience showed that these metrics were often the leading indicators of an impending user experience issue. Thus, we agreed to wake up on extreme values of these metrics, even if user engagement was not immediately affected.
- Delayed user experience metrics: Metrics such as time spent, monthly active users, and bug reports were important to non-technical stakeholders but were not easily measurable in real time. Although it was possible to approximate these metrics with some automation, we ultimately decided to manually check them and keep them as slow metrics. We left the door open to revisiting this decision later.
The resulting conditions are straightforward and easy to understand, even in the middle of the night. They look like this:
- If the average response time goes over 1.5 seconds for 20 minutes, it’s an emergency class 2 (problem).
- If the average response time goes over 3 seconds for 10 minutes, it’s an emergency class 1 (things are on fire).
- If the memory utilization goes over 90% for 10 minutes, it’s an emergency class 2.
- If the time spent today drops 20% compared to the same day last week, it’s an emergency class 2.
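Conditions of this shape (a threshold held for a duration, mapped to a class) translate almost directly into data. The sketch below is illustrative only: the metric names, the per-minute sampling, and the `Rule` and `classify` names are assumptions, not our actual monitoring configuration. The “time spent” condition is left out because it was a slow metric that we checked manually.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str              # hypothetical metric name
    threshold: float         # value that must be exceeded
    sustained_minutes: int   # for how long it must be exceeded
    emergency_class: int     # 1 = things are on fire, 2 = problem


# The real-time example conditions from the list above.
RULES = [
    Rule("avg_response_time_s", 1.5, 20, 2),
    Rule("avg_response_time_s", 3.0, 10, 1),
    Rule("memory_utilization_pct", 90.0, 10, 2),
]


def classify(samples, rules=RULES):
    """Return the most severe matching emergency class, or None.

    `samples` maps a metric name to per-minute readings, oldest first
    (an assumed format). A rule matches when every reading in the last
    `sustained_minutes` minutes is above its threshold.
    """
    matched = []
    for rule in rules:
        window = samples.get(rule.metric, [])[-rule.sustained_minutes:]
        long_enough = len(window) == rule.sustained_minutes
        if long_enough and all(value > rule.threshold for value in window):
            matched.append(rule.emergency_class)
    # Class 1 is more severe than class 2, so the smallest number wins.
    return min(matched) if matched else None
```

Keeping the rules as plain data means the same list can be read by a human reviewing the document and by whatever evaluates the alerts.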
The project didn’t end with publishing the document. We continuously refined the document and the corresponding monitoring infrastructure by adjusting thresholds and metrics.
Additionally, I monitored the on-call team’s efforts to request assistance and was pleased to observe a significant reduction in issues. The document’s acceptance by the teams provided a shared point of reference that helped to minimize any potential friction.