Chapter 39: Escalation

Once an incident is detected, the next critical decision is how urgently to respond and who needs to be involved. Escalation is the process of raising an incident to the right people at the right time. Under-escalation leaves a serious problem in the hands of too few responders; over-escalation wastes attention and breeds cynicism about the severity system.

Severity levels provide a shared vocabulary for incident urgency. A common scheme uses four tiers: SEV1 for total service outages affecting all users, SEV2 for major degradation affecting a large subset, SEV3 for partial issues with limited user impact, and SEV4 for minor problems with no immediate user impact. Each severity level maps to a specific response: who gets paged, how quickly they must respond, and what communication is expected. Clear definitions prevent debates about severity during the stress of an active incident.
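One way to keep severity definitions out of debate during an incident is to write the mapping down as data. The sketch below is illustrative only: the role names, acknowledgement deadlines, and communication expectations are assumptions chosen for the example, not values prescribed by this chapter.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional, Tuple


class Severity(IntEnum):
    SEV1 = 1  # total service outage affecting all users
    SEV2 = 2  # major degradation affecting a large subset of users
    SEV3 = 3  # partial issue with limited user impact
    SEV4 = 4  # minor problem with no immediate user impact


@dataclass(frozen=True)
class ResponsePolicy:
    page: Tuple[str, ...]        # roles to page (hypothetical role names)
    ack_minutes: Optional[int]   # deadline to acknowledge; None means no page is sent
    communication: str           # what stakeholders should expect


# All values below are assumed for illustration; tune them to your organization.
POLICIES = {
    Severity.SEV1: ResponsePolicy(
        ("primary_oncall", "secondary_oncall", "incident_commander"),
        5,
        "public status page plus updates every 30 minutes",
    ),
    Severity.SEV2: ResponsePolicy(
        ("primary_oncall", "incident_commander"),
        15,
        "internal incident channel plus hourly updates",
    ),
    Severity.SEV3: ResponsePolicy(
        ("primary_oncall",),
        60,
        "ticket update when resolved",
    ),
    Severity.SEV4: ResponsePolicy(
        (),
        None,
        "tracked in the backlog, no live communication",
    ),
}


def policy_for(severity: Severity) -> ResponsePolicy:
    """Look up the agreed response for a severity level."""
    return POLICIES[severity]
```

Keeping the table in one place means a responder declares a severity and the response follows from it, rather than being renegotiated each time.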

The incident commander role is the cornerstone of escalation. This person owns the incident response: they coordinate responders, delegate investigation tasks, decide when to escalate further, and ensure that communication flows to stakeholders. The incident commander does not need to be the most senior engineer; what matters is the ability to organize a response calmly under pressure.

Multi-service incidents, where a failure in one system cascades through others, require cross-team coordination. The monitoring service can reveal which services are affected, and the site event framework provides the structure for bringing the right teams together. Paging policies should account for time zones, on-call rotations, and backup responders to avoid single points of failure in the human response chain.
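A paging policy with backups and time zone awareness can be sketched as follows. This is a minimal illustration under stated assumptions: the `Responder` type, the fixed UTC offsets (which ignore daylight saving time), the 08:00 to 22:00 "daytime" window, and the rule of paging one more person per missed acknowledgement window are all hypothetical choices, not a description of any particular paging product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass(frozen=True)
class Responder:
    name: str
    utc_offset_hours: int  # fixed offset; a simplification that ignores DST
    day_start: int = 8     # assumed local hour at which paging is reasonable
    day_end: int = 22


def is_daytime(r: Responder, now_utc: datetime) -> bool:
    """True if it is currently daytime for the responder.

    now_utc must be a timezone-aware datetime.
    """
    local = now_utc.astimezone(timezone(timedelta(hours=r.utc_offset_hours)))
    return r.day_start <= local.hour < r.day_end


def page_order(rotation: List[Responder], now_utc: datetime) -> List[Responder]:
    """Primary on-call first, then backups, preferring backups in local daytime.

    sorted() is stable, so backups keep their rotation order within each group.
    """
    if not rotation:
        return []
    primary, backups = rotation[0], rotation[1:]
    ordered = sorted(backups, key=lambda r: not is_daytime(r, now_utc))
    return [primary] + ordered


def paged_so_far(order: List[Responder], minutes_unacked: int,
                 ack_timeout_minutes: int) -> List[Responder]:
    """Everyone who should have been paged after minutes_unacked without an ack.

    One additional responder is paged each time the timeout elapses.
    """
    count = 1 + minutes_unacked // ack_timeout_minutes
    return order[:count]


# Example rotation with hypothetical responders.
rotation = [
    Responder("alice", utc_offset_hours=-8),
    Responder("bob", utc_offset_hours=-5),
    Responder("chandra", utc_offset_hours=1),
]
now = datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc)
order = page_order(rotation, now)
print([r.name for r in order])
print([r.name for r in paged_so_far(order, 7, 5)])
```

The primary is always paged first because they hold the rotation; time zones only reorder the backups, so an unacknowledged page reaches someone who is likely awake before it reaches someone who is not.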