Chapter 37: Site Events
A site event is a significant incident that affects the availability, performance, or correctness of the planetary-scale computer. Site events range from minor (a brief latency spike affecting a single service) to major (a complete data center outage lasting hours). How an organization detects, responds to, and learns from site events determines the long-term reliability of its systems.
The lifecycle of a site event has distinct phases: detection, triage, mitigation, resolution, and post-incident review. Detection should be automated through the monitoring service, with alerts firing when health metrics cross their thresholds or when anomalies appear in traffic patterns. The goal is to detect incidents before users report them.
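The two detection triggers described above, a fixed threshold and an anomaly relative to recent behavior, can be sketched as follows. This is a minimal illustration rather than the monitoring service's actual logic: the function names, the z-score test, and the default limit of 3.0 are all assumptions chosen for the example.

```python
from statistics import mean, stdev
from typing import Sequence


def threshold_alert(value: float, threshold: float) -> bool:
    """Fire when a health metric exceeds a fixed threshold (hypothetical rule)."""
    return value > threshold


def anomaly_alert(history: Sequence[float], current: float,
                  z_limit: float = 3.0) -> bool:
    """Fire when `current` deviates strongly from recent history.

    Uses a simple z-score test, which is an assumption made for this
    sketch; real anomaly detectors are usually more elaborate.
    """
    if len(history) < 2:
        # Not enough data to estimate a standard deviation.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all counts as anomalous.
        return current != mu
    return abs(current - mu) / sigma > z_limit


# Example: an error rate above a 5% threshold, and a traffic spike
# measured against a stable requests-per-second history.
error_rate_alert = threshold_alert(0.07, threshold=0.05)
traffic_alert = anomaly_alert([100, 102, 98, 101, 99], current=250)
```

A fixed threshold is easy to reason about but needs tuning per metric, while a history-based test adapts to each service's normal range at the cost of occasional false alarms during legitimate traffic shifts.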
Triage determines the severity and scope of the incident. Is it affecting all users or a subset? Is data at risk? Is the incident spreading? The answers to these questions determine the response: who is paged, what communication goes out, and what immediate actions are taken. Clear severity definitions and escalation procedures prevent confusion during high-stress incidents.
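As an illustration of how clear severity definitions turn the triage questions into a response, the sketch below maps the three answers to a severity level and a response plan. Everything specific here is hypothetical: the level names SEV1 to SEV3, the 50% and 5% cutoffs, and the paging and communication entries are placeholders that an organization would replace with its own definitions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity levels; lower number means more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class TriageAnswers:
    fraction_users_affected: float  # 0.0 to 1.0: all users or a subset?
    data_at_risk: bool              # is data at risk?
    spreading: bool                 # is the incident spreading?


def classify(answers: TriageAnswers) -> Severity:
    """Map triage answers to a severity. The cutoffs are illustrative only."""
    if answers.data_at_risk or answers.fraction_users_affected >= 0.5:
        return Severity.SEV1
    if answers.spreading or answers.fraction_users_affected >= 0.05:
        return Severity.SEV2
    return Severity.SEV3


# Hypothetical escalation table: who is paged and what communication goes out.
RESPONSE_PLAN = {
    Severity.SEV1: {"page": ["on-call engineer", "incident commander"],
                    "communication": "public status update"},
    Severity.SEV2: {"page": ["on-call engineer"],
                    "communication": "internal incident channel"},
    Severity.SEV3: {"page": [],
                    "communication": "ticket for the owning team"},
}

severity = classify(TriageAnswers(fraction_users_affected=0.1,
                                  data_at_risk=False, spreading=False))
plan = RESPONSE_PLAN[severity]
```

Encoding the definitions as data rather than leaving them to judgment means the same answers always produce the same escalation, which is the property that prevents confusion under stress.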
Mitigation focuses on restoring service as quickly as possible, even if the root cause is not yet understood. Common mitigation actions include rolling back a recent deployment, failing over to a healthy replica, shedding load to reduce pressure on overloaded components, or disabling a misbehaving feature. Root cause analysis comes later, during the post-incident review.
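The common mitigation actions listed above can each be tied to an observable signal, so that responders get a list of candidates without first knowing the root cause. The sketch below is one possible arrangement, not a prescribed procedure: the signal names and the order in which actions are suggested are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IncidentSignals:
    """Observable facts about an ongoing incident (hypothetical fields)."""
    recent_deploy: bool              # a deployment went out shortly before onset
    healthy_replica_available: bool  # a replica exists that can take traffic
    overloaded: bool                 # components are saturated
    suspect_feature: str = ""        # name of a misbehaving feature, if any


def candidate_mitigations(signals: IncidentSignals) -> List[str]:
    """Return mitigation actions that apply, without needing the root cause.

    The ordering is an illustrative assumption, not a recommendation.
    """
    actions: List[str] = []
    if signals.recent_deploy:
        actions.append("roll back the recent deployment")
    if signals.healthy_replica_available:
        actions.append("fail over to a healthy replica")
    if signals.overloaded:
        actions.append("shed load")
    if signals.suspect_feature:
        actions.append(f"disable feature '{signals.suspect_feature}'")
    return actions


# Example: latency rose right after a deploy and the service is saturated.
actions = candidate_mitigations(
    IncidentSignals(recent_deploy=True, healthy_replica_available=False,
                    overloaded=True)
)
```

None of these checks asks why the incident happened. Each one only asks whether a fast, reversible action is available, which keeps mitigation separate from the root cause analysis that follows in the post-incident review.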