Chapter 28: Outages

An outage is the visible consequence of faults that overwhelm the system's fault tolerance. When enough components fail simultaneously, or when a cascading failure propagates through dependencies, the system can no longer serve its users. Outages are among the most consequential events in the life of a distributed system.

Outages have many causes. Hardware failures take down individual servers or entire racks. Software bugs can cause all instances of a service to crash simultaneously. Configuration errors can break routing, weaken security, or set capacity parameters incorrectly. Dependency failures can cascade: when a service fails, requests back up in its dependents until they exhaust their own resources and fail in turn. Overload occurs when traffic exceeds the system's capacity.
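The overload and queueing behavior described above can be illustrated with a small discrete-time model. This is a minimal sketch under simplifying assumptions, not a model of any particular system: the function name `simulate_backlog`, its parameters, and the fixed arrival and service rates are all hypothetical, chosen only for illustration.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick, queue_limit, ticks):
    """Track queued requests when arrivals and capacity are constant per tick.

    Hypothetical model: requests beyond queue_limit are dropped, which is the
    point at which users begin to see failures.
    """
    backlog = 0
    dropped = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick                 # new requests arrive
        backlog -= min(backlog, capacity_per_tick)   # serve as many as capacity allows
        if backlog > queue_limit:                    # queue is full: shed the excess
            dropped += backlog - queue_limit
            backlog = queue_limit
        history.append(backlog)
    return history, dropped


# Traffic below capacity: the queue stays empty.
print(simulate_backlog(100, 120, 500, 5))   # ([0, 0, 0, 0, 0], 0)

# Traffic above capacity: the backlog grows every tick until the queue fills,
# after which requests are dropped.
print(simulate_backlog(150, 100, 200, 6))   # ([50, 100, 150, 200, 200, 200], 100)
```

Even in this simplified model, a sustained excess of arrivals over capacity does not produce a steady slowdown. The backlog grows without bound until some limit is reached, which is why a dependent service that merely queues requests for a failed dependency eventually fails itself.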

The impact of an outage depends on its scope (how many users are affected), duration (how long it lasts), and severity (whether data is lost or merely unavailable). A one-minute partial outage affecting 1% of users is very different from a one-hour complete outage with data loss. Incident classification systems help organizations triage and respond to outages appropriately.
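As a rough illustration of how scope, duration, and severity might be combined into a classification, consider the following sketch. The severity labels (`SEV1` to `SEV3`), the thresholds, and the impact formula are assumptions made up for this example; real organizations define their own levels and criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    affected_fraction: float  # scope: share of users affected, 0.0 to 1.0
    duration_minutes: float   # duration: how long the outage lasted
    data_lost: bool           # severity: True if data was lost, not just unavailable


def classify(outage: Outage) -> str:
    """Return a hypothetical severity level from SEV1 (worst) to SEV3."""
    if not 0.0 <= outage.affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0.0 and 1.0")
    if outage.duration_minutes < 0:
        raise ValueError("duration_minutes must be non-negative")

    # Illustrative impact score: fraction of users affected times minutes of outage.
    impact = outage.affected_fraction * outage.duration_minutes

    # Assumed thresholds: data loss is always treated as the highest level.
    if outage.data_lost or impact >= 30.0:
        return "SEV1"
    if impact >= 1.0:
        return "SEV2"
    return "SEV3"


# The two cases contrasted in the text:
partial = Outage(affected_fraction=0.01, duration_minutes=1, data_lost=False)
complete = Outage(affected_fraction=1.0, duration_minutes=60, data_lost=True)
print(classify(partial))   # SEV3
print(classify(complete))  # SEV1
```

Treating data loss as an automatic escalation reflects the distinction drawn above between data that is lost and data that is merely unavailable: unavailability ends when the outage does, while lost data may not be recoverable.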

The most important lesson from outages is that they should be studied, not just survived. Post-incident reviews, often conducted as blameless postmortems, identify the root causes and contributing factors of an outage and define remediation actions intended to prevent similar outages in the future. The insights from these reviews, accumulated over time, become the institutional knowledge that makes a system more resilient.