Chapter 42: Prevention
The best incident is the one that never happens. Prevention shifts the focus from reactive response to proactive resilience. While no system can prevent all failures, deliberate engineering practices can eliminate entire classes of incidents and reduce the severity of those that do occur.
Chaos engineering is the practice of deliberately injecting controlled failures into running systems, often in production, to discover weaknesses before they cause real incidents. Game days, scheduled exercises in which teams simulate large-scale failures, build both technical resilience and human readiness. Pre-mortems invert the post-incident review: before launching a new system, the team imagines that it has already failed catastrophically and works backward to identify the likely causes. Combined with thorough design reviews, these practices catch problems that testing alone cannot reveal.
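As a rough illustration of failure injection, the sketch below wraps a function so that a configurable fraction of calls fail on purpose. The names inject_faults and InjectedFault are hypothetical, invented for this example; real chaos tooling usually works at the infrastructure level (killing hosts, dropping network traffic) rather than inside application code.

```python
import functools
import random


class InjectedFault(RuntimeError):
    """Hypothetical marker exception for deliberately injected failures."""


def inject_faults(failure_rate, rng=None):
    """Return a decorator that fails roughly `failure_rate` of calls.

    `failure_rate` is a probability in [0.0, 1.0]. Passing a seeded
    random.Random as `rng` makes an experiment repeatable.
    """
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # rng.random() returns a float in [0.0, 1.0), so a rate of
            # 0.0 never injects and a rate of 1.0 always injects.
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

Wrapping a dependency call with a small failure rate in a test or staging environment shows whether callers retry, degrade, or crash when that dependency misbehaves.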
Automating manual toil is a powerful preventive measure. Every manual step in a runbook is an opportunity for human error under stress. Automating routine operational tasks such as certificate rotation, capacity scaling, and failover procedures removes these error-prone steps and frees engineers to focus on novel problems. Defense in depth aims to keep any single failure from cascading into a site-wide outage: redundant components, circuit breakers, bulkheads, and graceful degradation all contribute to a system that bends rather than breaks.
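To make one of these mechanisms concrete, here is a minimal circuit breaker sketch. It is an illustration under simplifying assumptions, not a production implementation: it is single-threaded, it counts consecutive failures only, and the names CircuitBreaker, CircuitOpenError, failure_threshold, and reset_timeout are chosen for this example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of hitting the struggling dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call re-opens the breaker immediately;
            # otherwise open once the threshold is reached.
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would typically catch CircuitOpenError and serve a cached or reduced response, which is where the breaker connects to graceful degradation.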
Prevention is ultimately a cultural practice. Organizations that invest in security reviews, load testing, and blameless post-incident processes build a culture where reliability is everyone's responsibility. Tracking incidents by recurrence class, and confirming that the same type of incident does not happen twice, provides the strongest signal that an organization is learning from its failures rather than merely surviving them.
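Recurrence tracking can start very simply. The sketch below assumes each incident has been tagged with a class label during its review; the (incident_id, incident_class) record shape and the example labels are hypothetical, not a standard taxonomy.

```python
from collections import Counter


def recurring_classes(incidents):
    """Return {incident_class: count} for classes seen more than once.

    `incidents` is an iterable of (incident_id, incident_class) pairs.
    """
    counts = Counter(incident_class for _, incident_class in incidents)
    return {cls: n for cls, n in counts.items() if n > 1}


# Illustrative data only; these are not real incidents.
history = [
    ("INC-1", "expired-certificate"),
    ("INC-2", "capacity-exhaustion"),
    ("INC-3", "expired-certificate"),
]
repeats = recurring_classes(history)
```

Any class that appears in the result points to a post-incident action that did not remove the underlying cause.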