Chapter 40: Root Causes

After an incident is mitigated, the most important work begins: understanding why it happened. Root cause analysis goes beyond identifying what broke to understanding the deeper systemic conditions that allowed the failure to occur. A server running out of disk space is a proximate cause; the root cause might be that no capacity alerting existed, or that log rotation was never configured for a new service.

The "five whys" technique is a simple but effective method: ask why the failure occurred, then ask why that condition existed, and repeat until you reach a systemic cause. The chain rarely ends in a single root cause. Most incidents have multiple contributing factors (a latent bug, a configuration gap, a monitoring blind spot) that align to produce a failure. Identifying all contributing factors is more valuable than pinpointing one "root cause" to blame.
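As a rough illustration, a five-whys walk can be recorded as a chain of questions and answers, with each systemic answer flagged as a contributing factor. The sketch below is one possible shape, not a standard tool: the class names `Why` and `FiveWhys` and the `systemic` flag are hypothetical. It reuses the disk-space example from earlier in the chapter, and the final answer in the chain is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Why:
    """One step in the chain: the question asked and the answer found."""
    question: str
    answer: str
    systemic: bool = False  # True when the answer names a process or system gap


@dataclass
class FiveWhys:
    """Hypothetical record of a five-whys walk for a single incident."""
    symptom: str
    steps: List[Why] = field(default_factory=list)

    def ask(self, answer: str, systemic: bool = False) -> None:
        # Each question asks why the previous answer (or the symptom) was true.
        previous = self.steps[-1].answer if self.steps else self.symptom
        self.steps.append(Why(f"Why: {previous}?", answer, systemic))

    def contributing_factors(self) -> List[str]:
        # Collect every systemic answer, not only the last one in the chain.
        return [step.answer for step in self.steps if step.systemic]


chain = FiveWhys("the server ran out of disk space")
chain.ask("logs from a new service filled the disk")
chain.ask("log rotation was never configured for that service", systemic=True)
# The next answer is an invented example of a deeper systemic gap.
chain.ask("the launch checklist for new services did not cover log rotation",
          systemic=True)

for step in chain.steps:
    print(step.question, "->", step.answer)
print(chain.contributing_factors())
```

Returning a list from `contributing_factors` rather than a single value is deliberate: it keeps the record from collapsing into one "root cause".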

Blame is the enemy of learning. If engineers fear punishment for mistakes, they will hide information, and the organization loses its ability to learn from failures. Blameless post-incident reviews focus on the system, not the individual. The question is never "who caused this?" but "what conditions allowed this to happen, and how do we change the system so it is far less likely to happen again?" This mirrors the philosophy behind the faults and outages chapters: failures are inevitable, and resilience comes from how we respond to them.

Documenting findings is essential. A well-written post-incident review captures the timeline, the contributing factors, the impact, and the corrective actions. These documents become part of the organization's institutional memory, allowing future engineers to learn from past incidents without having to experience them firsthand.
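One way to keep reviews consistent is to give them a fixed structure. The sketch below models the four parts named above (timeline, contributing factors, impact, corrective actions) as a small record with a completeness check. All class and field names are hypothetical, and the incident details and timestamps are made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEntry:
    at: datetime
    event: str


@dataclass
class CorrectiveAction:
    description: str
    owner: str  # a team or role responsible for the fix, not a person to blame
    done: bool = False


@dataclass
class PostIncidentReview:
    """Hypothetical structure for a blameless post-incident review."""
    title: str
    impact: str = ""
    timeline: List[TimelineEntry] = field(default_factory=list)
    contributing_factors: List[str] = field(default_factory=list)
    corrective_actions: List[CorrectiveAction] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        # A review is complete only when every section has content.
        sections = {
            "impact": self.impact.strip(),
            "timeline": self.timeline,
            "contributing_factors": self.contributing_factors,
            "corrective_actions": self.corrective_actions,
        }
        return [name for name, value in sections.items() if not value]

    def open_actions(self) -> List[str]:
        # Corrective actions that have not been completed yet.
        return [a.description for a in self.corrective_actions if not a.done]


# Illustrative data only; the times and details below are invented.
review = PostIncidentReview(title="Disk full on a new service host")
draft_gaps = review.missing_sections()

review.impact = "Service unavailable until disk space was freed"
review.timeline.append(TimelineEntry(datetime(2024, 1, 1, 9, 0), "Disk reached capacity"))
review.timeline.append(TimelineEntry(datetime(2024, 1, 1, 9, 40), "Old logs removed, service restored"))
review.contributing_factors += [
    "No capacity alerting existed",
    "Log rotation was never configured for the new service",
]
review.corrective_actions.append(CorrectiveAction("Add disk usage alerts", owner="platform team"))
review.corrective_actions.append(CorrectiveAction("Configure log rotation", owner="service team", done=True))

print(draft_gaps)
print(review.missing_sections())
print(review.open_actions())
```

A check like `missing_sections` could run before a review is published, so that no document enters the archive without its corrective actions.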