Chapter 27: Faults

In a planetary-scale computer, faults are not exceptional events; they are the norm. With millions of components, something is always failing somewhere: a disk is developing bad sectors, a network switch is dropping packets, a server is running out of memory, a software bug is causing a crash. The question is not whether faults will occur but how the system responds when they do.

Faults can be classified by how long they last and whether they recur. Transient faults are short-lived: a momentary network glitch, a garbage collection pause, a brief CPU spike. These are best handled with retries, because the fault typically resolves itself. Intermittent faults recur unpredictably: a flaky disk, a network link with packet loss, a service with a memory leak. Retrying only masks them, so they require detection and remediation. Permanent faults persist until someone or something intervenes: a dead disk, a failed server, a corrupted dataset. These require replacement or recovery.
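The retry strategy for transient faults can be sketched in a few lines. This is a minimal illustration, not a description of any particular system: the TransientFault exception, the retry function, and its parameters are hypothetical names chosen for the example, and capped exponential backoff with random jitter is one common policy among several.

```python
import random
import time


class TransientFault(Exception):
    """Hypothetical marker for faults that are expected to clear on their own."""


def retry(operation, max_attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying on TransientFault with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientFault:
            if attempt == max_attempts:
                # Retry budget exhausted: the fault is probably not transient,
                # so surface it to the caller for detection and remediation.
                raise
            # Wait a random time up to a bound that doubles on each attempt,
            # so that many clients do not retry in lockstep.
            bound = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, bound))
```

Only exceptions explicitly marked as transient are retried. Anything else propagates immediately, which keeps intermittent and permanent faults visible instead of hiding them behind repeated attempts.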

Fault tolerance is built through redundancy at every level. Data is replicated across multiple servers (as in our consensus system). Services run on multiple machines (as managed by our discovery service). Network paths are duplicated. Power supplies have backups. The goal is to ensure that no single fault, and ideally no combination of two concurrent faults, causes a user-visible failure.
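A little arithmetic shows what the goal of surviving one or two concurrent faults costs in replicas. The sketch below rests on assumptions that are not stated in this chapter: faults are crash faults, replicas fail independently, and the function names are hypothetical. Plain replication needs f + 1 copies to survive f faults, while a majority-quorum consensus protocol needs 2f + 1 so that a majority remains reachable.

```python
def replicas_needed(faults_tolerated, majority_quorum=False):
    """Replicas required to survive the given number of concurrent crash faults."""
    if faults_tolerated < 0:
        raise ValueError("faults_tolerated must be non-negative")
    if majority_quorum:
        # A majority must still be alive after the faults occur.
        return 2 * faults_tolerated + 1
    # At least one copy must survive.
    return faults_tolerated + 1


def all_replicas_down_probability(per_replica_failure_probability, replicas):
    """Chance that every replica is down at once, ASSUMING independent failures."""
    return per_replica_failure_probability ** replicas
```

For example, replicas_needed(2) gives 3 and replicas_needed(2, majority_quorum=True) gives 5. The independence assumption is optimistic: replicas that share a rack, a power feed, or a software version tend to fail together, which is why redundancy has to be applied at every level rather than only one.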

Detection is as important as tolerance. Our monitoring service detects faults through heartbeat timeouts and metric anomalies. The faster a fault is detected, the faster it can be mitigated, whether by automated failover or human intervention.
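Heartbeat-timeout detection can be illustrated with a small sketch. It is not the monitoring service's actual interface: the HeartbeatMonitor class, its method names, and the injectable clock are assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Hypothetical detector: a node is suspected once its heartbeats stop arriving."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # Injectable so the detector can be tested without waiting.
        self.last_seen = {}  # node id -> time of the most recent heartbeat

    def record_heartbeat(self, node_id):
        """Note that node_id reported in just now."""
        self.last_seen[node_id] = self.clock()

    def suspected_faulty(self):
        """Return the sorted ids of nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node_id
            for node_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

The timeout embodies the trade-off described above. A short timeout detects faults quickly but may flag a node that is merely slow, for instance during a garbage collection pause, while a long timeout avoids false alarms at the cost of slower mitigation.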