Chapter 41: Remediation
Remediation is the work of restoring a system to full health after an incident. It operates on three timescales: immediate mitigation to stop the bleeding, short-term fixes to stabilize the system, and long-term corrective actions to address root causes. Each timescale requires different trade-offs between speed and thoroughness.
Immediate mitigation prioritizes availability over perfection. Rolling back a bad deployment, failing over to a healthy replica, shedding non-critical load, or disabling a misbehaving feature with a feature flag: each of these actions restores service quickly even when the root cause is not yet understood. Graceful degradation and load balancing are essential tools in the mitigation toolkit. A well-practiced team can often mitigate an incident within minutes.
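Two of the mitigation levers named above, disabling a feature and shedding non-critical load, can be sketched as a small in-memory control object. This is a minimal illustration, not a production design: the class name MitigationControls, the numeric priority scheme, and the feature names are all hypothetical, and a real system would typically back these switches with a shared flag or configuration service.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MitigationControls:
    """Hypothetical runtime switches an on-call engineer can flip during an incident."""

    # Features that have been switched off (a simple kill-switch set).
    disabled_features: set[str] = field(default_factory=set)
    # Requests with a priority below this threshold are rejected.
    # 0 means nothing is shed (assuming priorities are non-negative).
    shed_below_priority: int = 0

    def disable_feature(self, name: str) -> None:
        """Turn a misbehaving feature off without redeploying."""
        self.disabled_features.add(name)

    def is_enabled(self, name: str) -> bool:
        return name not in self.disabled_features

    def should_serve(self, priority: int) -> bool:
        """Return True if a request of this priority should still be handled."""
        return priority >= self.shed_below_priority


controls = MitigationControls()
controls.disable_feature("recommendations")  # hypothetical feature name
controls.shed_below_priority = 2             # keep only priority 2 and above
```

Both switches are reversible and need no understanding of the root cause, which is the property that makes them suitable for the first minutes of an incident.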
Short-term fixes address the immediate technical cause. If a memory leak crashed a service, the short-term fix patches the leak. If a configuration change caused a cascading failure, the short-term fix reverts the configuration and adds validation. These fixes are deployed through the normal release process, with extra scrutiny given the recent incident.
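The "revert the configuration and add validation" step can be sketched as a guard that refuses to apply a proposed configuration unless it passes explicit checks. The keys timeout_seconds and max_connections, the limit of 10000, and both function names are assumptions chosen for illustration; the actual checks depend on what the incident revealed.

```python
from __future__ import annotations


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []

    timeout = config.get("timeout_seconds")
    # bool is a subclass of int in Python, so reject it explicitly.
    if (
        not isinstance(timeout, (int, float))
        or isinstance(timeout, bool)
        or timeout <= 0
    ):
        errors.append("timeout_seconds must be a positive number")

    max_conn = config.get("max_connections")
    if (
        not isinstance(max_conn, int)
        or isinstance(max_conn, bool)
        or not 1 <= max_conn <= 10000  # hypothetical upper bound
    ):
        errors.append("max_connections must be an integer between 1 and 10000")

    return errors


def apply_config(current: dict, proposed: dict) -> tuple[dict, list[str]]:
    """Return the config that should be active, plus any validation errors.

    If the proposed config is invalid, the current one stays in effect.
    """
    errors = validate_config(proposed)
    if errors:
        return current, errors
    return proposed, []
```

For example, apply_config({"timeout_seconds": 5, "max_connections": 100}, {"timeout_seconds": 0, "max_connections": 100}) keeps the first configuration active and reports one error, so a bad value is rejected before it can reach production.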
Long-term corrective actions come from the post-incident review and target the systemic conditions that allowed the incident to occur. These might include adding monitoring for a previously unobserved failure mode, improving capacity planning, or redesigning a component to eliminate a class of failures. Action items must be tracked to completion, because an action item that is assigned but never finished provides no protection against recurrence. Verifying through testing and monitoring that repairs actually work closes the remediation loop.
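Tracking to completion can be made mechanical. The sketch below assumes a hypothetical ActionItem record in which an item counts as closed only when it is both completed and verified, reflecting the point that an unverified repair does not close the loop. The field names and the overdue_items helper are illustrative and not taken from any particular tracking tool.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """Hypothetical record for one corrective action from a post-incident review."""

    title: str
    owner: str
    due: date
    completed_on: Optional[date] = None
    verified: bool = False  # set once testing or monitoring confirms the fix

    def is_closed(self) -> bool:
        # Finished work that has not been verified is still treated as open.
        return self.completed_on is not None and self.verified


def overdue_items(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return items that are not closed and whose due date has passed."""
    return [item for item in items if not item.is_closed() and item.due < today]


items = [
    ActionItem("Add alert for queue depth", "alice", date(2024, 3, 1),
               completed_on=date(2024, 2, 20), verified=True),
    ActionItem("Redesign retry logic", "bob", date(2024, 3, 1),
               completed_on=date(2024, 2, 25)),  # done but not yet verified
    ActionItem("Update capacity model", "carol", date(2024, 6, 1)),
]
late = overdue_items(items, today=date(2024, 4, 1))
```

Here late contains only the retry-logic item: it was completed on time but never verified, so it still surfaces in the overdue report.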