Chapter 20: Degradation
When demand exceeds capacity, a system has two choices: fail completely or degrade gracefully. Graceful degradation means continuing to provide reduced functionality rather than returning errors to all users. It is the difference between a slow website and an unreachable one.
Degradation strategies include load shedding (rejecting a fraction of requests to protect the rest), feature reduction (disabling expensive features to reduce resource consumption), and priority queuing (serving high-priority requests before low-priority ones). Each strategy trades some functionality for continued availability.
In our system, the caching service provides a natural degradation path for the storage service. If the storage service becomes overloaded, the caching service can serve stale data rather than letting requests fail entirely. The routing service can stop sending traffic to overloaded backends, spreading load to healthier instances.
Degradation must be tested. A degradation strategy that has never been exercised is a degradation strategy that does not work. Chaos engineering — deliberately injecting failures in production — is the practice of verifying that degradation mechanisms work as designed before they are needed in earnest.