Chapter 38: Detection

The first step in managing any incident is knowing that something is wrong. Detection is the bridge between a silent failure and an active response. The faster an organization detects an incident, the smaller the blast radius and the shorter the recovery time. Detection latency — the time between a problem starting and someone being alerted — is one of the most important reliability metrics a team can track.
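
As a rough illustration of tracking detection latency, the sketch below computes it per incident and takes the median across incidents. The `Incident` record, its field names, and the sample timestamps are assumptions made for this example, not part of any particular tool; in practice the start time is often only established during post-incident review.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative assumptions.
    started_at: datetime  # when the problem began
    alerted_at: datetime  # when someone was first alerted


def detection_latency(incident: Incident) -> timedelta:
    """Time between the problem starting and someone being alerted."""
    return incident.alerted_at - incident.started_at


def median_detection_latency(incidents: list[Incident]) -> timedelta:
    """Median detection latency across a set of incidents."""
    seconds = [detection_latency(i).total_seconds() for i in incidents]
    return timedelta(seconds=median(seconds))


# Made-up sample data for illustration only.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 4)),
    Incident(datetime(2024, 2, 11, 22, 30), datetime(2024, 2, 11, 22, 42)),
    Incident(datetime(2024, 3, 2, 3, 15), datetime(2024, 3, 2, 3, 17)),
]

print(median_detection_latency(incidents))
```

The median is used here rather than the mean so that a single slow detection does not dominate the figure; a team might reasonably track a high percentile as well.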

Automated detection relies on the signals collected by the monitoring service: heartbeats, latency percentiles, error rates, saturation metrics, and business-level indicators such as order completion rates. Alert thresholds define when these signals cross from normal variation into actionable territory. Thresholds that are too sensitive produce alert fatigue; thresholds that are too lenient let incidents go unnoticed. Anomaly detection, which uses statistical models that learn normal patterns and flag deviations, can complement static thresholds by catching novel failure modes.
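
The two approaches can be sketched side by side. The example below is a minimal illustration, not a description of how any specific monitoring product works: the function names, the requirement of three consecutive breaches, and the z-score limit of 3.0 are all assumptions chosen for the example. The z-score check stands in for "statistical models that learn normal patterns"; real anomaly detectors are usually more sophisticated (for instance, accounting for seasonality).

```python
from statistics import mean, stdev


def breaches_threshold(samples, threshold, min_consecutive=3):
    """Static threshold: fire only if the most recent `min_consecutive`
    samples all exceed `threshold`. Requiring a sustained breach is one
    common way to reduce noise from brief spikes."""
    if len(samples) < min_consecutive:
        return False
    return all(s > threshold for s in samples[-min_consecutive:])


def is_anomalous(history, value, z_limit=3.0):
    """Simple anomaly check: flag `value` if it lies more than `z_limit`
    sample standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate normal variation
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # history is constant; any change is a deviation
    return abs(value - mu) / sigma > z_limit


# Error rates as fractions of requests (made-up values).
error_rates = [0.01, 0.06, 0.07, 0.08]
print(breaches_threshold(error_rates, threshold=0.05))

# Latency in milliseconds (made-up values).
latency_history = [100, 102, 98, 101, 99]
print(is_anomalous(latency_history, 150))
```

In this sketch the static check encodes a known failure mode (error rate above 5%), while the anomaly check needs no preset limit and reacts to any large departure from recent behavior.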

Not all incidents are caught by automated systems. User reports, support tickets, and social media mentions are valuable detection channels, especially for problems that affect the user experience in ways that internal metrics do not capture. A robust detection strategy combines automated monitoring with human observation, so that failures missed by one channel are likely to be caught by the other.

On-call engineers are the human link in the detection chain. When an alert fires, the on-call responder must acknowledge it, assess whether it represents a real incident, and decide on next steps. The routing service can automatically shift traffic away from unhealthy backends while the responder investigates, buying time without requiring immediate human intervention.
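
One way such automatic traffic shifting might work is sketched below. This is an assumption-laden illustration rather than the routing service's actual logic: the `Backend` type, the helper names, and the limit of three consecutive failed health checks are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    # Hypothetical backend record for illustration.
    name: str
    consecutive_failures: int = 0


def record_health_check(backend: Backend, ok: bool) -> None:
    """Reset the failure count on success, increment it on failure."""
    backend.consecutive_failures = 0 if ok else backend.consecutive_failures + 1


def routable_backends(backends: list[Backend], max_failures: int = 3) -> list[Backend]:
    """Return the backends that should currently receive traffic."""
    healthy = [b for b in backends if b.consecutive_failures < max_failures]
    # Fail open: if every backend looks unhealthy, keep routing to all of
    # them rather than dropping all traffic on a possibly faulty signal.
    return healthy or list(backends)


a, b = Backend("a"), Backend("b")
for _ in range(3):
    record_health_check(b, ok=False)

print([backend.name for backend in routable_backends([a, b])])
```

Requiring several consecutive failures before removing a backend trades a little detection speed for stability, and the fail-open fallback reflects the idea that automation should buy the responder time without making the outage worse.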