Chapter 14: Monitoring

A distributed system that you cannot observe is a distributed system you cannot operate. When something goes wrong (and in a planetary-scale computer, something is always going wrong somewhere), operators need to know what is happening, where it is happening, and ideally why it is happening. A monitoring service collects, stores, and exposes health and performance data from every service in the system.

Monitoring serves two audiences. For humans, it provides dashboards, alerts, and diagnostic data to understand system behavior and respond to incidents. For machines, it provides health signals that enable automated actions like load shedding, failover, and auto-scaling. The routing service, for example, can use health signals from monitoring to avoid sending traffic to unhealthy servers.
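As a rough illustration of the machine-facing side, the sketch below shows how a router might filter candidate servers by reported status. The function name `routable`, the map of statuses, and the literal status string "healthy" are assumptions made for this example; the chapter itself only fixes the string "unhealthy".

```rust
use std::collections::HashMap;

/// Hypothetical helper: keeps only the candidates whose reported status is
/// "healthy". Servers the monitor has never heard from are excluded too.
fn routable<'a>(candidates: &[&'a str], statuses: &HashMap<String, String>) -> Vec<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|name| {
            statuses
                .get(*name)
                .map_or(false, |status| status.as_str() == "healthy")
        })
        .collect()
}
```

Treating unknown servers as unroutable is a conservative choice; a router could equally decide to give unknown servers the benefit of the doubt.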

Interface

monitoring/src/lib.rs

The monitoring service exposes four procedures. The report procedure accepts metric data points from services. The heartbeat procedure accepts health status updates. The query procedure retrieves metric time series, and the health procedure returns the health status of all known services.

pub const REPORT_PROCEDURE: ProcedureId = 1;
pub const HEARTBEAT_PROCEDURE: ProcedureId = 2;
pub const QUERY_PROCEDURE: ProcedureId = 3;
pub const HEALTH_PROCEDURE: ProcedureId = 4;

#[derive(Debug, Serializable, Deserializable)]
pub struct ReportArgs {
    pub service: String,
    pub metric: String,
    pub value: i32,
}

#[derive(Debug, Serializable, Deserializable)]
pub struct HeartbeatArgs {
    pub service: String,
    pub status: String,
}

#[derive(Debug, Serializable, Deserializable)]
pub struct QueryArgs {
    pub service: String,
    pub metric: String,
}

#[derive(Debug, Serializable, Deserializable)]
pub struct QueryResult {
    pub values: String,
}

#[derive(Debug, Serializable, Deserializable)]
pub struct HealthArgs {
    pub placeholder: i32,
}

#[derive(Debug, Serializable, Deserializable)]
pub struct HealthResult {
    pub services: String,
}

The ReportArgs structure identifies what is being measured with a generic service and metric pair and carries the measurement as an integer value. This simple schema can represent a wide variety of metrics, including request counts, latencies, queue depths, and cache hit rates, as long as fractional quantities are first scaled to an integer unit (for example, a hit rate expressed in percent).
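Because the value field is an i32, a reporting service has to choose a unit before sending a measurement. The following is a minimal sketch of two such conventions. Both helper names and both units (whole milliseconds, basis points) are assumptions made for illustration and are not part of the chapter's interface.

```rust
use std::time::Duration;

/// Hypothetical helper: encodes a latency as whole milliseconds,
/// saturating at i32::MAX instead of overflowing.
fn latency_millis(latency: Duration) -> i32 {
    latency.as_millis().min(i32::MAX as u128) as i32
}

/// Hypothetical helper: encodes a hit rate in basis points
/// (10_000 means 100%). An empty sample is reported as 0.
fn hit_rate_basis_points(hits: u64, total: u64) -> i32 {
    if total == 0 {
        return 0;
    }
    // Clamp so inconsistent counters can never report more than 100%.
    let hits = hits.min(total);
    ((hits as u128 * 10_000) / total as u128) as i32
}
```

Whatever unit is chosen, the reader of the metric has to know it, so in practice the unit often ends up in the metric name (for example, latency_ms).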

Implementation

monitoring/src/main.rs

The monitoring server maintains two data structures: a health registry tracking the status and last heartbeat time of each service, and a metrics store holding rolling windows of reported values.

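use std::collections::HashMap;
use std::time::{Duration, Instant};

// A service is considered stale after this long without a heartbeat.
const HEARTBEAT_TIMEOUT: Duration = Duration::from_secs(30);
// Maximum number of recent values kept per metric.
const MAX_METRIC_WINDOW: usize = 100;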

struct ServiceHealth {
    status: String,
    last_heartbeat: Instant,
}

struct MonitoringState {
    health: HashMap<String, ServiceHealth>,
    metrics: HashMap<String, Vec<i32>>,
}

The heartbeat handler updates the health status and timestamp for a service. Services are expected to send heartbeats periodically; if a service misses its heartbeat window, the monitoring system marks it as unhealthy:

pub async fn heartbeat(
    payload: &str,
    state: &mut MonitoringState,
) -> Response {
    // Note: expect panics on a malformed payload; a hardened handler would
    // return an error response instead of taking the server down.
    let args = HeartbeatArgs::deserialize(payload)
        .expect("Failed to deserialize payload");
    // Insert or overwrite the entry so that both the status and the
    // timestamp reflect this heartbeat.
    state.health.insert(
        args.service,
        ServiceHealth {
            status: args.status,
            last_heartbeat: Instant::now(),
        },
    );
    Response { payload: "OK".to_string() }
}

The report handler stores metric values in a rolling window. Each metric key is formed by combining the service name and metric name (e.g., storage:latency). The window holds the most recent 100 values, which provides enough data for computing statistics like averages and percentiles without unbounded memory growth:

pub async fn report(
    payload: &str,
    state: &mut MonitoringState,
) -> Response {
    let args = ReportArgs::deserialize(payload)
        .expect("Failed to deserialize payload");
    let key = format!("{}:{}", args.service, args.metric);
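    let values = state.metrics.entry(key).or_default();
    values.push(args.value);
    // Evict the oldest value once the window is full. Vec::remove(0) shifts
    // every remaining element, which is acceptable for a 100-element window.
    if values.len() > MAX_METRIC_WINDOW {
        values.remove(0);
    }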
    Response { payload: "OK".to_string() }
}
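The text mentions averages and percentiles over the window but does not show them. The sketch below is one possible way to compute both, using a VecDeque so that eviction is O(1). The type `RollingWindow` and its methods are illustrative names and are not part of the chapter's server, which stores a plain Vec<i32>.

```rust
use std::collections::VecDeque;

/// Illustrative fixed-capacity window over the most recent values.
struct RollingWindow {
    capacity: usize,
    values: VecDeque<i32>,
}

impl RollingWindow {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        RollingWindow {
            capacity,
            values: VecDeque::with_capacity(capacity),
        }
    }

    /// Appends a value, dropping the oldest one first if the window is full.
    fn push(&mut self, value: i32) {
        if self.values.len() == self.capacity {
            self.values.pop_front();
        }
        self.values.push_back(value);
    }

    /// Arithmetic mean of the window, or None if it is empty.
    fn mean(&self) -> Option<f64> {
        if self.values.is_empty() {
            return None;
        }
        // Sum in i64 so that 100 large i32 values cannot overflow.
        let sum: i64 = self.values.iter().map(|&v| i64::from(v)).sum();
        Some(sum as f64 / self.values.len() as f64)
    }

    /// Nearest-rank percentile for p in 1..=100, or None if the window is
    /// empty or p is out of range.
    fn percentile(&self, p: usize) -> Option<i32> {
        if self.values.is_empty() || p == 0 || p > 100 {
            return None;
        }
        let mut sorted: Vec<i32> = self.values.iter().copied().collect();
        sorted.sort_unstable();
        // Nearest rank: ceil(p / 100 * n), converted to a 0-based index.
        let rank = (p * sorted.len() + 99) / 100;
        Some(sorted[rank - 1])
    }
}
```

Sorting a copy on every percentile query is fine for a window of 100 values; a larger window would call for a different structure.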

A background task periodically checks for stale services — those that have not sent a heartbeat within the timeout period — and marks them as unhealthy. This is the monitoring system's primary mechanism for detecting failures:

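impl MonitoringState {
    // Marks every service whose last heartbeat is older than HEARTBEAT_TIMEOUT.
    fn check_stale_services(&mut self) {
        let now = Instant::now();
        for (service, health) in self.health.iter_mut() {
            let stale = now.duration_since(health.last_heartbeat) > HEARTBEAT_TIMEOUT;
            // Log only on the transition so repeated checks do not spam the output.
            if stale && health.status != "unhealthy" {
                println!("Service {} marked unhealthy (heartbeat timeout)", service);
                health.status = "unhealthy".to_string();
            }
        }
    }
}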
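The stale check reads the clock internally, which makes it awkward to exercise in a test. One possible variant, sketched below, takes the current time and the timeout as parameters and returns the services that changed state. The function name `mark_stale` and the status string "healthy" used in the usage note are assumptions; ServiceHealth is repeated here only to keep the example self-contained.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Mirrors the chapter's ServiceHealth so the example compiles on its own.
struct ServiceHealth {
    status: String,
    last_heartbeat: Instant,
}

/// Marks services whose last heartbeat is older than `timeout` as unhealthy
/// and returns the names that changed state during this pass, sorted.
fn mark_stale(
    health: &mut HashMap<String, ServiceHealth>,
    now: Instant,
    timeout: Duration,
) -> Vec<String> {
    let mut newly_unhealthy = Vec::new();
    for (service, entry) in health.iter_mut() {
        // Saturates to zero if `now` is earlier than the recorded heartbeat.
        let silence = now.saturating_duration_since(entry.last_heartbeat);
        if silence > timeout && entry.status != "unhealthy" {
            entry.status = "unhealthy".to_string();
            newly_unhealthy.push(service.clone());
        }
    }
    newly_unhealthy.sort();
    newly_unhealthy
}
```

With a 30-second timeout, a service last heard from 31 seconds before `now` is returned once and then stays quiet on later passes, while a service heard from 6 seconds ago is left untouched.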

Design Discussion

The heartbeat pattern is a simple and effective way to detect service failures. Each service periodically sends an “I'm alive” message to the monitoring system. If the monitoring system doesn't hear from a service within a timeout period, it assumes the service has failed. The timeout must be tuned carefully: too short and healthy services might be marked unhealthy due to momentary network delays; too long and actual failures take too long to detect.
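One common way to reason about this trade-off is to derive the timeout from the heartbeat interval and the number of lost heartbeats that should be tolerated. The sketch below does that arithmetic. The function names, the slack parameter, and the 10-second interval in the usage note are assumptions, since the chapter does not state how often services send heartbeats.

```rust
use std::time::Duration;

/// Hypothetical helper: a timeout that tolerates `missed_allowed` consecutive
/// lost heartbeats, plus some slack for network jitter.
fn heartbeat_timeout(interval: Duration, missed_allowed: u32, slack: Duration) -> Duration {
    interval * (missed_allowed + 1) + slack
}

/// Upper bound on detection delay: a service that dies right after sending a
/// heartbeat is flagged by the first stale check that runs after the timeout.
fn worst_case_detection(timeout: Duration, check_period: Duration) -> Duration {
    timeout + check_period
}
```

For example, a 10-second interval with two tolerated misses and no slack gives a 30-second timeout, which matches HEARTBEAT_TIMEOUT. If the stale check runs every 5 seconds, a failure is detected at most 35 seconds after the last heartbeat.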

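The rolling metric window is a compromise between memory efficiency and data retention. A fixed window of 100 values provides enough data for basic statistics while bounding memory usage, but it is bounded by count rather than by time, so a frequently reported metric covers a much shorter time span than a rarely reported one. Production monitoring systems such as Prometheus use more sophisticated time-series storage with configurable retention periods, and some long-term storage layers also downsample older data.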
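To make the idea of downsampling concrete, the sketch below replaces each run of consecutive values with its average, trading resolution for space. This is a deliberately simple illustration with an assumed function name. It is not how any particular production system stores data.

```rust
/// Replaces each run of `factor` consecutive values with its mean.
/// The mean uses integer division, so it rounds toward zero. A shorter
/// final run is averaged over its actual length.
fn downsample(values: &[i32], factor: usize) -> Vec<i32> {
    assert!(factor > 0, "factor must be positive");
    values
        .chunks(factor)
        .map(|chunk| {
            // Sum in i64 so a chunk of large values cannot overflow.
            let sum: i64 = chunk.iter().map(|&v| i64::from(v)).sum();
            (sum / chunk.len() as i64) as i32
        })
        .collect()
}
```

Averaging hides spikes, so a system that cares about peaks would typically keep a minimum and maximum per bucket as well.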

A useful architectural guideline is to pick either a push or a pull model as the primary collection path rather than mixing the two. Our implementation uses a push model: services send metrics and heartbeats to the monitoring system. In the alternative pull model (used by Prometheus), the monitoring system actively scrapes metrics from each service. Push is simpler for services, but a service that disappears simply goes silent, so the monitor can only infer failure from a missed heartbeat timeout. Pull makes failure detection a by-product of collection, since a failed scrape is itself a signal, but it requires the monitoring system to know about, or discover, every service it must scrape.
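For contrast with the push-based handlers in this chapter, here is a minimal sketch of a single pull cycle. The trait `ScrapeTarget`, the stand-in `StaticTarget`, the function `scrape_all`, and the status string "healthy" are all assumptions made for this example. A real scraper would issue network requests with timeouts rather than call a local method.

```rust
use std::collections::HashMap;

/// Hypothetical abstraction over something the monitor can scrape.
trait ScrapeTarget {
    /// Returns the current metric value, or an error if the target is unreachable.
    fn scrape(&self) -> Result<i32, String>;
}

/// Stand-in target for illustration: always answers with the stored result.
struct StaticTarget(Result<i32, String>);

impl ScrapeTarget for StaticTarget {
    fn scrape(&self) -> Result<i32, String> {
        self.0.clone()
    }
}

/// One pull cycle: scrape every known target and derive its health from
/// whether the scrape succeeded.
fn scrape_all(targets: &HashMap<String, Box<dyn ScrapeTarget>>) -> HashMap<String, String> {
    targets
        .iter()
        .map(|(name, target)| {
            let status = match target.scrape() {
                Ok(_) => "healthy",
                Err(_) => "unhealthy",
            };
            (name.clone(), status.to_string())
        })
        .collect()
}
```

Note that the set of targets is an input here: a service missing from the map is never scraped, which is the cost of the pull model described above.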

The health procedure returns all service statuses in a single response, making it easy for other systems (like the frontend dashboard) to display a comprehensive view of system health. This aggregation is a common pattern in monitoring systems and forms the basis for status pages and operational dashboards.
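The chapter does not show how the health response is encoded, only that HealthResult carries a single String. As one possible encoding, the sketch below renders all statuses as comma-separated name=status pairs sorted by service name. The function name `render_health` and the format itself are assumptions.

```rust
use std::collections::HashMap;

/// Renders every known service as "name=status", sorted by name and joined
/// with commas. The wire format is an assumption made for illustration.
fn render_health(statuses: &HashMap<String, String>) -> String {
    let mut entries: Vec<(&String, &String)> = statuses.iter().collect();
    // Sort so the output is stable regardless of HashMap iteration order.
    entries.sort();
    entries
        .iter()
        .map(|(name, status)| format!("{}={}", name, status))
        .collect::<Vec<String>>()
        .join(",")
}
```

A stable ordering makes the response easier to diff and to test. This simple format would break if a service name contained a comma or an equals sign, so a structured encoding would be a safer choice in practice.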