Chapter 21: Load Balancing

When multiple servers can handle the same request, a load balancer decides which server should receive each request. Effective load balancing ensures that no single server becomes a bottleneck while others sit idle. It is a fundamental technique for achieving both scalability and reliability in distributed systems.

Gateway Load Balancer

Our system's entry point is a gateway load balancer that sits in front of multiple frontend instances. It maintains a pool of backends, each tracked with its health status and active connection count:

See loadbalancer/src/main.rs for the full implementation.

struct Backend {
    address: String,
    healthy: bool,
    active_connections: usize,
}

struct LoadBalancer {
    backends: Vec<Backend>,
    strategy: String,
    next_index: usize,
}

The load balancer discovers its backends dynamically. Every five seconds, a background task calls discovery::list("frontend") to refresh the backend list. New backends are added automatically; backends that disappear from discovery are removed.

Balancing Strategies

The gateway supports four strategies, selectable at runtime via the /__lb_strategy endpoint or the STRATEGY environment variable:

Round-robin distributes requests sequentially, skipping unhealthy backends. It works well when servers and requests are homogeneous:

// round-robin: increment next_index, skip unhealthy
let start = self.next_index;
let total = self.backends.len();
for offset in 0..total {
    let idx = (start + offset) % total;
    if self.backends[idx].healthy {
        self.next_index = (idx + 1) % total;
        return Some(idx);
    }
}

Least-connections sends each request to the healthy backend with the fewest active connections. This naturally accounts for heterogeneous request costs — slow requests keep a connection open longer, directing traffic elsewhere:

// least-connections: pick the healthy backend with fewest active
let mut best = healthy[0];
for &i in &healthy {
    if self.backends[i].active_connections
        < self.backends[best].active_connections
    {
        best = i;
    }
}
Some(best)

Random selects a healthy backend uniformly at random. Simple and stateless, but it can produce uneven distribution with small backend counts.

Power of two random choices (pick-2) is a particularly elegant algorithm: pick two random healthy backends and choose the one with fewer active connections. Research shows this achieves exponentially better load distribution than pure random selection, with minimal coordination.

// pick-2: choose 2 random, take the less loaded
let a = healthy[rng.gen_range(0..healthy.len())];
let mut b = a;
while b == a {
    b = healthy[rng.gen_range(0..healthy.len())];
}
if backends[a].active_connections <= backends[b].active_connections {
    Some(a)
} else {
    Some(b)
}

Health Checking

A background loop probes each backend every three seconds with a TCP connection attempt. Backends that fail the probe are marked unhealthy and excluded from selection. When they recover, they are automatically re-included:

// Health check loop: probe each backend every 3 seconds
let health_lb = Arc::clone(&lb);
tokio::spawn(async move {
    loop {
        sleep(HEALTH_CHECK_INTERVAL).await;
        let mut lb = health_lb.lock().await;
        for backend in lb.backends.iter_mut() {
            let was_healthy = backend.healthy;
            backend.healthy =
                TcpStream::connect(&backend.address).await.is_ok();
            if was_healthy != backend.healthy {
                println!("Backend {} is now {}",
                    backend.address,
                    if backend.healthy { "healthy" } else { "unhealthy" });
            }
        }
    }
});

The health check is deliberately simple: a TCP connection attempt. If the connection succeeds, the backend is healthy. If it fails, it is marked unhealthy and excluded from select_backend until the next successful probe. The /__lb_status endpoint exposes the full state of the backend pool as JSON for the dashboard.

Backend Discovery

loadbalancer/src/main.rs The load balancer does not use a static configuration file. Instead, a background task calls discovery::list("frontend") every five seconds to discover which frontend instances are currently registered. New backends are added automatically; backends that have deregistered are removed:

fn refresh_backends(&mut self, addresses: &[String]) {
    // Add new backends
    for addr in addresses {
        if !self.backends.iter().any(|b| &b.address == addr) {
            self.backends.push(Backend {
                address: addr.clone(),
                healthy: true,
                active_connections: 0,
            });
        }
    }

    // Remove stale backends
    self.backends.retain(|b| addresses.contains(&b.address));
}

This dynamic discovery means the load balancer automatically adapts as the scheduler scales the frontend fleet up or down. No restarts or configuration changes are needed.

Active Connection Tracking

The load balancer tracks active connections per backend to support the least-connections and pick-2 strategies. When a backend is selected, its active_connections counter is incremented. When the proxied request completes (whether successfully or not), the counter is decremented:

// Select backend and increment active connections
let (backend_addr, backend_idx) = {
    let mut lb = lb.lock().await;
    match lb.select_backend() {
        Some(idx) => {
            lb.backends[idx].active_connections += 1;
            (lb.backends[idx].address.clone(), idx)
        }
        None => { /* return 503 */ }
    }
};

// ... proxy the request ...

// Decrement active connections when done
lb.lock().await.backends[backend_idx]
    .active_connections = lb.lock().await.backends[backend_idx]
    .active_connections.saturating_sub(1);

The saturating_sub prevents underflow if a backend is removed from the pool while a request is in flight. This bookkeeping is the foundation that makes load-aware strategies like least-connections and pick-2 effective.

Edge Load Balancing

Every data center in our system is both an edge and an origin. Each region (SFO, NYC, AMS) runs the full stack — frontends, services, storage — with a gateway load balancer at the entry point. With federated discovery, calling discovery::list("frontend") returns both local backends (127.0.0.1:808x) and remote backends from other regions (10.0.0.2:808x, 10.0.0.3:808x). The load balancer uses this information to make intelligent routing decisions.

Local Preference

Under normal load, the load balancer routes exclusively to local backends. A request entering the SFO gateway is served by SFO frontends, keeping latency minimal. Each backend is classified at discovery time:

The local flag is set when a backend is first added from discovery results. Local addresses start with 127.0.0.1; everything else is remote.

struct Backend {
    address: String,
    healthy: bool,
    active_connections: usize,
    local: bool,
}

fn is_local(addr: &str) -> bool {
    addr.starts_with("127.0.0.1")
}

Utilization-Based Shedding

When local backends approach capacity, the load balancer begins shedding traffic to remote regions. The threshold is based on connection utilization: when active connections across local backends exceed 80% of maximum capacity, remote backends are added to the selection pool:

// Calculate local utilization
let local_util = total_local_connections as f64
    / (local_count * MAX_CONNECTIONS_PER_BACKEND) as f64;

if local_util < SHED_THRESHOLD && !local_healthy.is_empty() {
    // Route to local backends only
    self.select_from(&local_healthy)
} else {
    // Shedding: include remote backends
    self.select_from(&all_eligible)
}

Once shedding activates, the configured strategy (round-robin, least-connections, pick-2) distributes requests across the combined pool of local and remote backends. As local utilization drops below the threshold, traffic returns to local-only routing automatically.

Manual Drain

The dashboard exposes drain and undrain controls per region. Draining a region excludes all its backends from selection — useful for rolling deployments or planned maintenance. The drain filter runs before any strategy logic:

// Filter out backends in drained regions
let eligible: Vec<usize> = self.backends.iter()
    .enumerate()
    .filter(|(_, b)| {
        let region = region_for_address(&b.address, &self.own_region);
        !self.drained_regions.contains(&region)
    })
    .map(|(i, _)| i)
    .collect();

Drain state is held in memory. A restart clears all drains, which is the correct behavior: if the LB restarts, all regions should start active.

Latency Trade-offs

Shedding to a remote region adds cross-region round-trip time: roughly 74ms between SFO and NYC, 164ms between SFO and AMS. This is a deliberate trade-off. A request served in 164ms is better than a request dropped because local backends are saturated. The shedding threshold (80%) provides a buffer — local backends still have 20% headroom for in-flight requests while remote backends absorb the overflow.

Service-Level Load Balancing

Load balancing also happens inside the system. The routing service maintains a multi-backend connection pool for each service it routes to. When a request arrives for a service like "storage", routing calls discovery::list("storage") to get all registered backends, then selects one using the same strategy set (round-robin, least-connections, random, or pick-2).

See routing/src/lib.rs for the ConnectionPool implementation.

struct BackendPool {
    address: String,
    connections: VecDeque<TcpStream>,
}

struct ConnectionPool {
    system_name: String,
    backends: Vec<BackendPool>,
    strategy: String,
    next_index: usize,
    max_per_backend: usize,
}

The routing pool refreshes backends from discovery every ten seconds, adding new backends and removing stale ones. Each backend maintains its own connection queue, so connections are reused efficiently and directed to the correct backend on release.

This two-layer approach — gateway balancing at the edge, service-level balancing internally — provides defense in depth. Even if the gateway uses simple round-robin, the routing layer can independently optimize traffic to individual services based on their characteristics.

Design Discussion

The choice of balancing strategy depends on the workload. Round-robin works well when requests are roughly equal in cost and servers are homogeneous. Least-connections adapts naturally to heterogeneous request costs: a slow request occupies a connection longer, directing subsequent traffic to less-loaded servers. Pick-2 strikes a balance, providing most of the benefit of least-connections with the simplicity of random selection.

A production load balancer would add several features beyond what we have implemented. Weighted backends allow servers of different capacities to receive proportional traffic. Session affinity (sticky sessions) ensures that requests from the same client reach the same backend, important for stateful applications. Circuit breaking removes backends that repeatedly fail rather than probing them indefinitely. Graceful draining stops sending new requests to a backend being decommissioned while allowing existing connections to complete.

The two-layer architecture — gateway balancing at the edge, service-level balancing inside the routing layer — provides defense in depth. Even if the gateway uses simple round-robin, the routing layer can independently optimize traffic to individual backend services. This separation also means the gateway can be replaced (for example, with nginx or HAProxy) without affecting internal routing.