Chapter 21: Load Balancing
When multiple servers can handle the same request, a load balancer decides which server should receive each request. Effective load balancing ensures that no single server becomes a bottleneck while others sit idle. It is a fundamental technique for achieving both scalability and reliability in distributed systems.
Gateway Load Balancer
Our system's entry point is a gateway load balancer that sits in front of multiple frontend instances. It maintains a pool of backends, each tracked with its health status and active connection count:
See loadbalancer/src/main.rs for the full implementation.
struct Backend {
    address: String,
    healthy: bool,
    active_connections: usize,
}
struct LoadBalancer {
    backends: Vec<Backend>,
    strategy: String,
    next_index: usize,
}
The load balancer discovers its backends dynamically. Every five seconds, a
background task calls discovery::list("frontend") to refresh the backend
list. New backends are added automatically; backends that disappear from
discovery are removed.
Balancing Strategies
The gateway supports four strategies, selectable at runtime via the
/__lb_strategy endpoint or the STRATEGY environment variable:
Round-robin distributes requests sequentially, skipping unhealthy backends. It works well when servers and requests are homogeneous:
// round-robin: increment next_index, skip unhealthy
let start = self.next_index;
let total = self.backends.len();
for offset in 0..total {
    let idx = (start + offset) % total;
    if self.backends[idx].healthy {
        // Resume after the chosen backend on the next call
        self.next_index = (idx + 1) % total;
        return Some(idx);
    }
}
// Every backend was unhealthy (or the pool is empty)
None
Least-connections sends each request to the healthy backend with the fewest active connections. This naturally accounts for heterogeneous request costs — slow requests keep a connection open longer, directing traffic elsewhere:
// least-connections: pick the healthy backend with fewest active
// Guard against an empty list: healthy[0] would panic otherwise
if healthy.is_empty() {
    return None;
}
let mut best = healthy[0];
for &i in &healthy {
    if self.backends[i].active_connections
        < self.backends[best].active_connections
    {
        best = i;
    }
}
Some(best)
Random selects a healthy backend uniformly at random. Simple and stateless, but it can produce uneven distribution with small backend counts.
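The random strategy has no excerpt of its own, so here is a minimal sketch of what it could look like. It assumes the same list of healthy indices used by the other strategies. The SplitMix64 generator and the select_random name are hypothetical, chosen so the example has no dependencies; the actual implementation presumably relies on the rand crate, as the pick-2 excerpt does.

```rust
/// Tiny deterministic generator (SplitMix64), used here only to avoid
/// an external dependency. Hypothetical stand-in for the `rand` crate.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Index in 0..n (n must be non-zero). The modulo introduces a
    /// negligible bias for the small n typical of a backend pool.
    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Pick one healthy backend index uniformly at random.
/// `healthy` holds indices into the backend list; returns None if empty.
fn select_random(healthy: &[usize], rng: &mut SplitMix64) -> Option<usize> {
    if healthy.is_empty() {
        return None;
    }
    Some(healthy[rng.below(healthy.len())])
}
```

Because the choice ignores active_connections entirely, the strategy needs no shared counters, which is what makes it stateless.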
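Power of two random choices (pick-2) is a particularly elegant algorithm: pick two random healthy backends and choose the one with fewer active connections. In the classic balls-into-bins model, assigning n requests to n servers uniformly at random leaves the busiest server with roughly log n / log log n requests, whereas taking the less loaded of two random candidates brings that down to roughly log log n. That is an exponential improvement in the worst-case load, and it requires no global view of the pool, only two counter reads per request.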
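// pick-2: choose 2 random, take the less loaded
match healthy.len() {
    0 => return None,
    // Only one candidate: the re-draw loop below would never terminate
    1 => return Some(healthy[0]),
    _ => {}
}
let a = healthy[rng.gen_range(0..healthy.len())];
let mut b = a;
// Re-draw until the second candidate differs from the first
while b == a {
    b = healthy[rng.gen_range(0..healthy.len())];
}
if backends[a].active_connections <= backends[b].active_connections {
    Some(a)
} else {
    Some(b)
}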
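To get a feel for how much the second choice helps, the following sketch simulates the simplest model: n requests assigned to n backends, with no request ever completing. The max_load function and the SplitMix64 generator are hypothetical helpers written for this illustration, not part of the load balancer. Unlike the excerpt, the two candidates are drawn with replacement and may coincide.

```rust
/// Tiny deterministic generator (SplitMix64); a dependency-free,
/// hypothetical stand-in for the `rand` crate.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Index in 0..n (n must be non-zero).
    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Assign `n` requests to `n` backends and return the largest
/// per-backend count. With `choices == 1` this is pure random
/// selection; with `choices == 2` it is pick-2.
fn max_load(n: usize, choices: usize, rng: &mut SplitMix64) -> usize {
    let mut load = vec![0usize; n];
    for _ in 0..n {
        // Sample the candidates and keep the least loaded one.
        let mut best = rng.below(n);
        for _ in 1..choices {
            let other = rng.below(n);
            if load[other] < load[best] {
                best = other;
            }
        }
        load[best] += 1;
    }
    load.into_iter().max().unwrap_or(0)
}
```

Calling max_load(10_000, 1, &mut rng) and max_load(10_000, 2, &mut rng) with the same generator should show a visibly smaller maximum for the second call. Exact numbers depend on the seed.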
Health Checking
A background loop probes each backend every three seconds with a TCP connection attempt. Backends that fail the probe are marked unhealthy and excluded from selection. When they recover, they are automatically re-included:
// Health check loop: probe each backend every 3 seconds
let health_lb = Arc::clone(&lb);
tokio::spawn(async move {
    loop {
        sleep(HEALTH_CHECK_INTERVAL).await;
        let mut lb = health_lb.lock().await;
        for backend in lb.backends.iter_mut() {
            let was_healthy = backend.healthy;
            backend.healthy =
                TcpStream::connect(&backend.address).await.is_ok();
            if was_healthy != backend.healthy {
                println!("Backend {} is now {}",
                    backend.address,
                    if backend.healthy { "healthy" } else { "unhealthy" });
            }
        }
    }
});
The health check is deliberately simple: a TCP connection attempt. If
the connection succeeds, the backend is healthy. If it fails, it is marked
unhealthy and excluded from select_backend until the next successful
probe. The /__lb_status endpoint exposes the full state of the
backend pool as JSON for the dashboard.
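One detail the excerpt leaves open is how long a single probe may block. A connection attempt to an unreachable host can wait for the operating system's own timeout, and the loop above holds the pool lock while each probe is awaited. The sketch below bounds the probe duration. It uses blocking standard-library sockets for brevity; in the async loop, wrapping the connect in tokio::time::timeout would play the same role. The probe function and the PROBE_TIMEOUT value are assumptions, not taken from the implementation.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Hypothetical upper bound on how long one probe may take.
const PROBE_TIMEOUT: Duration = Duration::from_millis(500);

/// Returns true if a TCP connection to `address` succeeds within the
/// timeout. An address that does not parse as `ip:port` counts as
/// unhealthy rather than causing an error.
fn probe(address: &str) -> bool {
    match address.parse::<SocketAddr>() {
        Ok(addr) => TcpStream::connect_timeout(&addr, PROBE_TIMEOUT).is_ok(),
        Err(_) => false,
    }
}
```

With a bound like this, one stalled backend delays the health loop by at most the timeout instead of an OS-dependent interval.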
Backend Discovery
See loadbalancer/src/main.rs.
The load balancer does not use a static configuration file. Instead, a
background task calls discovery::list("frontend") every five seconds
to discover which frontend instances are currently registered. New backends
are added automatically; backends that have deregistered are removed:
fn refresh_backends(&mut self, addresses: &[String]) {
    // Add new backends
    for addr in addresses {
        if !self.backends.iter().any(|b| &b.address == addr) {
            self.backends.push(Backend {
                address: addr.clone(),
                healthy: true,
                active_connections: 0,
            });
        }
    }
    // Remove stale backends
    self.backends.retain(|b| addresses.contains(&b.address));
}
This dynamic discovery means the load balancer automatically adapts as the scheduler scales the frontend fleet up or down. No restarts or configuration changes are needed.
Active Connection Tracking
The load balancer tracks active connections per backend to support
the least-connections and pick-2 strategies. When a backend is selected,
its active_connections counter is incremented. When the proxied
request completes (whether successfully or not), the counter is decremented:
// Select backend and increment active connections
let (backend_addr, backend_idx) = {
    let mut lb = lb.lock().await;
    match lb.select_backend() {
        Some(idx) => {
            lb.backends[idx].active_connections += 1;
            (lb.backends[idx].address.clone(), idx)
        }
        None => { /* return 503 */ }
    }
};
// ... proxy the request ...
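// Decrement active connections when done. Take the lock once: calling
// lock() on both sides of a single assignment would deadlock, because
// the first guard is still alive when the second lock() is awaited.
{
    let mut guard = lb.lock().await;
    if let Some(backend) = guard.backends.get_mut(backend_idx) {
        backend.active_connections =
            backend.active_connections.saturating_sub(1);
    }
}
The saturating_sub keeps the counter from wrapping below zero, and
get_mut avoids a panic if the pool has shrunk while the request was in
flight. Note that removing a backend shifts the indices of those after
it, so a saved index can end up pointing at a different backend; looking
the backend up by address would make the decrement exact. This
bookkeeping is the foundation that makes load-aware strategies like
least-connections and pick-2 effective.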
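One way to guarantee that the decrement runs on every exit path, including early returns and errors, is to tie it to a guard object's Drop implementation. The following is a sketch of that idea, not the chapter's implementation: ConnectionGuard and acquire are hypothetical names, the Backend struct is trimmed to the fields needed here, and a standard-library mutex stands in for the async one to keep the example self-contained.

```rust
use std::sync::{Arc, Mutex};

struct Backend {
    address: String,
    active_connections: usize,
}

/// Decrements the counter of the backend at `address` when dropped,
/// so success, error, and early-return paths all release the slot.
struct ConnectionGuard {
    pool: Arc<Mutex<Vec<Backend>>>,
    address: String,
}

impl Drop for ConnectionGuard {
    fn drop(&mut self) {
        // If the mutex is poisoned there is nothing sensible to update.
        if let Ok(mut backends) = self.pool.lock() {
            // Look up by address: indices may have shifted since selection.
            if let Some(b) = backends.iter_mut().find(|b| b.address == self.address) {
                b.active_connections = b.active_connections.saturating_sub(1);
            }
        }
    }
}

/// Increment the counter for the backend at `index` and return a guard
/// that undoes the increment. Returns None if the index is out of range.
fn acquire(pool: &Arc<Mutex<Vec<Backend>>>, index: usize) -> Option<ConnectionGuard> {
    let mut backends = pool.lock().ok()?;
    let backend = backends.get_mut(index)?;
    backend.active_connections += 1;
    Some(ConnectionGuard {
        pool: Arc::clone(pool),
        address: backend.address.clone(),
    })
}
```

The proxy code would hold the guard for the duration of the request and simply let it go out of scope. If the backend has been removed in the meantime, the lookup finds nothing and the drop is a no-op.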
Edge Load Balancing
Every data center in our system is both an edge
and an origin. Each region (SFO, NYC, AMS) runs the full stack —
frontends, services, storage — with a gateway load balancer at the entry
point. With federated
discovery, calling discovery::list("frontend") returns both local
backends (127.0.0.1:808x) and remote backends from other regions
(10.0.0.2:808x, 10.0.0.3:808x). The load balancer uses this
information to make intelligent routing decisions.
Local Preference
Under normal load, the load balancer routes exclusively to local backends. A request entering the SFO gateway is served by SFO frontends, keeping latency minimal. Each backend is classified at discovery time:
The local flag is set when a backend is first
added from discovery results. Local addresses start with
127.0.0.1; everything else is remote.
struct Backend {
    address: String,
    healthy: bool,
    active_connections: usize,
    local: bool,
}
fn is_local(addr: &str) -> bool {
    addr.starts_with("127.0.0.1")
}
Utilization-Based Shedding
When local backends approach capacity, the load balancer begins shedding traffic to remote regions. The threshold is based on connection utilization: when active connections across local backends exceed 80% of maximum capacity, remote backends are added to the selection pool:
// Calculate local utilization
let local_util = total_local_connections as f64
    / (local_count * MAX_CONNECTIONS_PER_BACKEND) as f64;
if local_util < SHED_THRESHOLD && !local_healthy.is_empty() {
    // Route to local backends only
    self.select_from(&local_healthy)
} else {
    // Shedding: include remote backends
    self.select_from(&all_eligible)
}
Once shedding activates, the configured strategy (round-robin, least-connections, random, or pick-2) distributes requests across the combined pool of local and remote backends. As local utilization drops below the threshold, traffic returns to local-only routing automatically.
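The threshold arithmetic is easier to follow with concrete numbers. The sketch below is an illustration under stated assumptions: the 80% threshold comes from the text, but the per-backend capacity of 100 is an invented value, eligible_pool is a hypothetical name, only healthy local backends count toward capacity, and drain filtering is left out.

```rust
/// Assumed capacity per backend; the real value is not given in the text.
const MAX_CONNECTIONS_PER_BACKEND: usize = 100;
/// Shed to remote regions at 80% local utilization.
const SHED_THRESHOLD: f64 = 0.8;

struct Backend {
    healthy: bool,
    active_connections: usize,
    local: bool,
}

/// Returns the backend indices eligible for selection: healthy local
/// backends while local utilization is below the threshold, otherwise
/// every healthy backend, local or remote.
fn eligible_pool(backends: &[Backend]) -> Vec<usize> {
    let indices_where = |keep: fn(&Backend) -> bool| -> Vec<usize> {
        backends
            .iter()
            .enumerate()
            .filter(|(_, b)| keep(b))
            .map(|(i, _)| i)
            .collect()
    };
    let local_healthy = indices_where(|b| b.healthy && b.local);
    let all_healthy = indices_where(|b| b.healthy);

    // No usable local backend: remote backends are the only option.
    if local_healthy.is_empty() {
        return all_healthy;
    }

    let total_local: usize = local_healthy
        .iter()
        .map(|&i| backends[i].active_connections)
        .sum();
    let capacity = local_healthy.len() * MAX_CONNECTIONS_PER_BACKEND;
    let local_util = total_local as f64 / capacity as f64;

    if local_util < SHED_THRESHOLD {
        local_healthy
    } else {
        all_healthy
    }
}
```

With two local backends holding 70 and 85 connections, utilization is 155 / 200 = 0.775, so traffic stays local. One more burst that raises the first backend to 80 gives 165 / 200 = 0.825, and the remote backends join the pool.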
Manual Drain
The dashboard exposes drain and undrain controls per region. Draining a region excludes all its backends from selection — useful for rolling deployments or planned maintenance. The drain filter runs before any strategy logic:
// Filter out backends in drained regions
let eligible: Vec<usize> = self.backends.iter()
    .enumerate()
    .filter(|(_, b)| {
        let region = region_for_address(&b.address, &self.own_region);
        !self.drained_regions.contains(&region)
    })
    .map(|(i, _)| i)
    .collect();
Drain state is held in memory. A restart clears
all drains, which is the correct behavior: if the LB restarts, all
regions should start active.
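The drain and undrain controls amount to inserting into and removing from an in-memory set. A minimal sketch follows, with two simplifications that are assumptions rather than the actual design: each backend carries its region as a field instead of deriving it through region_for_address, and the drain, undrain, and eligible method names are hypothetical.

```rust
use std::collections::HashSet;

struct Backend {
    address: String,
    region: String,
}

struct LoadBalancer {
    backends: Vec<Backend>,
    /// Held in memory only, so a restart starts with no drained regions.
    drained_regions: HashSet<String>,
}

impl LoadBalancer {
    /// Exclude every backend in `region` from selection.
    fn drain(&mut self, region: &str) {
        self.drained_regions.insert(region.to_string());
    }

    /// Make the backends in `region` selectable again.
    fn undrain(&mut self, region: &str) {
        self.drained_regions.remove(region);
    }

    /// Addresses that survive the drain filter, in pool order.
    fn eligible(&self) -> Vec<&str> {
        self.backends
            .iter()
            .filter(|b| !self.drained_regions.contains(&b.region))
            .map(|b| b.address.as_str())
            .collect()
    }
}
```

Both operations are idempotent: draining an already drained region or undraining an active one leaves the set unchanged.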
Latency Trade-offs
Shedding to a remote region adds cross-region round-trip time: roughly 74ms between SFO and NYC, 164ms between SFO and AMS. This is a deliberate trade-off. A request served in 164ms is better than a request dropped because local backends are saturated. The shedding threshold (80%) provides a buffer — local backends still have 20% headroom for in-flight requests while remote backends absorb the overflow.
Service-Level Load Balancing
Load balancing also happens inside the system. The
routing service maintains a
multi-backend connection pool for each service it routes to. When a
request arrives for a service like "storage", routing calls
discovery::list("storage") to get all registered backends, then selects
one using the same strategy set (round-robin, least-connections, random,
or pick-2).
See routing/src/lib.rs for the ConnectionPool implementation.
struct BackendPool {
    address: String,
    connections: VecDeque<TcpStream>,
}
struct ConnectionPool {
    system_name: String,
    backends: Vec<BackendPool>,
    strategy: String,
    next_index: usize,
    max_per_backend: usize,
}
The routing pool refreshes backends from discovery every ten seconds, adding new backends and removing stale ones. Each backend maintains its own connection queue, so connections are reused efficiently and directed to the correct backend on release.
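The per-backend queues can be illustrated with a pair of acquire and release operations. This is a sketch under assumptions, not the code in routing/src/lib.rs: the method names are hypothetical, the structs are trimmed to the fields involved, and the connection type is generic (the real pool stores TcpStream) so the example runs without sockets.

```rust
use std::collections::VecDeque;

struct BackendPool<C> {
    address: String,
    connections: VecDeque<C>,
}

struct ConnectionPool<C> {
    backends: Vec<BackendPool<C>>,
    max_per_backend: usize,
}

impl<C> ConnectionPool<C> {
    /// Take the oldest idle connection queued for `address`, if any.
    /// The caller opens a fresh connection when this returns None.
    fn acquire(&mut self, address: &str) -> Option<C> {
        self.backends
            .iter_mut()
            .find(|b| b.address == address)?
            .connections
            .pop_front()
    }

    /// Return a connection to the queue of the backend it belongs to.
    /// Returns false, dropping the connection, if that backend has been
    /// removed or its queue already holds `max_per_backend` entries.
    fn release(&mut self, address: &str, conn: C) -> bool {
        let max = self.max_per_backend;
        if let Some(b) = self.backends.iter_mut().find(|b| b.address == address) {
            if b.connections.len() < max {
                b.connections.push_back(conn);
                return true;
            }
        }
        false
    }
}
```

Keying both operations on the address is what keeps a released connection from being handed out for a different backend, and dropping connections for removed backends lets stale sockets close on their own.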
This two-layer approach — gateway balancing at the edge, service-level balancing internally — provides defense in depth. Even if the gateway uses simple round-robin, the routing layer can independently optimize traffic to individual services based on their characteristics.
Design Discussion
The choice of balancing strategy depends on the workload. Round-robin works well when requests are roughly equal in cost and servers are homogeneous. Least-connections adapts naturally to heterogeneous request costs: a slow request occupies a connection longer, directing subsequent traffic to less-loaded servers. Pick-2 strikes a balance, providing most of the benefit of least-connections with the simplicity of random selection.
A production load balancer would add several features beyond what we have implemented. Weighted backends allow servers of different capacities to receive proportional traffic. Session affinity (sticky sessions) ensures that requests from the same client reach the same backend, important for stateful applications. Circuit breaking removes backends that repeatedly fail rather than probing them indefinitely. Graceful draining stops sending new requests to a backend being decommissioned while allowing existing connections to complete.
The two-layer architecture also decouples the gateway from internal routing. Because the routing layer balances traffic to individual backend services on its own, the gateway can be replaced (for example, with nginx or HAProxy) without affecting internal routing.