Chapter 5: Discovery

In a planetary scale computer, servers are ephemeral. They start, stop, move between machines, and scale up and down in response to load. A client that hardcodes the address of a server it depends on will break the moment that server moves. The discovery service solves this problem by maintaining a dynamic registry of which servers are available and where they can be found.

We introduced discovery briefly in Chapter 1. Here we examine the implementation in depth: the registry data structure, the mechanisms for keeping it current, and the design decisions that make discovery reliable enough to serve as the foundation for all other service communication.

Interface

discovery/src/lib.rs The discovery interface is deliberately minimal: two procedures. Register allows a server to announce itself, and query allows a client to find a server. This simplicity is intentional — a discovery service that is complex is a discovery service that fails in complex ways:

pub const REGISTER_PROCEDURE: ProcedureId = 1;
pub const QUERY_PROCEDURE: ProcedureId = 2;

#[derive(Debug, Serializable, Deserializable)]
pub struct RegisterArgs {
    pub name: String,
    pub address: String,
}

#[derive(Debug, Serializable, Deserializable)]
pub struct QueryArgs {
    pub name: String,
}

#[derive(Debug, Serializable, Deserializable)]
pub struct QueryResult {
    pub address: String,
}

The Registry

discovery/src/main.rs The heart of the discovery service is the Registry data structure. It maintains two maps: one from system names to lists of server addresses, and another tracking when each address last checked in. This dual-map design separates the logical concern (which servers implement which systems) from the operational concern (which servers are still alive):

#[derive(Default)]
pub struct Registry {
    registry: HashMap<Name, Vec<Address>>,
    last_ping: HashMap<Address, Instant>,
}

impl Registry {
    fn register(&mut self, name: Name, address: Address) {
        if let Some(time) = self.last_ping.get_mut(&address) {
            *time = Instant::now();
        } else {
            self.registry.entry(name)
                .or_insert_with(Vec::new)
                .push(address.clone());
            self.last_ping.insert(address, Instant::now());
        }
    }

    fn get_address(&self, name: &Name) -> Option<&Address> {
        self.registry.get(name)?.choose(&mut rand::thread_rng())
    }
}

The register method handles both initial registration and re-registration. When an address that is already known re-registers, only the timestamp is updated — the address is not duplicated in the registry. This idempotency is important because servers register periodically, and duplicating entries would bias the random selection in get_address.

The get_address method uses random selection to distribute load across all servers implementing a system. This provides a simple form of load balancing: if three servers implement the echo system, each query has roughly a one-in-three chance of returning each server's address.

Stale Entry Cleanup

Servers can fail without deregistering — a crash, a network partition, or a hardware failure will leave stale entries in the registry. The cleanup mechanism removes addresses that have not sent a heartbeat within a configurable duration:

fn cleanup_stale(&mut self) {
    let now = Instant::now();
    let stale_addresses: HashSet<_> = self.last_ping.iter()
        .filter(|&(_, time)|
            now.duration_since(*time) > CLEANUP_DURATION)
        .map(|(address, _)| address.clone())
        .collect();

    for address in stale_addresses {
        self.last_ping.remove(&address);
        self.registry.retain(|_, v| {
            v.retain(|a| a != &address);
            !v.is_empty()
        });
    }
}

The cleanup runs as a background task, periodically scanning the registry for entries whose last ping timestamp exceeds the threshold. When a stale address is found, it is removed from both maps. If removing the address leaves a system name with no servers, the system name itself is removed.

Registration with Exponential Backoff

The client side of discovery is equally important. Each server must continuously re-register itself to prevent being cleaned up as stale. The registration function spawns a background task that registers immediately and then re-registers at a regular interval:

pub fn register(name: String, address: String) {
    tokio::spawn(async move {
        let mut retries = 0;
        let max_retries = 10;
        loop {
            // Register with discovery service
            let args = RegisterArgs {
                name: name.clone(),
                address: address.clone(),
            };
            match client::send_request(
                DISCOVERY_ADDRESS,
                Request {
                    procedure_id: REGISTER_PROCEDURE,
                    payload: args.serialize(),
                },
            ).await {
                Ok(_) => { retries = 0; }
                Err(_) => {
                    retries += 1;
                    if retries >= max_retries { break; }
                }
            }

            // Re-register periodically
            tokio::time::sleep(Duration::from_secs(5)).await;
        }
    });
}

When the discovery service is unavailable, the registration uses exponential backoff with a maximum delay. This prevents a thundering herd of re-registration attempts when the discovery service restarts after an outage. The delay starts at one second and increases up to sixty seconds, with each retry doubling the previous delay.

Design Discussion

A critical question is how services find the discovery service itself. Our implementation uses a well-known address (127.0.0.1:10200). In production, there are several alternatives. DNS can map a well-known hostname to the discovery service's current address. Multicast or broadcast protocols can announce the discovery service on a local network. Or a small set of stable addresses can be hardcoded as the discovery service's home.

The random selection in get_address provides basic load balancing but is not aware of server health or load. The routing service builds on top of discovery to add health-aware routing and connection pooling. This layering — discovery handles location, routing handles health and efficiency — keeps each service focused and simple.

The cleanup duration creates a trade-off between responsiveness and stability. A short cleanup duration (say, 5 seconds) quickly removes failed servers but may incorrectly remove servers that are merely slow to re-register. A longer duration (say, 60 seconds) is more forgiving but means clients may be directed to failed servers for up to a minute after they fail. Our implementation uses 10 seconds as a compromise.

In larger deployments, the discovery service itself must be replicated for availability. Since discovery is essentially a registry of key-value pairs that must be consistent across replicas, it is a natural candidate for the consensus protocol we examined in the previous chapter.