Chapter 2: Design

Before writing a single line of code, a systems engineer must answer a fundamental question: what problem are we solving? The design process is where ambiguity is transformed into clarity, where requirements are distilled into interfaces, and where trade-offs are made explicit before they become expensive to change. Good design is the foundation upon which reliable systems are built.

In the context of planetary-scale computing, design takes on additional dimensions. A system that works on a single machine may fail spectacularly when distributed across thousands of servers. A design that is elegant for ten users may collapse under ten million. The design process must anticipate scale, failure, and evolution from the very beginning.

The Problem Statement

Every system begins with a problem. The problem statement defines what the system must accomplish, who it serves, and what constraints it operates under. A well-written problem statement is specific enough to guide design decisions but general enough to avoid premature commitment to implementation details.

Consider the systems we are building in this book. The echo system's problem statement is simple: given a message from a client, return that same message. The discovery system's problem is more nuanced: given a system name, return the address of a healthy server that implements that system, even as servers come and go. The storage system's problem adds another dimension: persist data durably across process restarts and hardware failures.

Each problem statement implicitly defines the system's scope. A discovery system discovers servers — it does not route requests to them. A storage system stores data — it does not cache it in memory for fast access. Keeping scope tight is one of the most important design principles. When a system tries to do too much, it becomes harder to understand, harder to test, and harder to operate.

The Design Document

A design document translates a problem statement into a concrete plan. It typically includes four sections: the interface the system will expose, the data structures it will maintain, the algorithms it will use, and the trade-offs it accepts. The design document is a contract between the designer and the implementer — even when they are the same person.

The interface section defines how other systems will interact with this one. In our systems, interfaces are defined as RPC procedures with typed arguments and results. The configuration service, for example, defines five procedures: get, set, delete, list, and watch. Each procedure has a unique identifier, a request structure, and a response structure. This interface is specified in the shared library (lib.rs) before any server code is written.
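As a rough illustration, an interface of this shape might be written down as follows. This is a minimal sketch, not the book's actual lib.rs: the identifier values, type names, and field names are all assumptions, and only the get and set structures are shown, since the remaining three procedures would follow the same pattern.

```rust
// Hypothetical lib.rs excerpt for a configuration service.
// Identifier values, type names, and field names are assumptions.

/// One unique identifier per procedure.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u32)]
pub enum ProcedureId {
    Get = 1,
    Set = 2,
    Delete = 3,
    List = 4,
    Watch = 5,
}

impl ProcedureId {
    /// Map a wire-level identifier back to a known procedure, if any.
    pub fn from_u32(id: u32) -> Option<Self> {
        match id {
            1 => Some(Self::Get),
            2 => Some(Self::Set),
            3 => Some(Self::Delete),
            4 => Some(Self::List),
            5 => Some(Self::Watch),
            _ => None,
        }
    }
}

/// Request structure for `get`.
#[derive(Debug, Clone, PartialEq)]
pub struct GetRequest {
    pub key: String,
}

/// Response structure for `get`; `None` means the key is not set.
#[derive(Debug, Clone, PartialEq)]
pub struct GetResponse {
    pub value: Option<String>,
}

/// Request structure for `set`.
#[derive(Debug, Clone, PartialEq)]
pub struct SetRequest {
    pub key: String,
    pub value: String,
}

/// Response structure for `set`; carries the value that was replaced, if any.
#[derive(Debug, Clone, PartialEq)]
pub struct SetResponse {
    pub previous: Option<String>,
}
```

Writing the types first means that a client author can see exactly what each call sends and receives without reading any server code.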

The data structures section describes what state the system maintains and how it is organized. The caching service maintains a hash map for fast lookups and a deque for LRU ordering. The storage service maintains an in-memory hash map, a write-ahead log, and periodic snapshots. These choices directly affect the system's performance characteristics.
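The pairing of a hash map with a deque can be sketched as below. This is a simplified illustration under assumed names (LruCache, touch), not the caching service's real code. In particular, moving a key to the back of the deque here is a linear scan, which a production cache would likely replace with something faster.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical LRU cache: the map gives fast lookups,
/// the deque records recency (front = least recently used).
pub struct LruCache {
    capacity: usize,
    map: HashMap<String, String>,
    order: VecDeque<String>,
}

impl LruCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Mark `key` as most recently used. O(n) in this sketch.
    fn touch(&mut self, key: &str) {
        if let Some(pos) = self.order.iter().position(|k| k.as_str() == key) {
            self.order.remove(pos);
        }
        self.order.push_back(key.to_string());
    }

    /// Look up a key and refresh its recency on a hit.
    pub fn get(&mut self, key: &str) -> Option<String> {
        let value = self.map.get(key).cloned()?;
        self.touch(key);
        Some(value)
    }

    /// Insert or update a key, evicting the least recently used entry
    /// when a new key would exceed the capacity.
    pub fn put(&mut self, key: &str, value: &str) {
        if !self.map.contains_key(key) && self.map.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        self.map.insert(key.to_string(), value.to_string());
        self.touch(key);
    }

    pub fn len(&self) -> usize {
        self.map.len()
    }
}
```

The choice of structures shows up directly in behavior: lookups cost one hash probe, while the recency bookkeeping determines which entry is lost when the cache is full.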

The algorithms section describes how the system processes requests. For many of our services, this is straightforward: deserialize the request, perform an operation on the data structure, serialize the response. For more complex systems like consensus, the algorithm section describes election protocols, log replication, and state machine application.
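The three-step shape of a simple request handler can be sketched as follows. The line-oriented text format used here (GET, SET, VALUE, and so on) is an assumption chosen to keep the example short. It is not the wire format of any service in this book, and the Request and Response types are likewise hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug)]
enum Request {
    Get { key: String },
    Set { key: String, value: String },
}

#[derive(Debug)]
enum Response {
    Value(Option<String>),
    Ok,
    Error(String),
}

/// Step 1: turn raw bytes into a typed request.
fn deserialize(bytes: &[u8]) -> Result<Request, String> {
    let text = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
    let mut parts = text.trim_end().splitn(3, ' ');
    match (parts.next(), parts.next(), parts.next()) {
        (Some("GET"), Some(key), None) => Ok(Request::Get { key: key.to_string() }),
        (Some("SET"), Some(key), Some(value)) => Ok(Request::Set {
            key: key.to_string(),
            value: value.to_string(),
        }),
        _ => Err(format!("malformed request: {:?}", text)),
    }
}

/// Step 3: turn a typed response back into bytes.
fn serialize(response: &Response) -> Vec<u8> {
    let text = match response {
        Response::Value(Some(v)) => format!("VALUE {}\n", v),
        Response::Value(None) => "NOT_FOUND\n".to_string(),
        Response::Ok => "OK\n".to_string(),
        Response::Error(msg) => format!("ERROR {}\n", msg),
    };
    text.into_bytes()
}

/// Deserialize, operate on the data structure (step 2), serialize.
fn handle(bytes: &[u8], store: &mut HashMap<String, String>) -> Vec<u8> {
    let response = match deserialize(bytes) {
        Ok(Request::Get { key }) => Response::Value(store.get(&key).cloned()),
        Ok(Request::Set { key, value }) => {
            store.insert(key, value);
            Response::Ok
        }
        Err(msg) => Response::Error(msg),
    };
    serialize(&response)
}
```

Keeping the three steps in separate functions means the operation on the data structure can be tested without touching the wire format, and the reverse.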

The trade-offs section is perhaps the most important. Every design decision involves trade-offs, and making them explicit prevents surprises later. The caching service trades memory for speed. The storage service trades write latency for durability: every write goes to the write-ahead log (WAL) before it is acknowledged. The configuration service trades consistency for availability: watchers are notified of changes eventually, via broadcast channels, rather than synchronously with each write.
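The storage service's trade-off is easiest to see in the order of operations on the write path. The sketch below is an assumption-laden illustration, not the storage service's implementation: the DurableLog trait, the Store type, and the record format are invented for the example, and log replay and snapshots are omitted. The in-memory Vec<u8> implementation exists only so the logic can be exercised without a disk.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Write};

/// Hypothetical abstraction over "a log that survives a crash".
pub trait DurableLog {
    fn append_durably(&mut self, record: &str) -> io::Result<()>;
}

/// On a real file, durability means waiting for the data to reach disk.
/// This wait is the write latency the design accepts.
impl DurableLog for File {
    fn append_durably(&mut self, record: &str) -> io::Result<()> {
        self.write_all(record.as_bytes())?;
        self.sync_all()
    }
}

/// In-memory stand-in for tests; it is NOT durable.
impl DurableLog for Vec<u8> {
    fn append_durably(&mut self, record: &str) -> io::Result<()> {
        self.extend_from_slice(record.as_bytes());
        Ok(())
    }
}

pub struct Store<L: DurableLog> {
    wal: L,
    map: HashMap<String, String>,
}

impl<L: DurableLog> Store<L> {
    pub fn new(wal: L) -> Self {
        Self { wal, map: HashMap::new() }
    }

    /// Log first, then apply, then acknowledge by returning Ok.
    /// If the log append fails, the in-memory state is left untouched.
    pub fn set(&mut self, key: &str, value: &str) -> io::Result<()> {
        self.wal.append_durably(&format!("SET {} {}\n", key, value))?;
        self.map.insert(key.to_string(), value.to_string());
        Ok(())
    }

    pub fn get(&self, key: &str) -> Option<&str> {
        self.map.get(key).map(String::as_str)
    }

    pub fn wal(&self) -> &L {
        &self.wal
    }
}
```

Because the caller sees Ok only after the log append returns, an acknowledged write is one that a restart could recover from the log.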

Interface-First Design

A pattern that appears throughout our systems is interface-first design. The shared library (lib.rs) defines the interface before the server (main.rs) implements it. This has several advantages.

First, it forces the designer to think about the system from the client's perspective. What operations does a client need? What data does it send and receive? This outside-in thinking produces cleaner interfaces than starting from the implementation and working outward.

Second, it enables parallel development. Once the interface is defined, clients can be written against it (using stubs or mocks) while the server is being implemented. In a large organization, different teams can work on the client and server simultaneously.
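One way to picture this is a client written against a trait, with a stub standing in for the server that does not exist yet. The names below (DiscoveryClient, StubDiscovery, resolve) are hypothetical and chosen only for this sketch.

```rust
use std::collections::HashMap;

/// Hypothetical client-side view of the discovery interface.
pub trait DiscoveryClient {
    /// Return the address of a healthy server for `system`, if one is known.
    fn lookup(&self, system: &str) -> Option<String>;
}

/// Stub that answers from a fixed table instead of calling a real server.
pub struct StubDiscovery {
    servers: HashMap<String, String>,
}

impl StubDiscovery {
    pub fn with_server(system: &str, address: &str) -> Self {
        let mut servers = HashMap::new();
        servers.insert(system.to_string(), address.to_string());
        Self { servers }
    }
}

impl DiscoveryClient for StubDiscovery {
    fn lookup(&self, system: &str) -> Option<String> {
        self.servers.get(system).cloned()
    }
}

/// Client logic depends only on the trait, so it can be written and tested
/// before the real discovery server is finished.
pub fn resolve(discovery: &impl DiscoveryClient, system: &str) -> Result<String, String> {
    discovery
        .lookup(system)
        .ok_or_else(|| format!("no healthy server for {}", system))
}
```

When the real server is ready, a network-backed implementation of the same trait can replace the stub without changes to the client logic.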

Third, it provides a natural versioning boundary. When a procedure's interface changes, the new version is given a new procedure identifier, so both old and new versions can coexist during the transition. This is critical for systems that cannot tolerate downtime during upgrades.
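Coexistence of versions can be sketched as a dispatcher that recognizes both identifiers. Everything here is an assumption made for illustration: the identifier values, the echo_v1 and echo_v2 functions, and the idea that the second version adds a length prefix.

```rust
// Hypothetical procedure identifiers; the values are assumptions.
const ECHO_V1: u32 = 100;
const ECHO_V2: u32 = 101;

/// Original behavior: return the message unchanged.
fn echo_v1(message: &str) -> String {
    message.to_string()
}

/// Changed interface (invented for this sketch): the response is
/// prefixed with the message length in bytes.
fn echo_v2(message: &str) -> String {
    format!("{}:{}", message.len(), message)
}

/// The server keeps both handlers registered during the transition,
/// so old clients keep working while new clients adopt the new identifier.
pub fn dispatch(procedure_id: u32, payload: &str) -> Result<String, String> {
    match procedure_id {
        ECHO_V1 => Ok(echo_v1(payload)),
        ECHO_V2 => Ok(echo_v2(payload)),
        other => Err(format!("unknown procedure {}", other)),
    }
}
```

Once no client sends the old identifier any longer, its match arm can be removed in a later release.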

Resources

Every system consumes resources: CPU, memory, storage, network bandwidth, and file descriptors, among others. A good design accounts for resource usage and establishes budgets. The caching service has a MAX_CAPACITY that bounds memory usage. The monitoring service has a MAX_METRIC_WINDOW that bounds the number of data points stored per metric.
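A budget of this kind is usually enforced at the point where data is added. The sketch below shows one plausible way to bound a per-metric window. The constant's value and the MetricWindow type are assumptions, since the text names MAX_METRIC_WINDOW but does not give its value or show how it is used.

```rust
use std::collections::VecDeque;

/// Assumed value; the real constant's value is not given in the text.
const MAX_METRIC_WINDOW: usize = 4;

/// Hypothetical per-metric storage holding at most MAX_METRIC_WINDOW points.
pub struct MetricWindow {
    points: VecDeque<f64>,
}

impl MetricWindow {
    pub fn new() -> Self {
        Self { points: VecDeque::with_capacity(MAX_METRIC_WINDOW) }
    }

    /// Record a data point, discarding the oldest one when the window is full.
    pub fn record(&mut self, value: f64) {
        if self.points.len() == MAX_METRIC_WINDOW {
            self.points.pop_front();
        }
        self.points.push_back(value);
    }

    pub fn len(&self) -> usize {
        self.points.len()
    }

    /// Mean of the points currently in the window, or None if it is empty.
    pub fn mean(&self) -> Option<f64> {
        if self.points.is_empty() {
            return None;
        }
        Some(self.points.iter().sum::<f64>() / self.points.len() as f64)
    }
}
```

With the bound enforced in record, memory per metric stays constant no matter how long the service runs.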

Resource budgets interact with each other. A system that uses less memory might need more CPU (for compression). A system that uses less network bandwidth might need more storage (for batching). Understanding these interactions is a key part of the design process.

Management

A system that cannot be managed cannot be operated at scale. Design must include management concerns from the beginning. How will the system be configured? How will its health be monitored? How will it be deployed and updated? How will operators debug problems?

Our systems address these concerns through integration with the configuration, monitoring, and discovery services. Each service registers itself with discovery on startup, reports metrics to monitoring, and reads runtime parameters from configuration. This management infrastructure is as important as the business logic itself.
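The startup sequence described above might look roughly like the following. This is a sketch under stated assumptions: the three traits, the in-memory implementations, the "max_connections" parameter, its default of 64, and the "service_started" metric are all invented for illustration and do not come from the book's services.

```rust
use std::collections::HashMap;

// Hypothetical client-side views of the three management services.
pub trait Discovery {
    fn register(&mut self, system: &str, address: &str);
}
pub trait Monitoring {
    fn report(&mut self, metric: &str, value: f64);
}
pub trait Configuration {
    fn get(&self, key: &str) -> Option<String>;
}

// In-memory stand-ins so the sequence can be exercised without a network.
#[derive(Default)]
pub struct InMemoryDiscovery {
    pub registered: Vec<(String, String)>,
}
impl Discovery for InMemoryDiscovery {
    fn register(&mut self, system: &str, address: &str) {
        self.registered.push((system.to_string(), address.to_string()));
    }
}

#[derive(Default)]
pub struct InMemoryMonitoring {
    pub reports: Vec<(String, f64)>,
}
impl Monitoring for InMemoryMonitoring {
    fn report(&mut self, metric: &str, value: f64) {
        self.reports.push((metric.to_string(), value));
    }
}

#[derive(Default)]
pub struct InMemoryConfiguration {
    pub params: HashMap<String, String>,
}
impl Configuration for InMemoryConfiguration {
    fn get(&self, key: &str) -> Option<String> {
        self.params.get(key).cloned()
    }
}

/// Read runtime parameters, register with discovery, report a startup metric.
/// Returns the connection limit the service should apply.
pub fn start_service(
    name: &str,
    address: &str,
    discovery: &mut impl Discovery,
    monitoring: &mut impl Monitoring,
    configuration: &impl Configuration,
) -> usize {
    // Fall back to an assumed default when the parameter is missing or invalid.
    let max_connections = configuration
        .get("max_connections")
        .and_then(|raw| raw.parse::<usize>().ok())
        .unwrap_or(64);
    discovery.register(name, address);
    monitoring.report("service_started", 1.0);
    max_connections
}
```

Because the management services are passed in as parameters, the same startup code can run against real clients in production and against in-memory stand-ins in tests.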

The design process is iterative. A first design is rarely the final design. As implementation reveals unforeseen challenges, as testing exposes edge cases, and as operation surfaces real-world behavior, the design evolves. The key is to make the design explicit and revisable, not to make it perfect on the first attempt.