Chapter 10: Operation

Building a system is only half the work. The other half is keeping it running. Operation is the practice of deploying, monitoring, maintaining, and evolving systems in production. A system that is difficult to operate will eventually fail, no matter how well it is designed and implemented.

Starting Services

Our planetary-scale computer consists of multiple services, and the first two must start in a specific order. The discovery service must be available before other services can register. The configuration service should start next, since other services may read their settings from it. The remaining services (storage, caching, monitoring, routing, and application services) can then start in any order, because each one retries registration with discovery until it succeeds.
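The retry behavior described above can be sketched as a small helper. This is a minimal illustration rather than the system's actual client code: the `register` callable, the attempt limit, and the backoff parameters are all assumptions introduced for the example.

```python
import time
from typing import Callable


def register_with_retry(
    register: Callable[[], bool],
    max_attempts: int = 10,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call register() until it succeeds; return the number of attempts used.

    `register` is a hypothetical callable that returns True once discovery
    has accepted the registration, and returns False or raises
    ConnectionError while discovery is still unavailable.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            if register():
                return attempt
        except ConnectionError:
            pass  # discovery not reachable yet; treat as a failed attempt
        if attempt < max_attempts:
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
    raise RuntimeError(f"registration failed after {max_attempts} attempts")
```

Doubling the delay between attempts keeps a late-starting service from hammering discovery while it comes up. The `sleep` parameter is injectable so the helper can be exercised in tests without real waiting.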

The start.sh script encodes this ordering. It launches each service as a background process, waits briefly for startup, and proceeds to the next. This is adequate for development but insufficient for production, where a fixed wait does not guarantee that a service is actually ready, and where services should be managed by a process supervisor that handles restarts, resource limits, and dependency ordering.
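One way to make the ordering explicit, rather than implied by the sequence of lines in a script, is to declare each service's dependencies and derive the start order from them. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The dependency map is an assumption based on the ordering described in this chapter, not a listing taken from start.sh.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical dependency map: service -> services that must be up first.
DEPENDENCIES: Dict[str, Set[str]] = {
    "discovery": set(),
    "configuration": {"discovery"},
    "storage": {"discovery", "configuration"},
    "caching": {"discovery", "configuration"},
    "monitoring": {"discovery", "configuration"},
    "routing": {"discovery", "configuration"},
    "application": {"discovery", "configuration"},
}


def startup_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group services into waves; every service in a wave may start in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(startup_waves(DEPENDENCIES), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

With this map the result is three waves: discovery alone, then configuration, then the remaining five services together. A process supervisor expresses the same idea declaratively. The sketch only shows how the order falls out of the dependencies.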

Health Checks

Once services are running, operators need to know whether they are healthy. A running process is not necessarily a healthy process — it may be stuck in a deadlock, overwhelmed by traffic, or unable to reach its dependencies. Health checks provide a standard way for services to report their current state.

Our services use the monitoring service's heartbeat mechanism for health reporting. Each service periodically sends a heartbeat with its status. The monitoring service marks any service that misses its heartbeat window as unhealthy. This information feeds into the dashboard, which alerts operators to problems, and into the routing service, which avoids sending traffic to unhealthy services.
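The heartbeat window can be illustrated with a small in-memory tracker. This is a sketch under stated assumptions, not the monitoring service's real implementation: the class name, the ten-second default window, and the `"ok"` status string are all hypothetical.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per service and flags the ones that go quiet."""

    def __init__(
        self,
        window_seconds: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window_seconds = window_seconds
        self._clock = clock
        self._last_seen: Dict[str, float] = {}
        self._status: Dict[str, str] = {}

    def heartbeat(self, service: str, status: str = "ok") -> None:
        """Record a heartbeat and the status the service reported with it."""
        self._last_seen[service] = self._clock()
        self._status[service] = status

    def is_healthy(self, service: str) -> bool:
        """Healthy means: seen within the window and last reported 'ok'."""
        last = self._last_seen.get(service)
        if last is None:
            return False  # never heard from this service
        within_window = self._clock() - last <= self.window_seconds
        return within_window and self._status[service] == "ok"

    def unhealthy(self) -> List[str]:
        """Names of known services that are currently unhealthy."""
        return sorted(s for s in self._last_seen if not self.is_healthy(s))
```

A router could call `is_healthy` before choosing a backend, and a dashboard could poll `unhealthy`. The sketch uses a monotonic clock so that adjustments to the wall clock cannot make a service look alive or dead by accident.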

Observability

Health checks tell you whether something is wrong. Observability tells you what is wrong and why. The three pillars of observability are metrics (numerical measurements over time), logs (discrete events), and traces (the path of a request through multiple services).

Our monitoring service handles metrics. Each service can report arbitrary metric values (request counts, latencies, error rates), which the monitoring service stores in rolling windows. For logs, each service writes to standard output, where the output can be collected and aggregated by a log management system. Distributed tracing, which follows a single request as it traverses discovery, routing, and backend services, is an area where our system could be extended.
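A rolling window can be as simple as a bounded queue per metric name. The sketch below assumes a count-based window that keeps the most recent N samples. The monitoring service may instead use a time-based window, and the class and method names here are illustrative only.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RollingMetrics:
    """Keeps the most recent `window_size` samples for each metric name."""

    def __init__(self, window_size: int = 100) -> None:
        # deque(maxlen=...) silently drops the oldest sample when full.
        self._series: Dict[str, Deque[float]] = defaultdict(
            lambda: deque(maxlen=window_size)
        )

    def report(self, name: str, value: float) -> None:
        """Record one sample for the named metric."""
        self._series[name].append(value)

    def summary(self, name: str) -> Dict[str, float]:
        """Summarize the samples currently inside the window."""
        values = self._series.get(name)  # .get avoids creating empty series
        if not values:
            return {"count": 0}
        return {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }
```

Bounding each series keeps memory use constant no matter how long a service runs, at the cost of forgetting older samples.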

Operational Runbooks

When something goes wrong at 3 AM, operators should not need to reason from first principles. Operational runbooks document common failure scenarios and their remediation steps. For our system, key runbooks would include: what to do when the discovery service is down (restart it; other services will re-register automatically), when storage is full (trigger compaction or expand capacity), and when a consensus ensemble loses quorum (identify and restart the failed members).
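The quorum runbook begins with a check that is easy to express in code. The sketch assumes a simple majority quorum, which is the usual rule for consensus ensembles but is not stated explicitly in this chapter. The member names and the shape of the input are hypothetical.

```python
from typing import Dict, List, Tuple


def quorum_size(ensemble_size: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return ensemble_size // 2 + 1


def check_quorum(members: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Given member -> is_alive, return (has_quorum, failed members to restart)."""
    alive = sum(1 for up in members.values() if up)
    failed = sorted(name for name, up in members.items() if not up)
    return alive >= quorum_size(len(members)), failed
```

For a three-member ensemble with two members down, `check_quorum` reports that quorum is lost and names the two members to restart. With only one member down, quorum still holds, but the failed member is still listed so it can be repaired before a second failure occurs.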

The best runbooks are written by the people who built the system, updated by the people who operate it, and tested regularly to ensure they still work. Over time, the most common runbook steps should be automated, reducing the burden on human operators.
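A first step toward automation is to store runbooks as data rather than as free-form documents, so that tooling can look up the steps and tests can confirm that every known alert has an entry. The alert names and the fallback escalation step below are assumptions made for illustration. The remediation steps paraphrase the scenarios described above.

```python
from typing import Dict, List

# Hypothetical alert names mapped to the remediation steps from this chapter.
RUNBOOKS: Dict[str, List[str]] = {
    "discovery_down": [
        "restart the discovery service",
        "confirm that other services have re-registered",
    ],
    "storage_full": [
        "trigger compaction",
        "expand capacity if space is still insufficient",
    ],
    "quorum_lost": [
        "identify the failed ensemble members",
        "restart the failed members",
    ],
}


def remediation_steps(alert: str) -> List[str]:
    """Return the documented steps, or an escalation step for unknown alerts."""
    return RUNBOOKS.get(alert, ["escalate to the on-call engineer"])
```

Once the steps live in a structure like this, individual entries can be replaced one at a time with functions that carry out the step automatically.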