Chapter 24: Geo Replication

A system that runs in a single datacenter is a system with a single point of failure. No matter how many replicas you run, no matter how carefully you design your failover, a backhoe through the fiber, a power grid outage, or a cooling system failure can take everything offline at once. Global distribution — running the full system in multiple geographic regions — is how planetary scale systems survive regional failures and serve users with low latency worldwide.

Why Distribute

Three forces drive global distribution. Latency: light takes 74ms to travel from San Francisco to New York and back, and 165ms to Amsterdam. Users notice. Running a copy of the system in each region means most requests are served locally. Availability: independent infrastructure in separate regions means a failure in one region does not affect the others. If the SFO region goes dark, NYC and AMS continue serving traffic. Data sovereignty: some jurisdictions require that data about their citizens remain within their borders. A multi-region architecture makes compliance possible.

Full Stack Per Region

Our approach is simple: each region runs the complete system. Discovery, routing, caching, storage, monitoring, scheduling — every service runs in every region. Local requests never leave the region. Only two things cross regional boundaries: storage replication (so data written in one region eventually appears in others) and cache invalidation (so stale entries are purged everywhere).

This is sometimes called an “active-active” multi-region deployment: every region can handle reads and writes independently, with asynchronous replication keeping them in sync.

The WireGuard Mesh

The regions communicate over a private WireGuard mesh network. Each region has a WireGuard IP on a shared 10.0.0.0/24 subnet:

SFO   10.0.0.1
NYC   10.0.0.2
AMS   10.0.0.3

WireGuard provides encrypted, authenticated tunnels with minimal overhead. The mesh topology means every region can reach every other region directly, without routing through a hub. See Chapter 32: Network for more on network overlays.

This private overlay network means services can bind to 0.0.0.0 and accept connections from both local services (via 127.0.0.1) and remote regions (via the WireGuard interface). The BIND_HOST environment variable controls the listen address, and region.env configures the per-region settings:

# region.env (per droplet)
REGION=sfo
DISCOVERY_PEERS=10.0.0.2:10200,10.0.0.3:10200

Federated Discovery

The key insight of our multi-region architecture is that discovery itself becomes federated. Each region's discovery instance maintains two registries: a local registry of services that registered directly (the same registry from Chapter 5), and a federated registry of services forwarded from peer discovery instances in other regions.

A background task runs every five seconds. It collects all locally-registered services, rewrites their 127.0.0.1 addresses to the region's WireGuard IP, and forwards these registrations to each peer discovery instance using a FEDERATED_REGISTER RPC. The remote discovery stores these in its federated registry with the same staleness-based expiry used for local entries.

This is a gossip-like protocol: each discovery instance tells its peers about the services it knows. Stale entries are cleaned up automatically, so if a region goes offline its services disappear from all registries within seconds.

The result is two complementary views. discovery::list("storage") returns all storage instances across all regions — useful for cache invalidation, which must propagate globally. discovery::list_local("storage") returns only the current region's instances — useful for quorum operations, where cross-region latency would make consensus impractically slow.

WAL Tailer

Cross-region storage replication is handled by the WAL tailer, a lightweight service that reads the write-ahead log files produced by local storage instances and replays them to remote storage instances. It runs one instance per region.

The tailer discovers local storage instances via discovery::list_local("storage") and remote instances by subtracting the local set from discovery::list("storage"). For each local instance, it maintains a byte offset into the WAL file and periodically reads new entries:

// WAL entry format (same as storage engine)
VPUT key=value@version
VDEL key@version

Each new entry is sent to every remote storage instance via storage::replicate_put() or storage::replicate_delete(). The versioned replication protocol ensures idempotency: if an entry has already been applied (because its version is older than the current version for that key), the remote instance simply ignores it.

When storage compaction occurs (the WAL is truncated and a snapshot is written), the tailer detects the file size decrease, reads the snapshot entries, and resends them. This is safe because replicated writes are idempotent.

Cache Invalidation Across Regions

The caching service already supports multiple consistency modes: eventual, quorum, and strong (see Chapter 22). With federated discovery, discovery::list("caching") now returns cache instances from all regions. The existing replication logic — which propagates SET and DELETE operations to peers — automatically extends across regions.

In eventual mode, cache writes are replicated asynchronously to all peers including those in remote regions. In quorum mode, the write waits for one peer acknowledgment, which will typically come from a local peer (since local responses arrive first). In strong mode, the write waits for all peers, including remote ones — which means it pays the full cross-region latency penalty.

Region Configuration

Each region is configured via a region.env file that is sourced at startup. The only required setting is DISCOVERY_PEERS — the WireGuard addresses of the other regions' discovery instances. This single variable bootstraps the entire cross-region topology:

# SFO (10.0.0.1) — peers are NYC and AMS
REGION=sfo
DISCOVERY_PEERS=10.0.0.2:10200,10.0.0.3:10200

# NYC (10.0.0.2) — peers are SFO and AMS
REGION=nyc
DISCOVERY_PEERS=10.0.0.1:10200,10.0.0.3:10200

# AMS (10.0.0.3) — peers are SFO and NYC
REGION=ams
DISCOVERY_PEERS=10.0.0.1:10200,10.0.0.2:10200

Compare this with the previous approach, which required separate STORAGE_PEERS and CACHE_PEERS environment variables listing every remote instance of every service. Federated discovery reduces the configuration surface to a single variable per region.

Latency Trade-offs

The inter-region latencies in our three-region deployment are roughly:

SFO ↔ NYC    ~74ms RTT
SFO ↔ AMS   ~165ms RTT
NYC ↔ AMS    ~92ms RTT

These latencies make cross-region quorum consensus impractical for most workloads. A storage write with W=2 quorum that includes a remote peer would add 74–165ms to every write. Instead, our architecture uses local quorum for consistency within a region (fast, sub-millisecond) and asynchronous replication across regions (eventually consistent, but no latency penalty on the write path).

This is the fundamental trade-off of global distribution: you can have strong consistency across regions or low-latency writes, but not both. Our system chooses low-latency writes with eventual cross-region consistency. For workloads that require stronger guarantees, the caching layer's strong consistency mode is available — at the cost of cross-region latency on every operation.