Chapter 24: Geo Replication
A system that runs in a single datacenter is a system with a single point of failure. No matter how many replicas you run, no matter how carefully you design your failover, a backhoe through the fiber, a power grid outage, or a cooling system failure can take everything offline at once. Global distribution — running the full system in multiple geographic regions — is how planetary scale systems survive regional failures and serve users with low latency worldwide.
Why Distribute
Three forces drive global distribution. Latency: light takes 74ms to travel from San Francisco to New York and back, and 165ms to Amsterdam. Users notice. Running a copy of the system in each region means most requests are served locally. Availability: independent infrastructure in separate regions means a failure in one region does not affect the others. If the SFO region goes dark, NYC and AMS continue serving traffic. Data sovereignty: some jurisdictions require that data about their citizens remain within their borders. A multi-region architecture makes compliance possible.
Full Stack Per Region
Our approach is simple: each region runs the complete system. Discovery, routing, caching, storage, monitoring, scheduling — every service runs in every region. Local requests never leave the region. Only two things cross regional boundaries: storage replication (so data written in one region eventually appears in others) and cache invalidation (so stale entries are purged everywhere).
This is sometimes called an “active-active” multi-region deployment: every region can handle reads and writes independently, with asynchronous replication keeping them in sync.The WireGuard Mesh
The regions communicate over a private WireGuard mesh network. Each region
has a WireGuard IP on a shared 10.0.0.0/24 subnet:
SFO 10.0.0.1
NYC 10.0.0.2
AMS 10.0.0.3
WireGuard provides encrypted, authenticated tunnels with
minimal overhead. The mesh topology means every region can reach every other
region directly, without routing through a hub. See
Chapter 32: Network for more on network
overlays.
This private overlay network means services can bind to 0.0.0.0
and accept connections from both local services (via 127.0.0.1)
and remote regions (via the WireGuard interface). The BIND_HOST
environment variable controls the listen address, and region.env
configures the per-region settings:
# region.env (per droplet)
REGION=sfo
DISCOVERY_PEERS=10.0.0.2:10200,10.0.0.3:10200
Federated Discovery
The key insight of our multi-region architecture is that discovery itself becomes federated. Each region's discovery instance maintains two registries: a local registry of services that registered directly (the same registry from Chapter 5), and a federated registry of services forwarded from peer discovery instances in other regions.
A background task runs every five seconds. It collects all locally-registered
services, rewrites their 127.0.0.1 addresses to the region's
WireGuard IP, and forwards these registrations to each peer discovery instance
using a FEDERATED_REGISTER RPC. The remote discovery stores these
in its federated registry with the same staleness-based expiry used for local
entries.
The result is two complementary views. discovery::list("storage")
returns all storage instances across all regions — useful for cache
invalidation, which must propagate globally.
discovery::list_local("storage") returns only the current region's
instances — useful for quorum operations, where cross-region latency would
make consensus impractically slow.
WAL Tailer
Cross-region storage replication is handled by the WAL tailer, a lightweight service that reads the write-ahead log files produced by local storage instances and replays them to remote storage instances. It runs one instance per region.
The tailer discovers local storage instances via
discovery::list_local("storage") and remote instances by
subtracting the local set from discovery::list("storage"). For
each local instance, it maintains a byte offset into the WAL file and
periodically reads new entries:
// WAL entry format (same as storage engine)
VPUT key=value@version
VDEL key@version
Each new entry is sent to every remote storage instance via
storage::replicate_put() or storage::replicate_delete().
The versioned replication protocol ensures idempotency: if an entry has already
been applied (because its version is older than the current version for that
key), the remote instance simply ignores it.
Cache Invalidation Across Regions
The caching
service already supports multiple consistency modes: eventual, quorum, and
strong (see Chapter 22). With federated
discovery, discovery::list("caching") now returns cache instances
from all regions. The existing replication logic — which propagates
SET and DELETE operations to peers — automatically
extends across regions.
In eventual mode, cache writes are replicated asynchronously to all peers including those in remote regions. In quorum mode, the write waits for one peer acknowledgment, which will typically come from a local peer (since local responses arrive first). In strong mode, the write waits for all peers, including remote ones — which means it pays the full cross-region latency penalty.
Region Configuration
Each region is configured via a region.env file that is sourced
at startup. The only required setting is DISCOVERY_PEERS —
the WireGuard addresses of the other regions' discovery instances. This single
variable bootstraps the entire cross-region topology:
# SFO (10.0.0.1) — peers are NYC and AMS
REGION=sfo
DISCOVERY_PEERS=10.0.0.2:10200,10.0.0.3:10200
# NYC (10.0.0.2) — peers are SFO and AMS
REGION=nyc
DISCOVERY_PEERS=10.0.0.1:10200,10.0.0.3:10200
# AMS (10.0.0.3) — peers are SFO and NYC
REGION=ams
DISCOVERY_PEERS=10.0.0.1:10200,10.0.0.2:10200
Compare this with the previous approach, which required
separate STORAGE_PEERS and CACHE_PEERS environment
variables listing every remote instance of every service. Federated discovery
reduces the configuration surface to a single variable per region.
Latency Trade-offs
The inter-region latencies in our three-region deployment are roughly:
SFO ↔ NYC ~74ms RTT
SFO ↔ AMS ~165ms RTT
NYC ↔ AMS ~92ms RTT
These latencies make cross-region quorum consensus impractical for most
workloads. A storage write with W=2 quorum that includes a
remote peer would add 74–165ms to every write. Instead, our architecture
uses local quorum for consistency within a region (fast, sub-millisecond) and
asynchronous replication across regions (eventually consistent, but no latency
penalty on the write path).
This is the fundamental trade-off of global distribution: you can have strong consistency across regions or low-latency writes, but not both. Our system chooses low-latency writes with eventual cross-region consistency. For workloads that require stronger guarantees, the caching layer's strong consistency mode is available — at the cost of cross-region latency on every operation.