Chapter 8: Storage

All data in a running system lives in memory — fast, volatile, and finite. When a process crashes or a server loses power, that data vanishes. A storage service provides durable data persistence: the guarantee that data written to the service will survive process restarts, hardware failures, and power outages. Storage is the bedrock on which stateful systems are built.

The design of a storage engine involves balancing competing demands. Reads should be fast (ideally served from memory). Writes should be durable (safely on disk before acknowledging). Space should be used efficiently (old data compacted away). And the whole system should recover quickly after a crash. Our implementation addresses each of these concerns with a classic technique: the write-ahead log.

Interface

storage/src/lib.rs The storage service provides four procedures. Get and put form the basic key–value interface for reading and writing data. Delete removes a key, and scan retrieves multiple entries matching a prefix, useful for range queries and enumeration.

pub const GET_PROCEDURE: ProcedureId = 1;
pub const PUT_PROCEDURE: ProcedureId = 2;
pub const DELETE_PROCEDURE: ProcedureId = 3;
pub const SCAN_PROCEDURE: ProcedureId = 4;

#[derive(Debug, Serializable, Deserializable)]
pub struct GetArgs {
    pub key: String,
}

#[derive(Debug, Serializable, Deserializable)]
pub struct GetResult {
    pub value: String,
    pub found: i32,
}

#[derive(Debug, Serializable, Deserializable)]
pub struct PutArgs {
    pub key: String,
    pub value: String,
}

#[derive(Debug, Serializable, Deserializable)]
pub struct DeleteArgs {
    pub key: String,
}

#[derive(Debug, Serializable, Deserializable)]
pub struct ScanArgs {
    pub prefix: String,
    pub limit: i32,
}

#[derive(Debug, Serializable, Deserializable)]
pub struct ScanResult {
    pub entries: String,
}

Like the caching service, the GetResult includes a found field to distinguish between a missing key and an empty value. The ScanResult returns entries as semicolon-delimited key=value pairs, keeping the serialization format simple while supporting multi-entry responses.

Implementation

storage/src/engine.rs The heart of the storage service is its StorageEngine. The engine maintains an in-memory HashMap for fast reads, a write-ahead log (WAL) for durability, and a snapshot mechanism for efficient recovery. This combination is a classic pattern found in databases from SQLite to PostgreSQL.

pub struct StorageEngine {
    data: HashMap<String, String>,
    wal_path: PathBuf,
    snapshot_path: PathBuf,
    operations_since_snapshot: usize,
}

const COMPACTION_THRESHOLD: usize = 1000;

The write-ahead log is the key to durability. Every write operation — whether a put or a delete — is first appended to the WAL file on disk before the in-memory data structure is updated. This ensures that even if the process crashes immediately after acknowledging a write, the operation can be recovered from the log:

fn append_wal(&mut self, entry: &str) {
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open(&self.wal_path)
        .expect("Failed to open WAL");
    writeln!(file, "{}", entry).expect("Failed to write to WAL");
    self.operations_since_snapshot += 1;
}

pub fn put(&mut self, key: String, value: String) {
    self.append_wal(&format!("PUT {}={}", key, value));
    self.data.insert(key, value);
}

pub fn delete(&mut self, key: &str) {
    self.append_wal(&format!("DELETE {}", key));
    self.data.remove(key);
}

Recovery reconstructs the in-memory state by first loading the most recent snapshot (if one exists) and then replaying any WAL entries that were written after the snapshot. This two-phase recovery ensures both speed (loading a snapshot is faster than replaying the entire history) and completeness (the WAL captures any operations since the last snapshot):

fn recover(&mut self) {
    // First, load snapshot if it exists
    if self.snapshot_path.exists() {
        if let Ok(file) = File::open(&self.snapshot_path) {
            let reader = BufReader::new(file);
            for line in reader.lines() {
                if let Ok(line) = line {
                    let parts: Vec<&str> = line.splitn(2, '=').collect();
                    if parts.len() == 2 {
                        self.data.insert(
                            parts[0].to_string(),
                            parts[1].to_string(),
                        );
                    }
                }
            }
        }
    }

    // Then, replay WAL on top
    if self.wal_path.exists() {
        if let Ok(file) = File::open(&self.wal_path) {
            let reader = BufReader::new(file);
            for line in reader.lines() {
                if let Ok(line) = line {
                    self.apply_wal_entry(&line);
                }
            }
        }
    }
}

Over time, the WAL grows as operations accumulate. Compaction solves this by writing a snapshot of the current in-memory state and then truncating the WAL. A background task triggers compaction after a threshold number of operations:

pub fn compact(&mut self) {
    if self.operations_since_snapshot < COMPACTION_THRESHOLD {
        return;
    }

    // Write snapshot
    let mut file = File::create(&self.snapshot_path)
        .expect("Failed to create snapshot");
    for (key, value) in &self.data {
        writeln!(file, "{}={}", key, value)
            .expect("Failed to write snapshot");
    }

    // Truncate WAL
    File::create(&self.wal_path).expect("Failed to truncate WAL");
    self.operations_since_snapshot = 0;
}

The scan operation supports prefix-based range queries, sorting results lexicographically and respecting a limit parameter to prevent unbounded responses:

pub fn scan(&self, prefix: &str, limit: i32) -> Vec<(String, String)> {
    let mut results: Vec<(String, String)> = self.data.iter()
        .filter(|(k, _)| k.starts_with(prefix))
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
    results.sort_by(|a, b| a.0.cmp(&b.0));
    if limit > 0 {
        results.truncate(limit as usize);
    }
    results
}

Design Discussion

The write-ahead log pattern provides a strong durability guarantee: data is on disk before the write is acknowledged. However, our implementation writes to a single file on a single disk. A production storage service would replicate data across multiple servers to survive disk and machine failures. Techniques like chain replication or consensus-based replication (using a protocol like Raft or Paxos) can provide this.

The in-memory HashMap provides O(1) reads but limits the total data size to available memory. For larger datasets, storage engines use on-disk data structures like LSM trees (Log-Structured Merge trees) or B-trees that can store far more data than fits in memory while still providing fast reads through caching and indexing.

Compaction is a form of garbage collection for the WAL. Without it, the log would grow without bound, making recovery slower over time. The compaction threshold of 1,000 operations represents a trade-off: more frequent compaction keeps the WAL small but consumes more I/O; less frequent compaction reduces I/O but risks longer recovery times.

The scan operation's prefix-based filtering is a versatile primitive. It enables patterns like namespacing (all keys for a user under user:{id}:), range queries (all entries in a time window), and enumeration (listing all keys in a collection). Our highlight system uses this pattern to store per-user, per-page highlights under keys like hl:{user_id}:{page_slug}.