Chapter 19: Planning
Capacity planning is the practice of predicting future resource needs and ensuring that infrastructure is available before demand arrives. Running out of capacity is a crisis; having too much capacity is waste. Good planning threads the needle between these extremes.
Planning starts with data: historical utilization trends from the monitoring service, growth projections from product teams, and knowledge of upcoming launches or events that might cause traffic spikes. Simple trend extrapolation works for steady growth: if storage is, say, 75% full and growing 5% per month, it will be full in about six months, so that is when another server is needed. Step functions (a product launch that doubles traffic overnight) cannot be read off a trend line and require more explicit planning.
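The trend extrapolation above can be sketched in a few lines of Python. This is a minimal illustration, not a forecasting tool: it assumes growth compounds monthly at a constant rate, and the function name `months_until_full` and the 75% starting utilization are hypothetical values chosen for the example.

```python
import math


def months_until_full(current_utilization: float, monthly_growth: float) -> float:
    """Estimate months until utilization reaches 100% of capacity.

    Assumes steady compound growth: utilization is multiplied by
    (1 + monthly_growth) every month. Utilization is a fraction (0.75 = 75%).
    """
    if not 0 < current_utilization <= 1:
        raise ValueError("current_utilization must be in the range (0, 1]")
    if monthly_growth <= 0:
        # With flat or shrinking usage, capacity is never exhausted.
        return math.inf
    # Solve current * (1 + g)^n = 1 for n.
    return math.log(1 / current_utilization) / math.log(1 + monthly_growth)


# Hypothetical example: storage at 75% full, growing 5% per month.
print(f"{months_until_full(0.75, 0.05):.1f} months of runway")
```

A model like this only holds while growth stays steady. A step change such as a launch has to be added to the forecast by hand.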
Lead times are critical. If ordering new hardware takes three months and traffic is growing 10% per month, demand will be about a third higher by the time the hardware arrives (1.1^3 ≈ 1.33). You must therefore order while you are still at around 70% of capacity, not when you run out; about 75% is the hard limit, and ordering at 70% leaves a small margin. Cloud infrastructure shortens lead times (new servers in minutes instead of months) but introduces its own planning challenges around cost management and reserved capacity.
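The lead-time arithmetic can be made explicit. The sketch below assumes compound monthly growth; the function name `order_threshold` and the `safety_margin` parameter are hypothetical, introduced only for illustration. Under these assumptions, three months of lead time at 10% monthly growth gives a break-even utilization of about 75%, so ordering at around 70% leaves some slack.

```python
def order_threshold(monthly_growth: float, lead_time_months: float,
                    safety_margin: float = 0.0) -> float:
    """Return the highest utilization (as a fraction) at which an order
    placed today still arrives before capacity is exhausted.

    Assumes demand is multiplied by (1 + monthly_growth) each month.
    safety_margin is the fraction of capacity to keep free on arrival day.
    """
    if monthly_growth < 0 or lead_time_months < 0:
        raise ValueError("growth and lead time must be non-negative")
    if not 0 <= safety_margin < 1:
        raise ValueError("safety_margin must be in the range [0, 1)")
    growth_during_lead_time = (1 + monthly_growth) ** lead_time_months
    return (1 - safety_margin) / growth_during_lead_time


# Three-month lead time, 10% monthly growth.
print(f"break-even: {order_threshold(0.10, 3):.1%}")
print(f"with 5% margin: {order_threshold(0.10, 3, safety_margin=0.05):.1%}")
```

The same function shows why short lead times matter: with a lead time near zero the threshold approaches 100%, which is roughly the situation cloud provisioning creates.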
The best capacity plans include a buffer for the unexpected: a traffic spike from a viral event, a hardware failure that reduces effective capacity, or a dependency that becomes slower and backs up queues. A common rule of thumb is to keep 20-30% headroom above expected peak utilization.
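As a rough illustration of the headroom rule, the sketch below reads "headroom above expected peak" as provisioning capacity equal to the expected peak times (1 + headroom). That reading is an assumption; some teams instead cap peak utilization at a fraction of capacity, which gives slightly different numbers. The function names and the request-rate figures are hypothetical.

```python
def required_capacity(expected_peak: float, headroom: float = 0.25) -> float:
    """Capacity to provision so that `headroom` (a fraction of the
    expected peak) remains free when the peak arrives."""
    if expected_peak < 0 or headroom < 0:
        raise ValueError("expected_peak and headroom must be non-negative")
    return expected_peak * (1 + headroom)


def has_headroom(capacity: float, expected_peak: float,
                 headroom: float = 0.25) -> bool:
    """Check whether existing capacity already satisfies the rule."""
    return capacity >= required_capacity(expected_peak, headroom)


# Hypothetical service with an expected peak of 8,000 requests per second.
print(required_capacity(8000, 0.25))    # capacity needed for 25% headroom
print(has_headroom(9000, 8000, 0.25))   # is 9,000 req/s enough?
```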