Chapter 35: Maintenance

Maintenance is the ongoing work required to keep the planetary-scale computer healthy. A personal computer can be taken offline for maintenance; a planetary-scale computer cannot, and must be maintained while it continues to serve users. This requires careful coordination between maintenance activities and the services running on the infrastructure.

Planned maintenance includes hardware replacement (swapping failed components), software updates (operating system patches, firmware updates), and capacity expansion (adding new servers and racks). Each activity must be scheduled to minimize its impact on running services: before a server is serviced, the scheduling and placement systems move its work onto other servers.
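The idea of moving work away from a server before servicing it can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's actual scheduler: the Server class, the drain function, and the "most free slots" placement policy are all hypothetical names and choices made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical model of a server; not part of any real scheduler API."""
    name: str
    capacity: int                               # maximum number of tasks it can hold
    tasks: list = field(default_factory=list)   # tasks currently placed here
    schedulable: bool = True                    # False once cordoned for maintenance

    def free_slots(self) -> int:
        return self.capacity - len(self.tasks)


def drain(target: Server, fleet: list) -> bool:
    """Move every task off `target` so that it can be serviced.

    Returns True if `target` is empty afterwards. Returns False, and moves
    nothing, if the rest of the fleet lacks the capacity to absorb the work.
    """
    others = [s for s in fleet if s is not target and s.schedulable]
    if sum(s.free_slots() for s in others) < len(target.tasks):
        return False  # not enough headroom: postpone the maintenance

    target.schedulable = False  # cordon the server so no new work lands on it
    while target.tasks:
        task = target.tasks.pop()
        # Assumed placement policy: pick the server with the most free slots.
        destination = max(others, key=Server.free_slots)
        destination.tasks.append(task)
    return True


if __name__ == "__main__":
    a = Server("a", capacity=4, tasks=["web-1", "web-2"])
    b = Server("b", capacity=4, tasks=["db-1"])
    c = Server("c", capacity=4)
    if drain(a, [a, b, c]):
        print(f"{a.name} is empty and can be taken down for maintenance")
```

The capacity check comes first so that a drain either completes or leaves the fleet untouched; a real placement system would also have to account for constraints such as memory, locality, and replicas of the same service landing on one server.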

Unplanned maintenance, the response to unexpected failures, is the more challenging scenario. When a server fails unexpectedly, the discovery and monitoring services detect the failure, the routing service stops sending traffic to the failed server, and, if that server was a leader, the consensus system elects a new one. The failed server is then repaired or replaced as part of the normal maintenance cycle.
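The sequence above (detect, stop routing, re-elect) can be sketched as a single monitoring pass. This is only an illustration under stated assumptions: failure detection is modeled as a missed heartbeat deadline, and the "election" simply picks the lowest-named healthy server, which is a stand-in for a real consensus protocol rather than an implementation of one. The function name and all parameters are hypothetical.

```python
def monitoring_pass(last_seen, now, timeout, routable, leader):
    """Run one failure-handling pass and return (failed, routable, leader).

    last_seen: dict mapping server name to the time of its last heartbeat
    now:       current time, in the same units as last_seen
    timeout:   silence longer than this marks a server as failed (assumed rule)
    routable:  set of servers currently receiving traffic
    leader:    name of the current leader, or None
    """
    # 1. Detection: a server is considered failed once its heartbeat is overdue.
    failed = {name for name, seen in last_seen.items() if now - seen > timeout}

    # 2. Routing: stop sending traffic to failed servers.
    routable = routable - failed

    # 3. Leadership: replace the leader only if it is among the failed servers.
    #    Placeholder election, NOT real consensus: lowest name wins.
    if leader in failed:
        leader = min(routable) if routable else None

    # The `failed` set doubles as the repair queue for the maintenance cycle.
    return failed, routable, leader


if __name__ == "__main__":
    heartbeats = {"s1": 100.0, "s2": 95.0, "s3": 99.0}
    failed, routable, leader = monitoring_pass(
        heartbeats, now=101.0, timeout=5.0,
        routable={"s1", "s2", "s3"}, leader="s2",
    )
    print("failed:", sorted(failed))
    print("routable:", sorted(routable))
    print("leader:", leader)
```

In a real system each of the three steps belongs to a separate service, and they do not observe the failure at the same instant; the single function here only shows the order in which the responsibilities apply.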