The cost and complexity of administration of large systems have come to dominate their total cost of ownership. Stateless and soft-state components, such as Web servers or network routers, are easy to manage: capacity can be scaled incrementally by adding more nodes, rebalancing load after failover is easy, and reactive or proactive (“rolling”) reboots can be used to handle transient failures. We show that it is possible to achieve the same ease of management for the state-storage subsystem by subdividing persistent state according to the specific guarantees needed by each type. While other systems have addressed persistent-until-deleted state, we describe SSM, a store for a previously unaddressed class of state (user-session state) that exhibits the same manageability properties as stateless nodes while providing firm storage guarantees. Any node can be proactively or reactively rebooted at any time to recover from transient faults, without impacting online performance or losing data. We exploit this simplified manageability by pairing SSM with an application-generic, statistical-anomaly-based framework that detects crashes, hangs, and performance failures, and automatically attempts to recover from them by rebooting faulty nodes. Although the detection techniques generate some false positives, the cost of recovery is so low that the false positives have low impact. We provide microbenchmarks to demonstrate SSM’s built-in overload protection, failure management, and self-tuning. We benchmark SSM integrated into a production enterprise-scale interactive service to demonstrate that these benefits need not come at the cost of significantly decreased throughput or response time.