Constant Time Recovery in Azure SQL Database

  • Panagiotis Antonopoulos
  • Peter Byrne
  • Wayne Chen
  • Cristian Diaconu
  • Raghavendra Thallam Kodandaramaih
  • Hanuma Kodavalla
  • Prashanth Purnananda
  • Adrian-Leonard Radu
  • Chaitanya Sreenivas Ravella
  • Girish Mittur Venkataramanappa

Azure SQL Database and the upcoming release of SQL Server introduce a novel database recovery mechanism that combines traditional ARIES recovery with multi-version concurrency control to achieve database recovery in constant time, regardless of the size of user transactions. Additionally, our algorithm enables continuous transaction log truncation, even in the presence of long-running transactions, thereby allowing large data modifications using only a small, constant amount of log space. These capabilities are particularly important for any Cloud database service given a) the constantly increasing database sizes, b) the frequent failures of commodity hardware, c) the strict availability requirements of modern, global applications, and d) the fact that software upgrades and other maintenance tasks are managed by the Cloud platform, introducing unexpected failures for the users. This paper describes the design of our recovery algorithm and demonstrates how it allowed us to improve the availability of Azure SQL Database by guaranteeing consistent recovery times of under 3 minutes for 99.999% of recovery cases in production.