The Code That No One in the Cloud Can Live Without
A couple of years ago, a few Microsoft researchers published two interesting papers on storage efficiency. Now, with breathtaking speed, the concepts in those papers have been embraced across the cloud-computing world.
In November 2012, Parikshit Gopalan, Cheng Huang, and Sergey Yekhanin, along with Huseyin Simitci of Windows Azure Storage (now Microsoft Azure Storage), published their paper On the Locality of Codeword Symbols in IEEE Transactions on Information Theory.
Erasure-Coding Theory Paper Gains Acclaim
During ISIT 2014, the IEEE International Symposium on Information Theory, being held June 29-July 4 in Honolulu, the authors of that paper received the IEEE Communications Society & Information Theory Society Joint Paper Award. The honor goes to outstanding papers published in a publication of the Communications Society or the Information Theory Society within the previous three calendar years.
The winning paper is an in-depth theoretical study of relations between code parameters needed for data-storage applications. Erasure Coding in Windows Azure Storage—an earlier systems paper written by Microsoft’s Huang, Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Gopalan, Jin Li, and Yekhanin—pointed the way for a new method to achieve more efficient storage in the cloud. It, too, garnered plenty of attention, winning a best-paper award during the 2012 USENIX Annual Technical Conference. The co-authors of the Erasure Coding paper also earned a Microsoft Technical Community Network Storage Technical Achievement Award in 2013 for outstanding achievement and contribution to Microsoft software technology.
“The project started with Cheng and Jin Li having the idea that Azure might benefit from adopting some new kind of erasure codes,” Yekhanin says. “Parikshit and I joined the team. Together, we developed an abstract mathematical framework that captures the requirements for codes that arise in distributed storage applications.
“The key difference from the classical coding-theory setup is that here, we want codes that provide ‘locality’, the ability to recover lost data quickly in typical failure scenarios. We designed codes, later adopted by Azure, and also proved that our codes are optimal in a certain strict mathematical sense.”
Gopalan, who will be traveling to Honolulu to accept the Joint Paper Award, provides a bit more detail.
“This is very much a theory paper,” he says. “Its main contribution is a lower bound, saying that any code with certain properties needs to have certain length. But it arose from a very practical setting: We were trying to show that a particular code we had suggested to Azure was optimal.
“It is gratifying that a lower bound—a result saying that you cannot do any better than something—should have such practical significance. The results there turn out to have real predictive power in telling us what properties are achievable by code and what are not. It certainly helped in our interactions with product groups to be able to say that our constructions are provably optimal.”
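For readers who want to see the shape of such a result, the bound is commonly stated along the following lines. The notation here is an assumption for illustration and does not come from this article: n is the code length (total fragments), k the number of data fragments, d the minimum distance (the code survives any d - 1 lost fragments), and r the locality (how many fragments must be read to rebuild one lost data fragment).

```latex
% Lower bound on redundancy for a code with information locality r
% (symbols n, k, d, r as defined in the paragraph above)
n - k \;\ge\; \left\lceil \frac{k}{r} \right\rceil + d - 2
```

As a rough check, taking the configuration described in the Erasure Coding in Windows Azure Storage paper as 12 data fragments, 2 local parities, and 2 global parities (n = 16, k = 12, r = 6, d = 4) gives 4 on both sides, which is the sense in which a construction can meet the bound exactly.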
Local Reconstruction Codes
Erasure coding, a powerful mathematical tool that reduces the space required to store data reliably, keeps compact parity fragments alongside the data so that lost pieces can be reassembled and delivered to users. Local Reconstruction Codes (LRC) enable quicker data reconstruction, which reduces the time and cost of data retrieval.
LRC has been adopted throughout all Microsoft storage production lines, from the cloud to enterprise and the desktop. It was first deployed in Azure Storage in 2012, to great acclaim. In 2013, LRC also shipped with Windows Server 2012 R2 and Windows 8.1.
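The idea of local reconstruction can be illustrated with a toy sketch. This is not the code used in Azure: it assumes six data fragments split into two groups of three, uses plain XOR for each group's local parity, and leaves out the global parities that a real LRC adds to survive multiple simultaneous failures. The function names are hypothetical.

```python
from functools import reduce


def xor_bytes(blocks):
    """XOR a list of equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode_local_parities(fragments, group_size):
    """Compute one XOR parity per local group of data fragments (toy scheme)."""
    groups = [fragments[i:i + group_size]
              for i in range(0, len(fragments), group_size)]
    return [xor_bytes(group) for group in groups]


def reconstruct(fragments, parities, lost_index, group_size):
    """Rebuild one lost fragment by reading only its own local group."""
    group = lost_index // group_size
    start = group * group_size
    end = min(start + group_size, len(fragments))
    # Read the surviving fragments of the group; the lost one is never touched.
    survivors = [fragments[i] for i in range(start, end) if i != lost_index]
    return xor_bytes(survivors + [parities[group]])


# Six equal-length data fragments in two local groups of three (assumed layout).
data = [b"frag0", b"frag1", b"frag2", b"frag3", b"frag4", b"frag5"]
parities = encode_local_parities(data, group_size=3)

# Pretend fragment 4 is lost and rebuild it from fragments 3 and 5 plus
# the second group's parity.
recovered = reconstruct(data, parities, lost_index=4, group_size=3)
```

In this sketch, rebuilding one fragment reads three pieces (two surviving fragments and one parity) instead of all six data fragments, which is the kind of saving in reconstruction cost that locality is meant to provide.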
Huang cites great collaboration with partners from Windows storage teams.
“We are very lucky to work with fantastic business partners, the Azure Storage team and the Windows Storage Spaces team. LRC wouldn’t have gone anywhere without them taking a leap of faith and making their contributions.”
At first, erasure coding was a solution for a problem that didn’t exist. But when it eventually did …
“It took a long time to bear fruit, from research to production,” Huang says. “When we started exploring this direction and published earlier papers in 2007, there was very little interest from business groups. Literally, every team we talked to told us that disks were getting bigger and cheaper every day. Redundancy could be easily achieved with replication, and there was no need to bother with erasure coding.
“It is very telling to see how cloud computing has completely turned the world around in several years. Now, the industry is at a point that no one in the cloud business—not only Microsoft, but also Amazon, Facebook, Google, and others—can be competitive or even survive without erasure coding.”