Outlook.com left preview a few weeks ago, and as part of that, we shared that we’d start to upgrade the hundreds of millions of people using Hotmail to the new, modern Outlook.com experience. We had done multiple pilots during the preview period and learned a ton. Overall, the upgrade has been going very well: people have upgraded much faster than we expected. The vast majority of people using our services have had a smooth experience during this time and are enjoying the new Outlook.com experience. That said, we had an issue yesterday and want to give you a deeper look at what happened.
Before we dive into the details, we want to sincerely apologize to anyone who was unable to access their email during the interruption. We take outages very seriously and invest significant time and energy in preventing them.
Root cause analysis
At 1:35 PM PDT on March 12th, 2013, there was a service interruption that affected some people’s access to a small part of the SkyDrive service, but primarily Hotmail.com and Outlook.com. Availability was restored over the course of the afternoon and evening, and fully restored by 5:43 AM PDT on March 13th, 2013.
On the afternoon of the 12th, in one physical region of one of our datacenters, we performed our regular process of updating the firmware on a core part of our physical plant. This update had been performed successfully before, but in this specific instance it failed in an unexpected way. The failure resulted in a rapid and substantial temperature spike in the datacenter. Before it was mitigated, the spike was significant enough to trigger our safeguards on a large number of servers in this part of the datacenter.
These safeguards prevented access to mailboxes housed on these servers and also prevented other parts of our infrastructure from automatically failing over to allow continued access. This area of the datacenter houses parts of the Hotmail.com, Outlook.com, and SkyDrive infrastructure, so some people trying to access those services were impacted.
Details of impact and restoration
Once the safeguards kicked in on these systems, the team was alerted immediately and began working to restore access. Given the failure scenario, a mix of infrastructure software and human intervention was needed to bring the core infrastructure back online. Requiring this kind of human intervention is not the norm for our services, and it added significant time to the restoration.
From that point onward, the team brought back access in waves throughout the evening. The majority of the impacted mailboxes were fully restored before midnight and the rest completed by 5:30 AM.
We hope this helps provide an understanding of the incident. Again, we sincerely apologize for and regret the impact this outage had on all of you. Now that we’re through the resolution, we’re also hard at work on ensuring this doesn’t happen again.
https://status.live.com is always the best and most reliable way to get real-time information about any service issues we are encountering, and when you are signed in, it is customized based on the health of your specific account.
—Arthur de Haan, Vice President