I’ve been beating our drum for a while now about the inevitability of failure in cloud-based systems. Simply put, the complexities and interdependencies of the cloud make it nearly impossible to avoid service failure, so instead we have to go against our instincts and actually design for this eventuality.
Once you accept this basic premise, the next question is how exactly do we need to change our design processes? The Resilience Modeling and Analysis (RMA) methodology is a key part of the answer.
RMA brings the master carpenter’s “measure twice, cut once” philosophy to engineering. The goal is to help ensure teams think through as many of the potential reliability-related issues as possible before committing code to production. The aim is not to prevent every single failure mode, but to limit the impact a failure could have on customers if it occurs. Read more >>
I’ve written about reliability and resilience before, but the topic is so important that it’s worth revisiting, using a real-world example I think you’ll appreciate.
Imagine the pressure the architects and engineers were under when they designed and built the Channel Tunnel connecting England to France via rail. The so-called “Chunnel” would have to safely transport millions of people a year at speeds over 160 kilometers per hour, across 37.9 undersea kilometers.
With so many lives at stake, the designers had to eliminate all possibility of failure. Wrong. In fact, in building the Channel Tunnel, the designers expected failure of individual components. That’s why they built three interconnected tunnels: two to carry rail traffic, and one in the middle for maintenance that also serves as an emergency escape route if needed. See more >>
In a recent post, I shared a short list of my favorite books and articles related to reliability. Each one has influenced my thinking about how to create a high-performing IT organization, even though not all of these publications are IT-centric in subject matter. In this post, I’m going to take a closer look at “Antifragile”, the 2012 book written by Nassim Nicholas Taleb, and describe why I think the concept of antifragility is particularly applicable to cloud computing. See more >>
When I speak with customers, they often ask how they can successfully change the culture of their IT organization when deciding to implement a resilience engineering practice. Over the past decade I’ve collected a number of books and articles that I have found helpful in this regard, and I often recommend these resources to customers. I’ve included my favorites below, in no particular order, with a short explanation of why I’m recommending them. See more >>
In my previous post in this series, I discussed the Discovery and Authorization/Authentication categories of the “DIAL” acronym to share mitigations targeting specific failure modes. In this article I’ll discuss the “Limits/Latency” and “Incorrectness” categories represented by the “DIAL” acronym, and I’ll also share example mitigations targeting specific failure modes for each. See more >>
In my previous post, I discussed “DIAL”, an approach we use to categorize common service component interaction failures when applying Resilience Modeling & Analysis (RMA) to an online service design. In the next two posts, I’ll discuss some mitigation strategies and design patterns intended to reduce the likelihood of the types of failures described by “DIAL”. See more >>
Online services face ongoing reliability-related threats represented by device failures, latent flaws in software being triggered by environmental change, and mistakes made by human beings. At Microsoft, one of the ways we’re helping to improve the reliability of our services is by investing in resilience modeling and analysis (RMA) as a way for online service engineering teams to incorporate robust resilience design into the development lifecycle. See more>>
Whenever I speak to customers and partners about reliability, I’m reminded that while objectives and priorities differ between organizations and customers, at the end of the day, everyone wants their service to work. As a customer, you want to be able to do things online, at a time convenient to you. As an organization, or a provider of a service, you want your customers to be able to carry out the tasks they want to, whenever they want to.
This article is the first in a four-part series on building a resilient service. In my first two posts, I will discuss the topic as it relates to business strategy, and then we'll dive deeper into the technical details. See more >>
The Trustworthy Computing blog covers Microsoft’s perspective on security, privacy, online safety, and reliability, especially as they relate to the cloud.
For readers who want additional information on those topics, check out our other TwC Blogs, which provide insights from Microsoft experts, plus information on mitigation tools, secure development, security updates, online safety, and more. Read more >>
Suggested Resolutions for Cloud Providers in 2014 #1: Reinforce that security is a shared responsibility
Happy 2014! The arrival of a new year is always a great time to reflect on where you’ve been over the past 12 months, and more importantly, where you are headed. I was recently asked to share some New Year’s Resolutions for cloud providers for an article in Security Week and I thought I’d expand a bit more on those and share them with you.
Let’s start with Suggested Resolution #1: Reinforce that security is a shared responsibility.