Measure Twice, Cut Once, With RMA Methodology

I’ve been beating our drum for a while now about the inevitability of failure in cloud-based systems. Simply put, the complexities and interdependencies of the cloud make it nearly impossible to avoid service failure, so instead we have to go against our instincts and actually design for this eventuality.

Once you accept this basic premise, the next question is how exactly we need to change our design processes. The Resilience Modeling and Analysis (RMA) methodology is a key part of the answer.

RMA brings the master carpenter’s “measure twice, cut once” philosophy to engineering. The goal is to help ensure teams think through as many potential reliability-related issues as possible before committing code to production. The aim is not to prevent every single failure mode, but to limit the impact failures could have on customers if they occur.

Designing for Failure: The Changing Face of Reliability

I’ve written about reliability and resilience before, but the topic is so important that it’s worth revisiting, using an example from the real world I think you’ll appreciate.

Imagine the pressure the architects and engineers were under when they designed and built the Channel Tunnel connecting England to France via rail. The so-called “Chunnel” would have to safely transport millions of people a year at speeds over 160 kilometers per hour, across 37.9 undersea kilometers.

With so many lives at stake, the designers had to eliminate all possibility of failure. Wrong. In fact, in building the Channel Tunnel, the designers expected individual components to fail. That’s why they built three interconnected tunnels: two to accommodate rail traffic, and one in the middle for maintenance that can also serve as an emergency escape route if needed.

Antifragility – the goal for high-performance IT organizations

In a recent post, I shared a short list of my favorite books and articles related to reliability. Each one has influenced my thinking about how to create a high-performing IT organization, even though not all of these publications are IT-centric in subject matter. In this post, I’m going to take a closer look at “Antifragile”, the 2012 book written by Nassim Nicholas Taleb, and describe why I think the concept of antifragility is particularly applicable to cloud computing.

My “Desert Island Half-Dozen” – recommended reading for resilience

Reliability Series #4: Reliability-enhancing techniques (Part 2)

In my previous post in this series, I discussed the “Discovery” and “Authorization/Authentication” categories of the “DIAL” acronym and shared mitigations targeting specific failure modes for each. In this article I’ll discuss the remaining “Limits/Latency” and “Incorrectness” categories, and I’ll again share example mitigations targeting specific failure modes for each.

Reliability Series #3: Reliability-enhancing techniques (Part 1)


In my previous post, I discussed “DIAL”, an approach we use to categorize common service component interaction failures when applying Resilience Modeling and Analysis (RMA) to an online service design. In the next two posts, I’ll discuss some mitigation strategies and design patterns intended to reduce the likelihood of the types of failures described by “DIAL”.

Reliability Series #2: Categorizing reliability threats to your service

Reliability Series #1: Reliability vs. resilience

Want more information on Trustworthy Computing? Check out our other blogs

Suggested Resolutions for Cloud Providers in 2014 #1: Reinforce that security is a shared responsibility
