Hi folks, Eric Bidstrup here.


As I touched on in my December posting on Common Criteria, and as Michael Howard discussed in his post on security metrics, trying to objectively quantify and measure “How secure is secure” is far more difficult than one might think. I’d like to share my perspective that there are two “dimensions” useful to consider when characterizing software security metrics: security functional requirements and security engineering quality requirements. While the SDL is focused primarily (but not exclusively) on the latter, both are ultimately important when assessing the security of a given bit of software. However, for reasons I’ll elaborate on below, the SDL does focus on trying to prevent the most common causes of vulnerabilities today and hence looking at the ways in which Microsoft tracks and measures individual products teams’ compliance with SDL requirements offers some interesting fodder for the security metrics debate. I’m not offering a complete solution, but am sharing our experience at Microsoft with measuring how development teams actually follow the SDL. It’s helped us deliver more secure software, and sharing this will hopefully help others as well as putting more data on the table for consideration when discussing security metrics.


Putting aside computer security for just a moment, it’s interesting to look at other ways in which we attempt to measure security in our society. Padlocks offer security protections, and organizations such as the American Standard for Testing and Materials (ASTM) provide standards like F883-04 Standard Performance Specification for Padlocks that characterize padlock security ratings. Prisons provide security protections as well. Prisoners reside in different facilities that vary by security level. The US Bureau of Prisons uses a numbered scale from one to six to represent the security level. Both of these examples are similar in that the threats and risks each of them must protect against are reasonably well understood and relatively static (meaning the threats don’t change much over time). Computer security is still evolving with new classes of attacks still being discovered, and while hackers understand how to exploit known types of vulnerabilities – software developers are still catching up in learning how to modify engineering practices to be resilient against both new and old types of attacks. Hence, metrics are more challenging for computer security.


Several attempts have been made by governments to come up with a security rating system similar to the examples listed above. In the 1980’s, the US Department of Defense created the “Trusted Computer System Evaluation Criteria (TCSEC)” that tried to establish a standard for measure operating system security. The “Orange Book” offered a relatively simple system for assigning “score” summarized below:

D (Minimal Protection)

C (Discretionary Protection)

C1: Discretionary Security Protection

C2: Controlled Access Protection

B (Mandatory Protection)

B1: Labeled Security Protection

B2: Structured Protection

B3: Security Domains

A (Verified Protection)

A1: Verified Design



In the 1990’s, the US and other nations combined their efforts to create an international security standard for software known as the Common Criteria (ISO 15408). Common Criteria also has a rating system that scores products with “evaluation assurance levels” (EALs):


EAL 1: Functionally Tested

EAL 2: Structurally Tested 

EAL 3: Methodically Tested and Checked 

EAL 4: Methodically Designed, Tested, and Reviewed 

EAL 5: Semi-formally Designed and Tested 

EAL 6: Semi-formally Verified Design and Tested 

EAL 7: Formally Verified Design and Tested 


Both TCSEC and Common Criteria (CC) are primarily focused on “security functional requirements” (as called out earlier, distinct from “security engineering quality requirements”). The EALs reflect the amount of rigor and attention to claimed security functional requirements a developer applied while creating a product. Furthermore, the EALs also reflect increasing levels of effort and resources necessary by anyone reviewing a product in order to evaluate the product’s claimed security functional requirements. However, EAL ratings for commercial products have historically not correlated with the number of vulnerabilities found in commercial products after release. As I discussed in my December posting on Common Criteria, this is because CC is primarily focused on “security functional requirements” and fails to adequately address “security engineering quality requirements”.  This leads a question on how to measure those aspects of software security that earlier efforts have been unable to successfully address. 


Microsoft has been releasing security bulletins since 1999. Based on some informal analysis that members of our organization have done, we believe well over 50% of *all* security bulletins have resulted from implementation vulnerabilities and by some estimates as high as 70-80%. (Some cases are questionable and we debate if they are truly “implementation issues” vs. “design issues” – hence this metric isn’t precise, but still useful). I have also heard similar ratios described in casual discussions with other software developers. In other words, most vulnerabilities can be addressed by the “security engineering quality requirements” described via SDL. This is not to say that “security functional requirements” are unimportant or that SDL ignores secure design (as Adam has described in his threat modeling series), but rather that it is not where vulnerabilities are being most frequently encountered. With SDL, we adopt a pragmatic approach in looking at identifying the root causes of security vulnerabilities, and trying to prevent those root causes from reoccurring. The challenge lies in how we actually validate that development teams are indeed adopting and executing whatever changes SDL requires in engineering (either in terms of process or tools). Process changes are often difficult to quantify, as we must rely upon development teams truthfully attesting they have followed the process. As long as development teams believe the process results in better code, they generally will adopt and follow such practices. Tool usage becomes more interesting and valuable in that using tools becomes a vehicle for objectively and independently verifying if code satisfies requirements or not. But that is just the tip of the iceberg…


As I said above in my comments on EALs, the amount of time required by anyone reviewing a product to assess “security” is relevant since security review can be a very time and resource intensive activity. However, running static code analysis tools, verifying build tools and switches, searching for banned APIs, and recording the output of other tools that inspect code and/or binaries for potential implementation vulnerabilities is a key element in how we approach the challenge of trying to measure compliance with SDL requirements from product groups at Microsoft today. While not every technique required by SDL has a corresponding tool, we try to provide both tools and automation if and wherever possible. There is still much work to be done in terms of standardizing tool output formats and creating automation to assess tool output. However, these “grass roots” metrics derive from practical experience of changing engineering requirements based on actual vulnerabilities. We look objectively at what is causing vulnerabilities, and target solutions to address the root causes of those issues. As the saying goes, “If it hurts when you do that, stop doing that”. If what we have done in the past has hurt our customers by creating vulnerabilities requiring security bulletins, we want to stop doing that. J


The challenge in using a plethora of individual detailed metrics such as I describe above (that we do internally at Microsoft for measuring SDL compliance), is that they don’t roll up into a nice aggregate score that customers can easily understand.  However, they have translated into reduced numbers of vulnerabilities as Michael Howard wrote a few weeks ago. Coupling these types of scores with assessment of compliance with “security functional requirements” might be the basis for coming up with a metric that is useful to customers, both in the government and private sector.


What do you think?