Scott Lambert here. I work on the Security Engineering Tools team where we’re responsible for researching, developing and publishing tools to internal product and service teams. These include fuzzing, binary analysis and attack surface analysis tools.
Previously, James Whittaker posted a blog entry on Testing in the SDL in which he mentioned that many folks equate fuzz testing with security testing. While fuzz testing doesn’t come close to describing how security testing is done at Microsoft, it does happen to be one of our most scalable approaches to detecting program failures that may have security implications.
As Michael Howard has pointed out before, we do our best to ensure that the SDL incorporates lessons learned from vulnerabilities that required us to release security updates. It turns out that the animated cursor bug patched in MS07-017 had a positive impact on the automatic triaging our fuzz testing tools perform. In this post, I’d like to shed some light on how we monitor for program failures when fuzzing parsers and how the recent animated cursor bug, MS07-017, caused us to revisit and ultimately improve our fuzzing tools.
For our purposes, fuzz testing is a method for finding program failures (code errors) by supplying malformed input data to program interfaces (entry points) that parse and consume this data (e.g. file, network, registry, shared memory parsers). At Microsoft, we view fuzz testing as six distinct stages in which the output of each stage can impact or influence both the current and next iteration through the stages (e.g. after completing analysis work in stage 5 you could decide to change how you malform and deliver fuzzed data [stage 2 and 3], which exceptions get logged [stage 4], which tests you re-run [stage 6] and even which parsers you might decide to go after next [stage 1], etc). Below is a brief listing of each stage and its associated tasks.
Stage 1: Prerequisites
- Identifying the targets (program interfaces to fuzz)
- Prioritizing your efforts (test planning)
- Setting Bug Bar
Stage 2: Creation of fuzzed data (malformed data)
- Will we be format-aware (e.g. most files follow a format)? Context-aware (e.g. order and/or timing of data may be important)?
- Will we use existing data (mutation) or generate it from scratch (generation)?
- Will the malformations we apply be based on type? Use interesting patterns? Over how many bits/bytes?
- Will we apply malformations with or without restriction? Are we going to be deterministic or random or both? How many times in a single iteration do we apply any given malformation?
Stage 3: Delivery of fuzzed data to the application under test
- Determining the best method to get the application under test to consume the fuzzed data (e.g. load path from cmd-line or GUI; API hooking; MITM proxies; DLL redirection; in-memory start-stop-rewind, etc)
- Implementing the appropriate delivery mechanism and conducting the test
Stage 4: Monitoring of application under test for signs of failure
- What should we look for?
- What do we do when we see it?
Stage 5: Triaging Results
- How can we classify and analyze issues found?
Stage 6: Identify root cause, fix bugs, rerun failures, analyze coverage data (rinse and repeat)
How we do file fuzzing
There are a number of approaches taken by product teams to meet the SDL file fuzzing requirements. They often include the use of generation and mutation-based fuzzers as well as a combination of multiple internal and externally available fuzzing tools and/or frameworks.
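To make the mutation side of this concrete, here is a minimal sketch in Python of two malformations of the kind listed under Stage 2: flipping random bits and overwriting a 4-byte field with an "interesting" boundary value. The function names, the list of boundary values and the seeded random generator are assumptions made for illustration; this is not a description of any internal tool.

```python
import random

# Boundary values that often expose integer-handling bugs (illustrative list).
INTERESTING_32 = [0x00000000, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]


def flip_bits(data: bytes, rng: random.Random, count: int = 1) -> bytes:
    """Return a copy of data with `count` randomly chosen bit flips applied."""
    buf = bytearray(data)
    if not buf:
        return bytes(buf)
    for _ in range(count):
        pos = rng.randrange(len(buf) * 8)   # pick a bit anywhere in the buffer
        buf[pos // 8] ^= 1 << (pos % 8)
    return bytes(buf)


def overwrite_dword(data: bytes, rng: random.Random) -> bytes:
    """Return a copy of data with one 4-byte span replaced by a boundary value."""
    buf = bytearray(data)
    if len(buf) < 4:
        return bytes(buf)
    offset = rng.randrange(len(buf) - 3)
    value = rng.choice(INTERESTING_32)
    buf[offset:offset + 4] = value.to_bytes(4, "little")
    return bytes(buf)


def mutate(seed_file: bytes, seed: int) -> bytes:
    """Deterministic for a given seed, so a failing case can be regenerated."""
    rng = random.Random(seed)
    malformation = rng.choice([flip_bits, overwrite_dword])
    return malformation(seed_file, rng)
```

Driving the randomness from a recorded seed gives you both of the options in Stage 2: the choice of malformation is random, yet any iteration can be replayed exactly when you rerun failures in Stage 6.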
When fuzzing file parsers, we monitor for both handled and unhandled exceptions in the application under test. Exceptions are events that typically represent error conditions encountered during the execution of an application. They can be generated both by the hardware (initiated by the CPU) and/or software (initiated by the executing program or the OS). To monitor for these exceptions, we created a mini-debugger using the Win32 Debugging APIs (For an example of how to integrate a debugger into your fuzz testing tool, check out Michael Howard and Steve Lipner’s SDL Book at http://www.microsoft.com/MSPress/books/8753.asp). The mini-debugger launches the application under test and monitors the parent and all subsequent child processes and associated threads. When an exception occurred, the first version of this tool simply logged the file that caused the exception along with associated details such as the timestamp, exception code, exception address, stack trace and dump file. More recent versions have included the ability to monitor for CPU and memory spikes as well as enabling full page heap settings on all processes launched from the mini-debugger.
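If you don’t have a debugger-based monitor yet, a much weaker stand-in can still get you started. The Python sketch below is not the mini-debugger described above and does not use the Win32 Debugging APIs: it simply launches the target on one fuzzed file and records the kind of details the first version of our tool logged (file name, timestamp, result). The function name, the record fields and the timeout value are assumptions made for illustration.

```python
import subprocess
import sys
import time


def run_case(command, fuzzed_file, timeout_s=10.0):
    """Run the target on one fuzzed file and return a small log record.

    This only observes the exit status and hangs. Unlike a debugger,
    it cannot see exceptions that the application handles itself.
    """
    started = time.time()
    try:
        proc = subprocess.run(
            command + [fuzzed_file],
            capture_output=True,
            timeout=timeout_s,   # the child is killed if it runs too long
        )
        exit_code, timed_out = proc.returncode, False
    except subprocess.TimeoutExpired:
        exit_code, timed_out = None, True
    return {
        "file": fuzzed_file,
        "timestamp": started,
        "exit_code": exit_code,
        "timed_out": timed_out,
        # A non-zero exit or a hang is worth a look; it is not proof of a bug.
        "suspicious": timed_out or exit_code != 0,
    }
```

The limitation noted in the docstring matters: a harness like this sees only failures that end or hang the process, which is why a debugger that receives every exception event is the better long-term choice.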
As a general rule, all exceptions must be triaged (reviewed) by the tester to determine if a bug needs to be filed. When fuzzing over a long period of time, however, we might generate hundreds of exceptions, and sifting through all of them becomes very labor-intensive. What we needed was a way to ease the burden placed on the tester.
To that end, the mini-debugger was extended to enable the automatic “bucketization” of logged exceptions to reduce the chance of having to look at duplicates during the triaging process. This was accomplished by creating unique bucket ids calculated from the stack trace, using both symbols and offsets when that information is available. Each bucket id was used to name a folder in the file system that refers to a unique application exception. When an exception occurred, we calculated a hash (bucket id) of the stack trace and determined if we had already seen this exception. If so, we logged the associated details in a sub-directory under the bucket id folder to which the exception belonged; if not, we first created a new bucket id folder for it. The sub-directory name was created from the name of the fuzzed file that caused the exception. Thus, we were able to reduce the number of potential exceptions that a tester would have to look at during the triage process. It is often the case that certain exceptions are noisy and/or expected, so we also added the ability for the tester to dampen exceptions by exception code. Dampening ensured that those exceptions were not logged (recorded) for triage during a fuzz run. Nonetheless, despite our best efforts, it is still possible for two different stack traces to have the same underlying root cause, so bucketing reduces duplicates but does not eliminate them.
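The bucketing scheme can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than our actual implementation: the choice of SHA-1, the truncated 16-character id, the (module, symbol, offset) frame tuples and the details.txt file name are all made up for the example.

```python
import hashlib
from pathlib import Path


def bucket_id(frames):
    """Hash a stack trace given as (module, symbol, offset) tuples.

    symbol may be None when no symbols are available; the module and
    offset still contribute to the hash in that case.
    """
    digest = hashlib.sha1()
    for module, symbol, offset in frames:
        digest.update(f"{module}!{symbol or '?'}+{offset:#x}\n".encode())
    return digest.hexdigest()[:16]


def log_exception(root, frames, code, fuzzed_file, details, dampened=frozenset()):
    """Record one exception under <root>/<bucket id>/<fuzzed file name>/.

    Returns the case directory, or None if the exception code is dampened.
    """
    if code in dampened:
        return None   # noisy or expected exception: not recorded for triage
    case_dir = Path(root) / bucket_id(frames) / Path(fuzzed_file).name
    case_dir.mkdir(parents=True, exist_ok=True)
    (case_dir / "details.txt").write_text(details)
    return case_dir
```

With this layout, the number of top-level folders is the number of distinct stack traces, and each folder lists every fuzzed file that reproduced that trace, so a tester can start with one representative per bucket.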
Even with all of this automated assistance, the tester might still have several hundred cases to triage. In an effort to prioritize which cases should be triaged first, we introduced the notion of classifying exceptions. Again, we extended the mini-debugger to perform classification on the exception code and relevant details. In particular, we added an extra hierarchy over the automatically generated directory structure described above. To do this we introduced the following categories of exceptions:
- Must Fix
- Further Investigation necessary
- Usually not exploitable
I know what you’re thinking, but remember that this classification doesn’t exclude a tester from the requirement of having to triage all exceptions. The “Must Fix” category was composed of write access violations, read access violations on EIP, /GS and NX related access violations and read access violations where any one of the following was true*:
- The access violation happens on a rep assembly instruction (on an Intel processor) where the count register (ecx) is large.
- The access violation happens on a mov instruction where the result is used as the destination of a call in the instructions immediately after the mov.
- The access violation happens on a mov instruction where the result is later used in a rep instruction as the source (esi), destination (edi) or count (ecx).
*Fully automating the classification of these cases is complex and almost always requires an entire execution trace. As such, teams are also provided with guidance to assist them during their analysis when our tool is unable to classify beyond “read and write access violations”.
The “Further Investigation necessary” category was composed of read access violations that didn’t meet the criteria above as well as other specific cases. Finally, the “Usually not exploitable” category was composed of other exceptions such as divide-by-zero, C++ exceptions and the like. Another thing to keep in mind is that the interpretation of “Usually not exploitable” is different for server-based components. In other words, a divide-by-zero exception in a server product is probably more than just a robustness issue…it might be a denial of service!
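The three categories can be expressed as a small decision function. The Python sketch below is a simplification under stated assumptions: the record type and its field names are hypothetical, the two boolean hints stand in for the disassembly analysis described in the read access violation bullets, and mapping /GS failures to STATUS_STACK_BUFFER_OVERRUN and NX failures to access type 8 is my assumption about how those cases show up in an exception record. The numeric exception codes and access-type values are the standard Windows ones.

```python
from dataclasses import dataclass
from typing import Optional

# Standard Windows exception codes.
EXCEPTION_ACCESS_VIOLATION = 0xC0000005
EXCEPTION_INT_DIVIDE_BY_ZERO = 0xC0000094
STATUS_STACK_BUFFER_OVERRUN = 0xC0000409   # assumed here to represent /GS failures
CPP_EXCEPTION = 0xE06D7363

# First ExceptionInformation value for an access violation.
AV_READ, AV_WRITE, AV_EXECUTE = 0, 1, 8     # 8 is a DEP (NX) violation

MUST_FIX = "Must Fix"
FURTHER = "Further Investigation necessary"
USUALLY_NOT = "Usually not exploitable"


@dataclass
class ExceptionRecord:                      # hypothetical record type
    code: int
    access_type: Optional[int] = None
    fault_address: Optional[int] = None
    instruction_pointer: Optional[int] = None
    rep_with_large_count: bool = False      # hint: faulting rep with large ecx
    feeds_call_or_rep: bool = False         # hint: mov result reaches a call or rep


def classify(rec: ExceptionRecord) -> str:
    if rec.code == STATUS_STACK_BUFFER_OVERRUN:
        return MUST_FIX
    if rec.code == EXCEPTION_ACCESS_VIOLATION:
        if rec.access_type in (AV_WRITE, AV_EXECUTE):
            return MUST_FIX
        # Read access violation "on EIP": the address that could not be
        # read is the instruction pointer itself.
        if rec.fault_address is not None and rec.fault_address == rec.instruction_pointer:
            return MUST_FIX
        if rec.rep_with_large_count or rec.feeds_call_or_rep:
            return MUST_FIX
        return FURTHER
    return USUALLY_NOT
```

The sketch leaves out the “other specific cases” in the middle category and the server-component caveat, and its output only orders the triage queue; every exception still gets reviewed.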
Remember that regardless of this classification the tester is still required to triage all exceptions and file bugs accordingly. I’ll defer more details on the subject of exploitability of program failures to the upcoming annual security issue of MSDN Magazine in November.
To recap, we had a debugging plug-in (mini-debugger) that not only monitored for exceptions but also reduced the number of exceptions to triage after a fuzzing session was completed. This also included monitoring for CPU and memory spikes as well as the use of page heap to capture heap corruptions that might not manifest themselves as an application crash (exception) during the fuzz session. What could go wrong? Enter MS07-017. The software responsible for invoking the vulnerable code [to parse animated cursors] made use of an exception handler to recover from pretty much any exception that could be generated and continue operating as if nothing had occurred (Read more about it at http://blogs.msdn.com/sdl/archive/2007/04/26/lessons-learned-from-the-animated-cursor-security-bug.aspx).
The Animated Cursor bug caused us to revisit our mini-debugger. Why? Put simply, we hadn’t introduced the “bucketization” and classification mechanisms for first-chance exceptions, that is, exceptions reported to the debugger before the application’s own handlers get a chance to run. Naturally, this meant that for handled exceptions the tester was back to square one, with no assistance in the labor-intensive triaging process. To deal with the “recover from anything” exception handling code, we introduced the concept of classifying and bucketing “dangerous” first-chance exceptions to help reduce the number of first-chance cases the tester would need to triage. This means we look for both write access violations and read access violations on EIP. Additionally, we added support for continuing after a first-chance exception, which allows the application’s exception handlers to run so that the fuzz run can proceed to other, possibly more interesting, crashes.
As you can see, fuzz testing scales pretty well, but simplifying and scaling the triage process is not an easy task. Even more challenging is the integration of technology into an effective lifecycle. We’re constantly working with teams within Microsoft to further advance our tools. You can learn more by viewing http://research.microsoft.com/research/pubs/view.aspx?id=1333&type=Technical+Report and http://research.microsoft.com/Pex/.
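The first-chance policy boils down to a small decision made for every exception event the debugger receives. Here is a hedged Python sketch of that decision only; it does not call the debugging APIs. The function names and return shape are hypothetical. DBG_EXCEPTION_NOT_HANDLED is the real continue status a Win32 debugger passes to ContinueDebugEvent to hand the exception back to the application, and the first-chance flag corresponds to the dwFirstChance field of the exception debug event.

```python
DBG_EXCEPTION_NOT_HANDLED = 0x80010001     # continue status: let the app handle it
EXCEPTION_ACCESS_VIOLATION = 0xC0000005
AV_READ, AV_WRITE = 0, 1                   # access type for an access violation


def is_dangerous(code, access_type, fault_address, instruction_pointer):
    """Write access violations, or read access violations on EIP."""
    if code != EXCEPTION_ACCESS_VIOLATION:
        return False
    if access_type == AV_WRITE:
        return True
    return access_type == AV_READ and fault_address == instruction_pointer


def on_exception(code, first_chance, access_type=None,
                 fault_address=None, instruction_pointer=None):
    """Return (should_log, continue_status) for one exception event."""
    if first_chance:
        # Only "dangerous" first-chance exceptions are bucketed and logged.
        should_log = is_dangerous(code, access_type, fault_address,
                                  instruction_pointer)
    else:
        # Second chance means nobody handled it: always record it.
        should_log = True
    # Passing the exception back lets the application's handlers run on a
    # first-chance event; on a second-chance event the process terminates.
    return should_log, DBG_EXCEPTION_NOT_HANDLED
```

For a case like the animated cursor bug, the interesting event is a first-chance write access violation: it is logged even though the application’s handler then swallows it and keeps running.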