Improving the Reliability of Commodity Operating Systems


April 13, 2005


Michael Swift


University of Washington


Despite decades of research in fault tolerance, commodity operating systems, such as Windows and Linux, continue to crash. In this talk, I will describe a new reliability subsystem for operating systems that guards against the most common cause of crashes, device driver failures, without requiring changes to the drivers themselves. To date, the subsystem has been used in Linux to prevent system crashes in the presence of driver failures, recover failed drivers transparently to the OS and applications, and update drivers “on the fly” without requiring a system reboot after installation. Measurements show that the system is extremely effective at protecting the OS from driver failures while imposing little runtime overhead.


Michael Swift

Mike Swift grew up in Amherst, Massachusetts, and received a B.A. from Cornell University in 1992. After college, he worked at Microsoft in the Windows group, where he implemented authentication and access control functionality in Windows Cairo, Windows NT, and Windows 2000. Since 1998, he has been a graduate student at the University of Washington working with Professors Hank Levy and Brian Bershad. At UW, he has studied large-scale clusters, simultaneous multithreading, and operating system reliability.