By Shaz Qadeer, Principal Researcher
The complexity of emerging software applications requires new approaches to understanding and then efficiently building, testing and debugging these systems. Today’s software uses cloud resources, is often embedded in devices in the physical world and employs artificial intelligence techniques. These three factors make today’s software systems an order of magnitude more difficult to develop.
Often, these modern applications feature asynchrony, which happens when, to improve performance, the requestor of an operation continues without requiring the completion of the operation. Asynchrony inevitably leads to concurrency, with its notorious pitfalls of race conditions and Heisenbugs (software bugs, often timing-related, that seems to disappear when investigated, due to changes resulting from the investigation). To address the challenges of asynchronous computation, we have developed P, a programming language for modeling and specifying protocols in asynchronous event-driven applications. This project is a collaborative effort between Microsoft researchers and engineers, and academic researchers at the University of California, Berkeley and Imperial College in London.
The P programmer writes the protocol and its specification at a high level. The P compiler provides automated testing for concurrency-related race conditions and executable code for running the protocol. P provides first-class support for modeling concurrency, specifying safety and liveness properties and checking that the program satisfies its specification using systematic search. In these capabilities, it is similar to Leslie Lamport’s TLA+ and Gerard Holzmann’s SPIN. Unlike TLA+ and SPIN, a P program can also be compiled into executable C code. This capability bridges the gap between high-level model and low-level implementation and eliminates a huge hurdle to the acceptance of formal modeling and specification among programmers.
The programming model in P is based on concurrently executing state machines communicating via events, with each event accompanied by a typed payload value. A memory management system based on linear typing and unique pointers provides safe memory management and data-race-free concurrent execution. In this respect, P is similar to modern systems programming languages such as Rust.
P got its start in Microsoft software development when it was used to ship the USB 3.0 drivers in Windows 8.1 and Windows Phone. These drivers handle one of the most important peripherals in the Windows ecosystem and run on hundreds of millions of devices today. P enabled the detection and debugging of hundreds of race conditions and Heisenbugs early on in the design of the drivers, and is now extensively used for driver development in Windows. Early positive experience with P in the Windows kernel led to the development of P#, a framework that provides state machines and systematic testing via an extension to C#. In contrast to P, the approach in P# is minimal syntactic extension and maximal use of libraries to deliver modeling, specification and testing capabilities.
P is transforming the development of cloud infrastructure in Azure. Azure, similar to other cloud providers, faces the challenge of Heisenbugs caused by unexpected race conditions and software or hardware faults. These bugs result in disruption of live services — a huge problem for both customers and providers of cloud services. P and P# are being used to find and debug Heisenbugs in already-deployed services and to design and validate new services before deployment. P allows engineers to precisely model asynchronous interfaces among components in a large Azure service. It also allows engineers to discover and debug problems on their desktops that would otherwise take months and sometimes even years to manifest after the service is deployed.
An important feature of P that makes it particularly suitable for validating fault-tolerant distributed services is the ability to perform thorough failover testing, that is, validating that the service recovers and resumes operation when an unexpected fault happens. Both network message drops and individual state machine failures are modeled as events. Modeling a fault as an event in P fully automates fault-injection and enables systematic testing of failover under a huge number of event orderings and faults, with little effort from the programmer.
The systematic testing capabilities in P thoroughly search over the choices induced by nondeterministic ordering of concurrently dispatched events. However, its capabilities are limited with respect to handling explicit data input, especially when that input is over large domains. This limitation makes it difficult to apply P to applications such as robotics where a significant source of complexity is decision-making under uncertain input. Dealing with large and uncertain input domains is an ongoing research investigation. We are investigating both symbolic and probabilistic techniques to deal with this challenge.
Download more from Microsoft’s P family of languages and tools