In-Situ Model Checking of MPI Parallel Programs

August 27, 2007
Ganesh Gopalakrishnan | University of Utah, Salt Lake City

MPI is the de facto standard for programming cluster machines used in high performance computing. Even though MPI programs are not as error-prone to write as thread programs based on shared memory, the use of wildcard communications, split operations (posting and later testing), and recently added weak shared memory extensions to MPI all can, nevertheless, render MPI programs quite buggy. Extracting models from MPI programs and using existing model checkers suffers from many problems: extracting models is not easy; also the assumption that the MPI calls work can be unrealistic. In our ongoing work, we run the actual MPI programs, trapping MPI calls, and allowing a custom-designed scheduler process to manifest process interleavings. We employ a dynamic partial order reduction algorithm to cut down the number of interleavings. We take precautions so that the scheduler abides by MPI’s internal progress engine’s conventions. Our resulting ISP (In-Situ Partial Order) tool has detected known deadlocks in a byte-range locking protocol that uses MPI’s weak memory features. It has also found deadlocks in simple message-passing MPI programs. The talk will introduce some MPI commands, their semantics, the dependence relation identified for MPI, and the DPOR algorithm used. It will summarize the directions being pursued to make ISP work more efficiently. Other work in DPOR in our group (e.g. a distributed DPOR algorithm for PThread programs) will be summarized. In summary, the talk will show how DPOR can be implemented in two actual parallel programming contexts. This is joint work with the Gauss group members : http://www.cs.utah.edu/formal_verification

Speaker Details

Ganesh Gopalakrishnan has been on faculty at the University of Utah, School of Computing, since 1986, working on hardware modeling, self-timed circuits, pipelined processors, cache coherence, and memory consistency. His efforts beginning 2003 have a dominant focus on putting dynamic verification tools in the hands of users who grapple with concurrency. He is teaching a class on multi-core computing, and hopes to educate undergraduates in the value of such tools (including CHESS). He also has a student building an FPGA with nine MIPS cores talking the MCAPI protocol on a fabric also on hardware. His efforts have immensely benefited from advice received from researchers at IBM, Intel, FreeScale, and Microsoft. His funding sources are Microsoft, NSF, and SRC.