# Effective Program Verification for Relaxed Memory Models 

January 31, 2008<br>Technical Report<br>MSR-TR-2008-12

Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052

This page intentionally left blank.

# Effective Program Verification for Relaxed Memory Models 

Sebastian Burckhardt Madanlal Musuvathi<br>Microsoft Research<br>\{sburckha, madanm\}@microsoft.com


#### Abstract

Program verification for relaxed memory models is hard. The high degree of nondeterminism in such models challenges standard verification techniques. This paper proposes a new verification technique for the most common relaxation, store buffers. Crucial to this technique is the observation that all programmers, including those who use low-lock techniques for performance, expect their programs to be sequentially consistent. We first present a monitor algorithm that can detect the presence of program executions that are not sequentially consistent due to store buffers while only exploring sequentially consistent executions. Then, we combine this monitor with a stateless model checker that verifies that every sequentially consistent execution is correct. We have implemented this algorithm in a prototype tool called Sober and present experiments that demonstrate the precision and scalability of our method. We find relaxed memory model bugs in several programs, including a previously unknown bug in a production-level concurrency library that would have been difficult to find by other means.


## 1. Introduction

Developers of performance-critical multi-threaded software often try to avoid the overhead of traditional locking by either making direct use of hardware primitives for atomic operations (such as interlocked exchange, or compare-and-swap), or by employing regular loads and stores for synchronization purposes. Unfortunately, such "low-lock" programs are notoriously hard to get right [4, 21]. Subtle bugs can arise in these programs due to memory reordering caused by the relaxed memory model of the underlying hardware [1]. These errors are hard to find and debug as they most often show up only in specific thread interleavings and in particular hardware configurations. On the other hand, low-lock code is heavily used both in low-level libraries and in critical paths of a system. Because these parts are crucial to the reliability of the entire system, it is important to develop verification techniques.

In general, the same program may exhibit more executions on a relaxed model than on a sequentially consistent (SC) machine [19], as illustrated in Fig. 1. Let $\mathcal{T}_{\pi}^{Y}$ denote the set of executions of program $\pi$ on memory model $Y$. Most existing program verification tools can not verify directly whether the executions in $\mathcal{T}_{\pi}^{Y}$ are correct (unless $Y=S C$ ). A few specialized memory model sensitive verification tools exist $[4,14,23,26,27]$ but scalability and automation remain a challenge.

A key observation of this paper is that programmers, even those writing low-lock code, expect their programs to be sequentially consistent. They design their programs to be correct for SC executions and insert memory ordering fences to counter relaxations where necessary. In particular, any program execution that is not SC is almost always an error, resulting either from an insufficient use of fences or a misunderstanding of the underlying memory model.

This observation suggests that we can sensibly verify the relaxed executions $\mathcal{T}_{\pi}^{Y}$ by solving the following two verification problems separately:

1. Use standard verification methodology for concurrent programs to show that the executions in $\mathcal{T}_{\pi}^{S C}$ are correct.
2. Use specialized methodology for memory model safety verification, showing that $\mathcal{T}_{\pi}^{Y}=\mathcal{T}_{\pi}^{S C}$. We say the program $\pi$ is $Y$-safe if $\mathcal{T}_{\pi}^{Y}=\mathcal{T}_{\pi}^{S C}$.

In this paper we focus on verifying memory model safety for the most common relaxation in modern multiprocessors, store buffers with store-load forwarding. The corresponding memory model is historically called $T S O$ (total store order) [25], and we use the terms $T S O$-safety and store buffer safety interchangeably. Under $T S O$, processors may delay the effect of a store instruction in a processor-local FIFO buffer (to hide the memory latency). While the values of these store instructions are immediately visible to the local processor, other processors see these values only when the store buffer is committed at a later time. Fig. 1(b) shows a simple example. We provide a rigorous characterization of TSO in Section 3.3.

Apart from the fact that store buffers are so common (as apparent in Fig. 1(a), $\mathcal{T}_{\pi}^{\text {TSO }} \subseteq \mathcal{T}_{\pi}^{Y}$ for almost all models $Y$ ), our motivation for focusing on TSO largely arises from the need to prepare the huge body of legacy code heavily optimized to run on x86 machines for future multicore chip generations. These processors are likely to make increased use of store buffers but are otherwise fairly conservative as far as the memory model is concerned [16].

The main contribution of this paper is a technique for checking the store buffer safety of a program while only exploring its sequentially consistent executions, which lets us perform the steps 1 and 2 above simultaneously. Our technique relies on a notion of borderline execution, which is an SC execution that can be extended into an execution in $\mathcal{I}_{\pi}^{T S O} \backslash \mathcal{T}_{\pi}^{S C}$. We establish that a program is store buffer safe exactly if there are no borderline executions (Theorem 7). Then we present an efficient, precise monitor for detecting borderline executions, using a novel generalized vector clock algorithm.

We have implemented these ideas in a prototype tool called Sober. Sober combines our store buffer safety monitor with the stateless model checker CHESS [22] which systematically enumerates the $S C$ executions of a bounded concurrent test program and checks them for errors such as null pointers or assertion violations. In principle, Sober terminates with one of three possible outputs. First, Sober may detect a regular program error and output an erroneous execution. Second, Sober may report that the program is not store buffer safe. Finally, Sober may terminate without finding an error, proving that all TSO executions of the program are correct. In practice, exhaustive verification is too time-consuming for most

(a)

| Initially: $X=Y=0$ |
| :--- |
| processor 1 processor 2 <br> $\mathrm{X}=1$ $\mathrm{Y}=1$ <br> $r_{1}=\mathrm{Y}$ $r_{2}=\mathrm{X}$ |
| Eventually: $r_{1}=0, r_{2}=0$ |

When the two threads run on different processors, the stores to $X$ and $Y$ in the first line can possibly be delayed by the store buffers in these processors. Subsequent loads in the second line see the initial values of $X$ and $Y$ if the store buffers have not yet been committed.
(b)

Figure 1. (a) A comparison of various memory models [5, 9, 15, 17, 25]. (b) An execution that is possible on TSO but not on SC.
programs and we resort to iterative context-bounding [22], which provides verification guarantees up to a specific preemption bound.

Section 5 describes our initial experiments. Using Sober we found and fixed store buffer issues in several programs, including Dekker's mutual exclusion protocol [2] and the Bakery protocol [18]. We got our greatest success so far when we applied Sober to a component of a concurrency library at Microsoft. This component implements a low-lock datastructure. Sober demonstrated a store buffer problem that the developer immediately agreed was a real error. This bug was never detected during the extensive codereview and testing the component underwent.

## 2. Related Work

Prior work has addressed the verification of programs for relaxed memory models using explicit state enumeration [23, 6, 14] and using constraint solving [12, 28, 3, 4]. Our work improves upon them in scalability.

To our knowledge, this paper is the first to demonstrate the possibility of program verification without exploring the additional nondeterminism of memory-model relaxation. See the experiments in Huynh and Roychoudhury [14] for the state space explosion caused by this nondeterminism even for simple programs.

This paper is definitely not the first to observe that sequential consistency is the most natural memory model for programmers [19, 1, 13]. The Java Memory Model [20] guarantees sequential consistency for a broad class of programs, namely those which are data-race free. In contrast, our characterization of memory model safety precisely captures those programs which behave sequentially consistent in a memory model. In particular, a program with data-races might still be memory-model safe.

Specialized algorithms to automatically insert fences based on static analysis [24, 7] can guarantee memory-safety in principle. However, doubts remain about their precision in the presence of aliasing, loops, and conditionals and the performance implication of conservative fence insertion. Also, the memory models considered in these algorithms assume atomic memory and cannot model store buffers, the main emphasis of this paper.

## 3. Problem Formulation

We represent the relevant aspects of a program executions by a memory trace, or just trace. A trace is a collection of events, each representing a memory access (either a store, a load, or an interlocked operation ${ }^{1}$ ) by a specific processor to a specific address. Each event has an issue index, which is a sequence number relative

[^0]to all events by the same processor. Furthermore, each event has a coherence index, which is the sequence number of the value that is read or written by the event, relative to the entire value sequence written to the targeted memory location during the execution.

### 3.1 Traces

Formally, let $O p=\{s t, l d, i l\}$, let $\mathbb{N}$ be the set of natural numbers, let Proc $=\{1, \ldots, N\}$ be a finite set of processor identifiers for some fixed bound $N \in \mathbb{N}$, let $A d r$ be a finite set of memory addresses, and let $\mathbb{N}_{0} \subseteq \mathbb{Z}$ be the set of nonnegative integers. Then we define the set of events as $E v t=O p \times \operatorname{Proc} \times \mathbb{N} \times A d r \times \mathbb{N}_{0}$, and we denote elements $e \in E v t$ using the syntax $o(p, i, a, c)$, where $o \in O p, p \in \operatorname{Proc}, i \in \mathbb{N}$ is the issue index, $a \in A d r$, and $c \in \mathbb{N}_{0}$ is the coherence index. We use corresponding projection functions $o(e), p(e), i(e), a(e), c(e)$ for an event $e$. Given a set $E \subseteq E v t$ of events, we define the following subsets for notational convenience:

$$
\begin{aligned}
\text { (commands by proc. } p \text { ) } & E(p)=\{e \in E \mid p(e)=p\} \\
\text { (load events) } & L(E)=\{e \in E \mid o(e)=1 d\} \\
\text { (store events) } & S(E)=\{e \in E \mid o(e)=s t\} \\
\text { (events that write) } & W(E)=\{e \in E \mid o(e) \in\{\text { st, il }\}\} \\
\text { (events that read) } & R(E)=\{e \in E \mid o(e) \in\{1 d, \text { il }\}\} \\
\text { (events that write to } a) & W(E, a)=\{e \in W(E) \mid a(e)=a\}
\end{aligned}
$$

We call a function $f: E v t \rightarrow \mathbb{N}$ an index function for a subset $S^{\prime} \subseteq E v t$ if $f\left(S^{\prime}\right)=\left\{1, \ldots,\left|S^{\prime}\right|\right\}$ (including the special case where $S^{\prime}$ is empty).

DEFINITION 1 (Traces). $A$ trace is a subset $E \subseteq$ Evt satisfying
(E1) For all $p \in$ Proc, $i$ is an index function for $E(p)$.
(E2) For all $a \in A d r, c$ is an index function for $W(E, a)$.
(E3) For all $l \in L(E)$, either $c(l)=0$, or there exists a $w \in W(E, a(l))$ such that $c(l)=c(w)$.

Define $\mathcal{T} \subseteq \mathcal{P}(E v t)$ to be the set of all traces. We say a trace $E$ is a prefix of a trace $E^{\prime}$ if $E \subseteq E^{\prime}$.

To reason about traces, we introduce binary relations $\rightarrow_{p}$ and $\rightarrow_{c}$ :

- We use the program order $\rightarrow_{p} \subseteq$ Evt $\times$ Evt to describe the relative order of events by the same processor. Specifically, we define $e \rightarrow_{p} e^{\prime}$ if and only if $p(e)=p\left(e^{\prime}\right)$ and $i(e)<i\left(e^{\prime}\right)$. For any trace $E, \rightarrow_{p}$ is a partial order on $E$ and a total order on $E(p)$ for all $p \in$ Proc.
- We use the conflict order $\rightarrow_{c} \subseteq E v t \times E v t$ to describe the relative order of conflicting accesses (where we call two accesses $e, e^{\prime} \in$ Evt conflicting if $a(e)=a\left(e^{\prime}\right)$ and $\left\{e, e^{\prime}\right\} \cap W(E v t) \neq$ $\emptyset$ ). Specifically, we define: $e \rightarrow_{c} e^{\prime}$ if and only if $a(e)=a\left(e^{\prime}\right)$ and either (1) $o\left(e^{\prime}\right) \in W(E v t)$ and $c(e)<c\left(e^{\prime}\right)$, or (2) $\left(e, e^{\prime}\right) \in W(E v t) \times L(E v t)$ and $c(e) \leq c\left(e^{\prime}\right)$. The conflict
order is not actually an 'order' in the mathematical sense because it is not transitive.


### 3.2 Memory Models

We now proceed to define the memory models $S C$ (sequential consistency) and TSO (total store order) using an axiomatic style. To state the definitions concisely, we define the binary relation $\rightarrow_{h b}$, called happens-before relation, to be the union of the program and conflict orders: $\rightarrow_{h b}=\left(\rightarrow_{p} \cup \rightarrow_{c}\right)$. Note that this definition does not make $\rightarrow_{h b}$ implicitly transitive; we will take the transitive closure $\rightarrow_{h b}^{*}$ explicitly if required by the context.
DEFINITION 2 (SC). Define the set $\mathcal{T}^{S C} \subseteq \mathcal{T}$ of sequentially consistent traces to consist of all traces $E$ that satisfy the following condition:

## (SC1) The relation $\rightarrow_{h b}$ is acyclic on $E$.

To define $T S O$ for any given event set $E$, we first define the relaxed happens-before relation $\rightarrow_{r h b}$ :
$\rightarrow_{r h b}=\rightarrow_{h b} \backslash\left\{\left(e, e^{\prime}\right) \mid e \rightarrow_{p} e^{\prime} \wedge o(e)=s t \wedge o\left(e^{\prime}\right)=l d\right\}$
Thus the $\rightarrow_{r h b}$ relation does not put a happens-before edge between a store and a subsequent load of the same processor (even if they have the same address). This reflects the existence of a store buffer: a store may globally commit after subsequent loads by the same processor, and thus not globally appear as 'happening before the load'.
Definition 3 (TSO). Define the set $\mathcal{T}^{T S O} \subseteq \mathcal{T}$ of totally-storeordered traces to consist of all traces $E$ that satisfy the following conditions:
(TSO1) The relation $\rightarrow_{r h b}$ is acyclic on $E$.
(TSO2) never $\left(e \rightarrow_{p} e^{\prime} \wedge e^{\prime} \rightarrow_{c}\right.$ e) for any $e, e^{\prime} \in E$
The axiom (TSO2) is required to guarantee that loads correctly "snoop" the store buffer: the coherence index of a load may not be less than that of a previous store to the same address by the same processor. We establish the connection between these concise axiomatic definitions and a more verbose operational description of $S C$ and $T S O$ in the appendix.

### 3.3 Program Traces

We now formally define the set of traces $\mathcal{T}_{\pi}^{Y}$ that a program $\pi$ may exhibit on a memory model $Y \in\{S C, T S O\}$. To keep our formalization light, we represent a program $\pi$ abstractly by a function $n e x t_{\pi}: \mathcal{T} \times$ Proc $\rightarrow \mathcal{P}(O p \times A d r)$. The set $\operatorname{next}_{\pi}(E, p)$ describes what instructions (combinations of operation and address) may possibly be issued by processor $p$ next, after having executed $E$. For a trace $E$, let last $(E, p)$ be the element $e \in E(p)$ such that $i(e)$ is maximal, or undefined if $E(p)=\emptyset$. We say that a program $\pi$ is locally deterministic if for all $(E, p) \in \operatorname{dom} n^{n} \mathrm{e}_{\pi}$, we have (1) $\mid$ next $_{\pi}(E, p) \mid \leq 1$, and (2) for all prefixes $E^{\prime} \subseteq E$ such that $\operatorname{last}\left(E^{\prime}, p\right)=\operatorname{last}(E, p)$, we have $\operatorname{next}_{\pi}(E, p)=\operatorname{next}_{\pi}\left(E^{\prime}, p\right)$. In the following, we will assume without further mention that all programs are locally deterministic. For a trace $E \in \mathcal{T}$, define the set of possible successor events under program $\pi$ as

$$
\begin{aligned}
\operatorname{succ}_{\pi}(E)=\{e \in(E v t \backslash E) \mid(E \cup\{e\} & \in \mathcal{T}) \\
\text { and } \operatorname{next}_{\pi}(E, p(e)) & =(o(e), a(e))\} .
\end{aligned}
$$

Definition 4 (Program Traces). For a program $\pi$ and memory model $Y \in\{S C, T S O\}$, define the set of traces $\mathcal{T}_{\pi}^{Y}$ inductively as the smallest set satisfying (i) $\emptyset \in \mathcal{T}_{\pi}^{Y}$, and (ii) for all $E \in \mathcal{T}_{\pi}^{Y}$ and $e \in$ succ $_{\pi}(E)$ such that $E \cup\{e\} \in \mathcal{T}^{Y}$, we have $E \cup\{e\} \in \mathcal{T}_{\pi}^{Y}$.

Definition 5 (Store Buffer Safety). The program $\pi$ is called store buffer safe if and only if $\mathcal{T}_{\pi}^{\text {TSO }}=\mathcal{T}_{\pi}^{S C}$.

## 4. Solution

We now describe how we can check store buffer safety by exploring $\mathcal{T}_{\pi}^{S C}$ only. The idea is to look for borderline traces which are defined as follows.

Definition 6 (Borderline Trace). A sequentially consistent trace $E \in \mathcal{T}_{\pi}^{\text {SC }}$ of a program $\pi$ is called $a$ borderline trace if there exists an $e \in \operatorname{succ}_{\pi}(E)$ such that $E \cup\{e\} \in\left(\mathcal{T}_{\pi}^{\text {TSO }} \backslash \mathcal{T}_{\pi}^{S C}\right)$.

THEOREM 7. A program $\pi$ is store buffer safe if and only if it has no borderline traces.

PROOF. If $E \in \mathcal{T}_{\pi}^{S C}$ is a borderline trace, then there exists a trace $E \cup\{e\} \underset{( }{\in}\left(\mathcal{T}_{\pi}^{T S O} \backslash \mathcal{T}_{\pi}^{S C}\right)$ implying $\mathcal{T}_{\pi}^{S C} \neq \mathcal{T}_{\pi}^{\text {TSO }}$. Conversely, assume $\mathcal{T}_{\pi}^{S C} \neq \mathcal{T}_{\pi}^{T S O}$. Because $\mathcal{T}_{\pi}^{S C} \subseteq \mathcal{T}_{\pi}^{T S O}$, there must exist $E \in\left(\mathcal{T}_{\pi}^{T S O} \backslash \mathcal{T}_{\pi}^{S C}\right)$. By construction of $\mathcal{T}_{\pi}^{T S O}$, there exist traces $E_{0}, \ldots, E_{n} \in \mathcal{T}_{\pi}^{\text {TSO }}$ and events $e_{1}, \ldots, e_{n}$ such that $E_{0}=\emptyset$, $\left\{e_{k}\right\}=E_{k} \backslash E_{k-1}$, and $E_{n}=E$. Because $E_{n} \notin \mathcal{T}_{\pi}^{S C}$ but $E_{0} \in \mathcal{T}_{\pi}^{S C}$, there exists a minimal $k$ such that $E_{k} \notin \mathcal{T}_{\pi}^{S C}$. This implies that $E_{k-1} \in \mathcal{T}_{\pi}^{S C}$ and $E_{k-1}$ is a borderline trace (because $E_{k-1} \cup\left\{e_{k}\right\} \in\left(\mathcal{T}_{\pi}^{\text {TSO }} \backslash \mathcal{T}_{\pi}^{S C}\right)$ ).

The following cycle characterization lemma provides an efficient method to detect borderline traces. For a trace $E$, let $\operatorname{last} R(E, p)$ be the element $e \in E(p) \cap R(E)$ such that $i(e)$ is maximal, or be undefined if $(E(p) \cap R(E))=\emptyset$; and let write $(E, a, c)$ denote the element $e \in W(E, a)$ such that $c(e)=c$ if it exists, or be undefined otherwise.

Lemma 8 (Cycle Characterization). Let $E \in \mathcal{T}_{\pi}^{S C}$ be a sequentially consistent trace of $\pi$, and let $e=o(p, i, a, c) \in \operatorname{succ}_{\pi}(E)$. Let $E^{\prime}=E \cup\{e\}$. Then:
(1) $E^{\prime} \notin \mathcal{T}_{\pi}^{S C}$ if and only if $o=1 d$ and write $(E, a, c+1) \rightarrow_{h b}^{*}$ $\operatorname{last}(E, p)$.
(2) $E^{\prime} \notin \mathcal{T}_{\pi}^{\text {TSO }}$ if and only if $o=1 d$ and either
(i) write $(E, a, c+1) \rightarrow_{\text {rhb }}^{*} \operatorname{last} R(E, p)$, or
(ii) there exists $c^{\prime}>c$ such that $p\left(\right.$ write $\left.\left(E, a, c^{\prime}\right)\right)=p$.

PROOF. $(\mathbf{1} \Leftarrow)$. If $o=l d$ and $\operatorname{write}(E, a, c+1) \rightarrow_{h b}^{*} \operatorname{last}(E, p)$, then

$$
e \rightarrow_{c} \operatorname{write}(E, a, c+1) \rightarrow_{h b}^{*} \operatorname{last}(E, p) \rightarrow_{p} e
$$

which forms a $\rightarrow_{h b}$-cycle, implying $E^{\prime} \notin \mathcal{T}^{S C}$ by (SC1), and thus $E^{\prime} \notin \mathcal{T}_{\pi}^{S C}$. $\mathbf{( 2} \Leftarrow$ ). either (i) or (ii) must hold; if (i) holds, we proceed as in case $(1 \Leftarrow)$ : we use $e \rightarrow_{c}$ write $(E, a, c+1)$ and $\operatorname{lastR}(E, p) \rightarrow_{p} e$ to construct a cycle (this time, a $\rightarrow_{\text {rhb }}$ cycle) which implies $E^{\prime} \notin \mathcal{T}^{\text {TSO }}$ by (TSO1), and thus $E^{\prime} \notin$ $\mathcal{T}_{\pi}^{\text {TSO }}$. If (ii) holds, then either write $\left(E, a, c^{\prime}\right) \rightarrow_{p}$ e or $e \rightarrow_{p}$ write ( $E, a, c^{\prime}$ ); but the latter is impossible because both $E$ and $E^{\prime}$ are traces (specifically, because $i$ is an index function on both $E(p)$ and $E^{\prime}(p)$ ). Therefore, write $\left(E, a, c^{\prime}\right) \rightarrow_{p} e$. Along with $e \rightarrow c$ write ( $E, a, c^{\prime}$ ) we conclude $E^{\prime} \notin \mathcal{T}^{T S O}$ by (TSO2), and thus $E^{\prime} \notin \mathcal{T}_{\pi}^{\text {TSO }}$. $(\mathbf{1} \Rightarrow)$. Assume $E^{\prime} \notin \mathcal{T}_{\pi}^{S C}$. Then $E^{\prime} \notin \mathcal{T}^{S C}$ (by Def. 4(ii)), which means (SC1) does not hold: specifically, $E \cup\{e\}$ has a $\rightarrow_{h b}$-cycle. Because $\rightarrow_{h b}$ is acyclic on $E$ (because $E \in \mathcal{T}^{S C}$ ), it must be of the form $e \rightarrow_{h b} e_{1} \rightarrow_{h b} \ldots \rightarrow_{h b} e_{n} \rightarrow_{h b} e$ where all $e_{k} \in E$ and $n \geq 1$. Now, $e \rightarrow_{h b} e_{1}$ by definition implies that either $e \rightarrow_{p} e_{1}$ or $e \rightarrow_{c} e_{1}$. As reasoned earlier, it can not be the case that $e \rightarrow_{p} e_{1}$ (because $E$ and $E^{\prime}$ are both traces), thus $e \rightarrow_{c} e_{1}$. This implies that $o=l d$ (because $c$ is an index function on both $W(E, a)$ and $W\left(E^{\prime}, a\right)$ ). Because $e$ is a load and $e \rightarrow_{c} e_{1}$, we know $o\left(e_{1}\right) \in\{s t$, il $\}, a\left(e_{1}\right)=a$ and $c\left(e_{1}\right)>c$, and thus either write $(E, a, c+1)=e_{1}$ or write $(E, a, c+1) \rightarrow_{c} e_{1}$. Therefore $\operatorname{write}(E, a, c+1) \rightarrow_{h b}^{*} e_{n}$. Now, it can not be the case that $e_{n} \rightarrow_{c} e$ (otherwise $e_{n} \vec{c}_{c}^{*} e_{1}$ which creates a $\rightarrow_{h b}$-cycle within $E$, contradicting $E \in \mathcal{T}_{\pi}^{S C}$ ), thus

```
function is_store_buffer_safe ( \(e_{1} e_{2} \ldots e_{n}\) )
                                    returns boolean \{
    \(\operatorname{var} \mathrm{k}, \mathrm{p}, \mathrm{a}, \mathrm{c}: \mathbb{N}\); \(\operatorname{var} \mathrm{E}: \mathcal{T}\);
    E := \(\emptyset\);
    for (k := 1; k <= n; k++) \{
        if \(\left(o\left(e_{k}\right)=1 d\right)\) \{
            \(\mathrm{p}:=p\left(e_{k}\right) ; \mathrm{a}:=a\left(e_{k}\right) ; \mathrm{c}:=c\left(e_{k}\right) ;\)
            while \((\mathrm{c}>0)\) \{
                if \((\mathrm{p}=p(\) write \((\mathrm{E}, \mathrm{a}, \mathrm{c})))\)
                break;
            if (write (E, a, c) \(\rightarrow_{r h b}^{*} \operatorname{lastR(E,p))}\)
                break;
            if (write (E, a, c) \(\left.\rightarrow_{h b}^{*} \operatorname{last}(E, p)\right)\)
                return false;
            c := c - 1;
        \}
        \}
        \(\mathrm{E}:=\mathrm{E} \cup e_{k} ;\)
    \}
    return true;
\}
```

Figure 2. Our algorithm to monitor store buffer safety in a given interleaving.
$e_{n} \rightarrow_{p} e$. Therefore, either $e_{n}=\operatorname{last}(E, p)$ or $e_{n} \rightarrow_{p} \operatorname{last}(E, p)$. We can thus conclude that $\operatorname{write}(E, a, c+1) \rightarrow_{h b}^{*} \operatorname{last}(E, p)$ as desired. ( $\mathbf{2} \Rightarrow$ ). If $E^{\prime} \notin \mathcal{T}_{\pi}^{\text {TSO }}$ then $E^{\prime} \notin \mathcal{T}^{\text {TSO }}$ (by Def. 4(ii)). Thus either (TSO1) or (TSO2) must be violated. First, assume that $E^{\prime}$ does not satisfy (TSO1). Just as in $(1 \Rightarrow)$ (but using the relation $\rightarrow_{r h b} \subseteq \rightarrow_{h b}$ ), we conclude that there exists a cycle of the form $e \rightarrow_{r h b} e_{1} \rightarrow_{r h b} \ldots \rightarrow_{r h b} e_{n} \rightarrow_{r h b} e$, that $e \rightarrow_{c} e_{1}$, that $o=1 d$, that write $(E, a, c+1) \rightarrow_{r h b}^{*} e_{n}$, and that $e_{n} \rightarrow_{p} e$. The latter implies that $o\left(e_{n}\right) \neq$ st (otherwise not $e_{n} \rightarrow_{r h b} e$ ), and therefore either $e_{n}=\operatorname{lastR}(E, p)$ or $e_{n} \rightarrow_{r h b} \operatorname{lastR}(E, p)$. Thus condition (i) is satisfied. Next, assume that $E^{\prime}$ does not satisfy (TSO2). Because $E$ does, and because we know that not $e \rightarrow_{p} e^{\prime}$ for any $e^{\prime} \in E$ (because $E$ and $E^{\prime}$ are both traces), there must exist an $e^{\prime} \in E$ such that $e^{\prime} \rightarrow_{p} e$ and $e \rightarrow_{c} e^{\prime}$. This implies $o(e)=l d$ (because $c$ is an index function on both $W(E, a)$ and $W\left(E^{\prime}, a\right)$ ). Because $e$ is a load and $e \rightarrow_{c} e^{\prime}$, we know $o\left(e^{\prime}\right) \in\{s t, i l\}$, $a\left(e^{\prime}\right)=a$ and $c\left(e^{\prime}\right)>c$. Thus, condition (ii) is satisfied with $c^{\prime}=c\left(e^{\prime}\right) . \square$

### 4.1 Monitor Algorithm

Fig. 2 shows our implementation of a monitor that can monitor store buffer safety in any interleaved execution of the program. It processes the events in the sequence in order (and can thus be used online or offline) and reports any detected borderline traces. We now qualify the soundness and completeness of this monitor. For a sequence $w=e_{1} \ldots e_{n} \in E v t^{*}$ of events, let $E_{w}=\left\{e_{1}, \ldots e_{n}\right\}$. The sequence $w$ is called an interleaving of a program $\pi$ if (1) the $e_{k}$ are pairwise distinct, (2) $E_{w} \in \mathcal{T}_{\pi}^{S C}$, (3) $e_{x} \rightarrow_{h b} e_{y} \Longrightarrow x<$ $y$, and (4) next $\left(E_{w}, p\right)=\emptyset$ for all $p \in$ Proc.

THEOREM 9 (Soundness). If an an interleaving $w$ of program $\pi$ is reported unsafe by our monitor, then $\pi$ is not store buffer safe.
PROOF. Assume is_store_buffer_safe(w) returns false for $w=e_{1} \ldots e_{n}$. Let $E, k, p, i, a$ and $c^{\prime}$ be the values of the program variables $\mathrm{E}, \mathrm{k}, \mathrm{p}, \mathrm{i}, \mathrm{a}$, and c at the time of the return, respectively. Then $E=\left\{e_{1}, \ldots, e_{k-1}\right\}$, and $e_{k}=l d(p, i, a, c)$ for some $c$. Let $e=e_{k}$, and let $e^{\prime}=l d\left(p, i, a, c^{\prime}-1\right)$. We now argue that $E^{\prime}=E \cup\left\{e^{\prime}\right\} \in\left(\mathcal{T}_{\pi}^{\text {TSO }} \backslash \mathcal{T}_{\pi}^{S C}\right)$, which implies that $E$ is a borderline trace and thus $\mathcal{T}_{\pi}^{S C} \neq \mathcal{T}_{\pi}^{\text {TSO }}$ by Theorem 7 as desired. First, note that $e^{\prime} \in \operatorname{succ}_{\pi}(E)$ because $E \cup\{e\} \in \mathcal{T}_{\pi}^{S C}$ implies $E \cup\left\{e^{\prime}\right\} \in \mathcal{T}$ and $(o, a) \in \operatorname{next}_{\pi}(E, p)$ (using that $\pi$ is locally deterministic). We can thus enlist the help of Lemma 8 to show $E^{\prime} \in\left(\mathcal{T}_{\pi}^{T S O} \backslash \mathcal{T}_{\pi}^{S C}\right)$. First, because the program returned
at line 14 , we know $\operatorname{write}\left(E, a, c^{\prime}\right) \rightarrow_{h b}^{*} \operatorname{last}(E, p)$, which implies $E^{\prime} \notin \mathcal{T}_{\pi}^{\text {SC }}$ by Lemma 8, part (1). Second, because the program did not break at line 12 right before returning on line 14 , we know that not $\left(\right.$ write $\left.\left(E, a, c^{\prime}\right) \rightarrow_{r h b}^{*} \operatorname{last} R(E, p)\right)$. Moreover, because the while loop was not broken at line 10 , we know that $p\left(\right.$ write $\left.\left(E, a, c^{\prime \prime}\right)\right) \neq p$ for all $c^{\prime \prime} \geq c^{\prime}$. By Lemma 8, part (2) we conclude that $E^{\prime} \in \mathcal{T}_{\pi}^{\text {TSO }}$.

As for completeness, we clearly cannot detect all borderline traces by looking at a single interleaving $w$ only. However, it is possible to detect them reliably by checking a sufficient set of interleavings. Specifically, we call a set of interleavings $I \subseteq E v t^{*}$ a representative for program $\pi$ if for all $E \in \mathcal{T}_{\pi}^{S C}$ there exists an interleaving $w \in I$ such that $E \subseteq E_{w}$ and there are no $\rightarrow_{h b}$-edges from $E_{w} \backslash E$ into $E$.

Theorem 10 (Completeness). Let I be a representative set of interleavings of a program $\pi$. Then, if $\pi$ is not store buffer safe, our monitor will detect it on some interleaving $w \in I$.

PROOF. By Theorem 7, we know that $\mathcal{T}_{\pi}^{S C} \neq \mathcal{T}_{\pi}^{\text {TSO }}$ implies that there exists a borderline trace $E \in \mathcal{T}_{\pi}^{S C}$. Thus there exists an element $e=o(p, i, a, c) \in E v t$ such that $E^{\prime}=(E \cup\{e\}) \in$ $\mathcal{T}_{\pi}^{\text {TSO }} \backslash \mathcal{T}_{\pi}^{S C}$. Because $I$ is representative, it must contain an interleaving $w=e_{1} \ldots e_{n}$ such that $E \subseteq E_{w}$ is a prefix. Because $(o(e), a(e)) \in \operatorname{eext}_{\pi}(E, p)$, there must be a $k$ such that $p\left(e_{k}\right)=p$ and $i\left(e_{k}\right)=i$ (otherwise last $\left(E_{w}, p\right)=\operatorname{last}(E, p)$ and thus $\operatorname{next}_{\pi}(E, p)=\operatorname{next}_{\pi}\left(E_{w}, p\right)$, contradicting next $\left.\left(E_{w}, p\right)=\emptyset\right)$. We now claim that if the algorithm reaches the $k$-th iteration, it must return false (if it returns prior to that, it also returns false and we are satisfied). Let $E_{k}=\left\{e_{1}, \ldots, e_{k-1}\right\}$. By Lemma 8, part (1), we know that write $(E, a, c+1) \rightarrow_{h b}^{*} \operatorname{last}(E, p)$ within $E$. Now, by the choice of $k$, we know $E(p)=E_{k}(p)$, thus $\operatorname{last}(E, p)=\operatorname{last}\left(E_{k}, p\right)$, and because $w$ is an interleaving (respects $\left.\rightarrow_{h b}\right)$, this implies write $\left(E_{k}, a, c+1\right) \rightarrow_{h b}^{*} \operatorname{last}\left(E_{k}, p\right)$ within $E_{k}$. Moreover, we know that $c\left(e_{k}\right) \geq(c+1)$ because $w$ is an interleaving and write $\left(E_{k}, a, c+1\right)$ appears before $e_{k}$ in $w$. Thus, the while loop (which assigns $c\left(e_{k}\right)$ to the variable c initially, and then keeps decrementing it) must eventually return true at line 14 unless it is broken at either line 10 or line 12 . But that is not possible, for the following reasons. First, suppose line 10 breaks. Let $c^{\prime}$ be the value of the variable c at that time; then $c+1 \leq$ $c^{\prime} \leq c\left(e_{k}\right)$ and $p\left(\right.$ write $\left.\left(E_{k}, a, c^{\prime}\right)\right)=p$. Now, because $E(p)=$ $E_{k} \overline{(p)}$, we know $\operatorname{write}\left(E_{k}, a, c^{\prime}\right) \in E$. Thus, write $\left(E, a, c^{\prime}\right)=$ write $\left(E_{k}, a, c^{\prime}\right)$, implying $p\left(\right.$ write $\left.\left(E, a, c^{\prime}\right)\right)=p$ which in turn implies $E^{\prime} \notin \mathcal{T}_{\pi}^{\text {TSO }}$ by Lemma 8, part (2ii), contradicting the assumption. Next, suppose line 12 breaks. Let $c^{\prime}$ be the value of the variable c at that time; then $c+1 \leq c^{\prime} \leq c\left(e_{k}\right)$ and write $\left(E_{k}, a, c^{\prime}\right) \rightarrow_{r h b}^{*} \operatorname{lastR}\left(E_{k}, p\right)$ within $E_{k}$. Now, because $E(p)=E_{k}(p), \operatorname{last} R\left(E_{k}, p\right)=\operatorname{lastR}(E, p)$. Because there are no $\rightarrow_{h b}$-edges (and thus no $\rightarrow_{r h b}$-edges) from $E_{w}$ into $E$, this implies that write $\left(E, a, c^{\prime}\right) \rightarrow_{r h b}^{*} \operatorname{lastR}(E, p)$. Because $c+1 \leq c^{\prime}$, this implies write $(E, a, c) \rightarrow_{r h b}^{*} \operatorname{last} R\left(E_{k}, p\right)$, which in turn implies $E^{\prime} \notin \mathcal{T}_{\pi}^{\text {TSO }}$ by Lemma 8, part (2i), contradicting the assumption. $\square$

A stateless model checker (such as Verisoft [11] or Chess[22]) can provide us with a representative set of interleavings if the program is bounded (we call a program bounded if there exists a number $M \in \mathbb{N}$ such that $|E|<M$ for all $E \in \mathcal{T}_{\pi}^{S C}$ ). The following theorem clarifies that this is true even if partial order reduction is employed. We call a set of interleavings $I \subseteq E v t^{*}$ a partial-order-complete set for program $\pi$ if for all interleavings $w$ of $\pi$, there exists a $w^{\prime}$ in $I$ such that $E_{w}=E_{w^{\prime}}$.

THEOREM 11. If I is a partial-order-complete set of interleavings for a bounded program $\pi$, then it is representative for $\pi$.

```
type timestamp: array[2*N] of }\mp@subsup{\mathbb{N}}{0}{}\mathrm{ ;
var lc: array[Proc] of timestamp;
    sc: array[Proc] of timestamp;
        mc1: array [Proc] [Adr] of timestamp;
        mc2: array[Adr] of timestamp;
initially lc[*][*] = sc[*][*] = mc1[*][*][*] = mc2[*][*] = 0;
function merge(ts 1, ... ts sn : timestamp) returns timestamp {
    return (max }\mp@subsup{\mp@code{m}}{i}{(ts}[1]),\ldots,\mp@subsup{\operatorname{max}}{i}{}(\mp@subsup{\textrm{ts}}{i}{}[\textrm{N}*2]))
}
function process_event(e : Evt) returns timestamp {
    match e with
        ld(p,i,a,c) ->
            ts := merge(lc[p], mc1[p][a]);
            ts[2*p] := ts[2*p] + 1; // advance load count for p
            lc[p]:= merge(lc[p], ts);
            mc2[a] := merge(mc2[a], ts);
        st (p,i,a,c) ->
            ts := merge(sc[p], lc[p], mc2[a]);
            ts[2*p+1] := ts[2*p+1] + 1; // advance store count for p
            forall q }\not=p\mathrm{ do
                mc1[q][a] := merge(mc1[q] [a], ts);
            mc2[a] := merge(mc2[a], ts);
            sc[p] := merge(sc[p], ts);
        il(p,i,a,c) ->
            ts := merge(sc[p], lc[p], mc2[a]);
            ts[2*p] := ts[2*p] + 1; // advance load count for p
            ts[2*p+1] := ts[2*p+1] + 1; // advance store count for p
            forall q \in Proc do
                mc1[q][a] := merge(mc1[q] [a], ts);
            mc2[a] := merge(mc2[a], ts);
            lc[p] := merge(lc[p], ts);
            sc[p] := merge(sc[p], ts);
    return ts;
}
```

Figure 3. A vector clock for tracking the transitive closure $\rightarrow_{r h b}^{*}$.

PROOF. We now prove Theorem 11. The significance of this theorem arises from the fact that most stateless model checkers, such as Verisoft [11] and CHESS [22], use partial-order reduction techniques $[10,8]$ to only explore a subset of all interleavings of the program. Theorem 11 shows that to prove the store buffer safety of a program it is sufficient to run the monitor algorithm of Section 4.1 only on the set of executions explored by any partial-order reduction algorithm.

We will assume the most commonly used definition of dependent events in a program. Two events are dependent if and only if they are performed by the same thread or access the same memory location with at least one of the accesses being a write. Effectively, two events in an execution are dependent if and only if they are ordered by $\rightarrow_{h b}^{*}$. Two events are independent if they are not dependent on each other. Two interleavings $w$ and $w^{\prime}$ are equivalent if one can be obtained from the other by iteratively commuting two consecutive independent events. Thus, $w$ and $w^{\prime}$ are equivalent if and only if $E_{w}=E_{w^{\prime}}$.

By definition, a partial-order-complete set for program $\pi$ contains at least one representative from each equivalence class of all interleavings of $\pi$. Therefore, the set of interleavings explored by any partial-order reduction algorithm is a partial-order-complete set. Now we can prove Theorem 11.

Let $E$ be an execution in $\mathcal{T}_{\pi}^{S C}$ and let $I$ be any partial-ordercomplete set for $\pi$. Using Definition 4 and the fact that $\pi$ is bounded, we can inductively generate a sequence of executions $E=E_{1}, E_{2}, \ldots, E_{n}$ such that for all $1 \leq i<n$, (1) $E_{i+1} \in \mathcal{T}_{\pi}^{S C}$, (2) $\exists e_{i} \in \operatorname{succ}_{\pi}\left(E_{i}\right): E_{i+1}=E_{i} \cup\left\{\overline{e_{i}}\right\} \wedge \forall e \in E_{i}: e \rightarrow_{h b}^{*} e_{i}$, and (3) $\operatorname{succ}_{\pi}\left(E_{n}\right)=\emptyset$. Informally, this sequence represents the execution of $\pi$ from the program state at $E$ till completion. Let $w$ be an interleaving of $\pi$ such that $E_{w}=E_{n}$. By definition, there exists an interleaving $w^{\prime} \in I$ such that $E_{w^{\prime}}=E_{w}=E_{n}$. Moreover, by construction, $E \subseteq E_{n}$ and there are no $\rightarrow_{h b}$-edges from
$E_{n} \backslash E$ into $E$. Since we started with an arbitrary $E \in \mathcal{T}_{\pi}^{S C}$, we conclude that $I$ is a representative set.

### 4.2 Vector Clocks

The pseudocode in Fig. 2 does not detail how to decide the conditions on lines 11 and 13 . While it is well known how to use vector clocks to compute the transitive closure $\rightarrow_{h b}^{*}$ for a given interleaving of length $n$ in time $O(n N)$, it is not immediately clear how to do the same for $\rightarrow_{r h b}^{*}$. We solved this problem by generalizing vector clocks (Def. 13 below) and by engineering a vector clock instance (Fig. 3) that can compute the transitive closure $\rightarrow_{r b}^{*}$ in time $O\left(n N^{2}\right)$.

THEOREM 12. Let $w=e_{0} \ldots e_{n}$ be an interleaving of some program $\pi$, and let $t_{1}, \ldots, t_{n}$ be the timestamps returned by the corresponding sequence of calls to process_event (Fig. 3). Then $e_{i} \rightarrow_{r h b}^{*} e_{j}$ if and only if $i \leq j$ and $t_{i}[k] \leq t_{j}[k]$ for all $k \in\{1, \ldots, 2 N\}$.

We first describe informally how this vector clock works. Our vector clock uses timestamps of a fixed width (here $2 N$, where $N$ is the maximal number of processors) and maintains a number of clocks (defined as global variables in Fig. 3). The computation of each timestamp follows the following pattern: (1) some of the clocks are read and merged, (2) some positions of the resulting vector are incremented to form the timestamp, and (3) the timestamp is merged back into some of the clocks. The following definition clarifies the conditions that underly this general mechanism (in(e) and out (e) represent the clock sets in step (1) and (3), respectively, and $\operatorname{gps}(e)$ represents the set of positions in step (2)).

Definition 13 (General Vector Clock). Let $\Sigma$ be a set of events, and let $\rightarrow$ be a binary relation on $\Sigma$. A general vector clock for $(\Sigma, \rightarrow)$ is a tuple $(C, G$, in, out, gps) where $C$ is a set of clocks,

| name and description | lines of code | context bound | no. of interleavings total borderline |  | falsification time [s] | verification time [s] |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  | SoBeR | CHESS |
| Fig. 1(b) | 42 | $\infty$ | 10 | 4 |  | $<0.1$ | <0.2 | $<0.2$ |
| dekker <br> (2 threads, 2 critical sections) | 82 | 1 | 5 | 4 | <0.1 | <0.2 | $<0.2$ |
|  |  | 2 | 36 | 23 | $<0.1$ | 0.39 | 0.37 |
|  |  | 3 | 183 | 50 | $<0.1$ | 1.9 | 1.8 |
|  |  | 4 | 1,219 | 124 | $<0.1$ | 13.2 | 13.0 |
|  |  | 5 | 8,472 | 349 | $<0.1$ | 106.0 | 100.6 |
| bakery <br> (2 threads, 3 critical sections) | 122 | 0 | 1 | 1 | $<0.1$ | $<0.2$ | $<0.2$ |
|  |  | 1 | 25 | 20 | $<0.1$ | 0.47 | 0.43 |
|  |  | 2 | 742 | 533 | < 0.1 | 10.3 | 9.8 |
|  |  | 3 | 12,436 | 8,599 | <0.1 | 189.0 | 181.0 |
| takequeue (2 threads, 6 operations) | 374 | 0 | 3 | 0 | not found | $<0.3$ | $<0.3$ |
|  |  | 1 | 47 | 14 | 0.34 | 0.72 | 0.69 |
|  |  | 2 | 402 | 189 | 0.43 | 5.2 | 4.9 |
|  |  | 3 | 2,318 | 1,197 | 0.74 | 28.9 | 27.8 |
|  |  | 4 | 9,147 | 5,321 | 0.84 | 125.5 | 118.9 |
|  |  | 5 | 29,821 | 17,922 | 0.86 | 481.5 | 461.6 |

Figure 4. Experiments on a 2.2 GHz Intel Core Duo laptop running Windows Vista.
$G$ is a set of groups, in, out are functions $\Sigma \rightarrow \mathcal{P}(C)$, and gps is a function $\Sigma \rightarrow \mathcal{P}(G)$ such that the following conditions are satisfied:
(VC1) for all $\sigma \in \Sigma, \operatorname{gps}(\sigma) \neq \emptyset$.
(VC2) for all $g \in G, \rightarrow$ is a total order on $\{\sigma \in \Sigma \mid g \in$ $g p s(\sigma)\}$.
(VC3) for all $\sigma, \sigma^{\prime} \in \Sigma$, we have (out $\left.(\sigma) \cap \operatorname{in}\left(\sigma^{\prime}\right) \neq \emptyset\right) \Leftrightarrow$ ( $\sigma \rightarrow \sigma^{\prime}$ ).
To conclude the proof of Theorem 12, we first connect the general definition of vector clock to Fig. 3. To do so, we define $\Sigma=E v t$ and $\rightarrow=\rightarrow_{r h b}$. Furthermore, we use clocks $C=$ $\{1 c[p], \mathrm{sc}[p], m c 1[p][a], m c 2[a] \mid a \in A d r, p \in \operatorname{Proc}\}$, and groups $G=\left\{r_{1}, w_{1}, r_{2}, w_{2}, \ldots, r_{N}, w_{N}\right\}$, and we define $g p s$ in such a way that $\left(r_{p} \in \operatorname{gps}(e) \Leftrightarrow e \in R(\operatorname{Evt}(p))\right)$ and $\left(w_{p} \in g p s(e) \Leftrightarrow\right.$ $e \in W(\operatorname{Evt}(p)))$. We then define $i n(e)$ to contain the clocks that get merged into the timestamp (in the respective match clause for $e$ in Fig. 3), and out to define the clocks to which a timestamp propagates. Now, (VC1) is satisfied because each access is a read or a write (possibly both), (VC2) is satisfied ( $\rightarrow_{r h b}$ is a total order within each $R(\operatorname{Evt}(p))$ and $W(\operatorname{Evt}(p))$ ), and (VC3) is satisfied (which we can check by manual, pairwise comparison of the match cases).

We now formalize the computation performed by a general vector clock and shows that it gives correct results. Given a general vector clock as in Def. 13, define the set of timestamps $T=\left(\mathbb{N}_{0}\right)^{G}$, define a partial order on timestamps $t_{1} \leq t_{2} \Leftrightarrow(\forall g \in G$ : $\left.t_{1}(g) \leq t_{2}(g)\right)$, and define the operations merge : $\mathcal{P}(T) \rightarrow T$ and increment : $T \times E \rightarrow T$ as follows: merge $(S)(g)=$ $\max _{t \in S} t(g)$ (with corner case $\operatorname{merge}(\emptyset)(g)=0$ ), and

$$
\operatorname{increment}(t, e)(g)= \begin{cases}t(g)+1 & \text { if } g \in g p s(e) \\ t(g) & \text { otherwise. }\end{cases}
$$

Given a sequence of events $\sigma_{1}, \ldots, \sigma_{n}$, we say the vector clock ( $C, G$, in, out, gps) computes the timestamps $t_{1}, \ldots t_{n} \in T$ if for each clock $c \in C$ there exist timestamps $c_{1}, \ldots c_{n} \in T$ such that the following conditions are satisfied:
(t1) $c_{1}(g)=0$ for all $c \in C, g \in G$.
(t2) $t_{i}=\operatorname{increment}\left(\operatorname{merge}\left\{c_{i} \mid c \in \operatorname{in}\left(\sigma_{i}\right)\right\}\right)$ for $1 \leq i \leq n$.
(t3) $c_{i+1}= \begin{cases}\operatorname{merge}\left\{c_{i}, t_{i}\right\} & \text { if } c \in \operatorname{out}\left(\sigma_{i}\right) \\ c_{i} & \text { otherwise }\end{cases}$

$$
\text { for } 1 \leq i \leq(n-1)
$$

THEOREM 14. Let $\Sigma$ be a set of events, let $\rightarrow$ be a binary relation on $\Sigma$, let $\sigma_{1}, \ldots, \sigma_{n} \in \Sigma$ be a sequence of events satisfying
$\left(\sigma_{i} \rightarrow \sigma_{j} \Rightarrow i<j\right)$, and let $t_{1}, \ldots, t_{n} \in T *$ be computed by a vector clock for $(\Sigma, \rightarrow)$. Then $\sigma_{i} \rightarrow^{*} \sigma_{j}$ if and only if $t_{i} \leq t_{j}$.

PROOF. In the following, let $P=\{1, \ldots, n\}$ be the set of positions. For $i \in P$, define the set $M_{i}=\left\{j \in P \mid \sigma_{j} \rightarrow \sigma_{i}\right\}$. To prepare for the proof, we first prove the following two properties for arbitrary $i, j \in P$ and $g \in G$ :

$$
\begin{gather*}
t_{i}=\operatorname{increment}\left(\operatorname{merge}\left\{t_{k} \mid k \in M_{i}\right\}, \sigma_{i}\right)  \tag{1}\\
\left(t_{i}(g) \leq t_{j}(g) \wedge g \in \operatorname{gps}\left(\sigma_{i}\right)\right) \Rightarrow \sigma_{i} \rightarrow^{*} \sigma_{j} \tag{2}
\end{gather*}
$$

To prove (1), note that merge is a "flat" function, meaning that $\operatorname{merge}(\operatorname{merge}(A), B)=\operatorname{merge}(A \cup B)$. Thus merge $\left\{c_{i} \mid\right.$ $\left.c \in \operatorname{in}\left(\sigma_{i}\right)\right\}=\operatorname{merge}\left(\bigcup_{c \in i n\left(\sigma_{i}\right)} \operatorname{merge}\left(\left\{t_{j} \mid(j<i) \wedge c \in\right.\right.\right.$ $\left.\left.\operatorname{out}\left(\sigma_{j}\right)\right\} \cup\left\{c_{1}\right\}\right)=\operatorname{merge}\left\{t_{j}| |^{\prime}(j<i) \wedge \operatorname{out}\left(\sigma_{j}\right) \cap \operatorname{in}\left(\sigma_{i}\right) \neq\right.$ $\emptyset\}=\operatorname{merge}\left\{t_{k} \mid k \in M_{i}\right\}$. By applying this to (t2), we get (1). To prove (2), assume it is not true. Then we can pick $j \in P$ minimal such that there exist $i, g$ such that $t_{i}(g) \leq t_{j}(g)$ and $g \in \operatorname{gps}\left(\sigma_{i}\right)$, yet not $\sigma_{i} \rightarrow^{*} \sigma_{j}$. Clearly, this implies $i \neq j$. Now, first consider the case $g \in \operatorname{gps}\left(\sigma_{j}\right)$. Because $i \neq j$ and by (VC2) and ( $\sigma_{i} \rightarrow \sigma_{j} \Rightarrow i<j$ ), this implies $\sigma_{i} \rightarrow \sigma_{j}$ which is a contradiction. Thus $g \notin g p s\left(\sigma_{j}\right)$. Moreover, $t_{j}(g) \neq 0$ (because $t_{i}(g) \neq 0$ and $t_{i}(g) \leq t_{j}(g)$ by assumption). Thus, the merge in (1) can not be empty, meaning that there exists a $k \in M_{j}$ such that $t_{j}(g)=t_{k}(g)$. By minimality of $j, \sigma_{i} \rightarrow^{*} \sigma_{k}$. But this implies a contradiction.

We now prove the two directions of the theorem. $(\Rightarrow)$. Assume there exist $k_{1}, \ldots, k_{l} \in P$ such that $\sigma_{i}=\sigma_{k_{1}} \rightarrow \cdots \rightarrow \sigma_{k_{l}}=\sigma_{j}$. Then $k_{m} \in M_{k_{m+1}}$ for $1 \leq m<l$. By (1) this implies $t_{k_{1}} \leq$ $\cdots \leq t_{k_{l}}$ and thus $t_{i} \leq t_{j} .(\Leftarrow)$. Assume $t_{i} \leq t_{j}$. Pick an arbitrary $c \in \operatorname{gps}\left(\sigma_{i}\right)$ (which is nonempty by (VC1)). Then $t_{i}(g)=t_{j}(g)$ and thus $\sigma_{i} \rightarrow^{*} \sigma_{j}$ by (2).

## 5. Experiments

We present experimental results for four C\# programs (Fig. 4). The largest one (takequeue) implements a low-lock datastructure and is part of a concurrency library at Microsoft. For all programs, Sober (1) falsified the original version (found that it is not store buffer safe), and (2) verified a fixed version (which we obtained by adding more memory fences whenever Sober showed us a borderline trace) up to some bound on the number of preemptions [22] (column 2).

We make two observations. First, a large percentage of interleavings trip the monitor (columns 3,4). Therefore, a violation is found quickly (column 5). This indicates that our monitor may be useful for falsification even in a plain testing setup (without do-
ing exhaustive space exploration). Second, when verifying a correct program, the number of interleavings and the verification time increase dramatically with the context bound as usual [22]; however, the overhead by the store buffer safety monitor is fairly low in practice (columns 6,7), indicating that it makes sense to turn it on by default within the Chess tool.

## 6. Conclusions and Future Work

We have presented a novel method to verify store buffer safety using a non-intrusive monitor that is run alongside sequentially consistent executions of the program. We have demonstrated that this method is scalable, automatic and precise enough to find store-buffer-related bugs in realistic low-lock code, such as concurrency libraries.

As future work, we consider including memory model relaxations other than store buffers, and we plan to apply our monitor to larger execution traces.

## References

[1] S. Adve and K. Gharachorloo. Shared memory consistency models: a tutorial. Computer, 29(12):66-76, 1996.
[2] M. Ben-Ari. Principles of Concurrent Programming. Prentice Hall Professional Technical Reference, 1982.
[3] S. Burckhardt, R. Alur, and M. Martin. Bounded verification of concurrent data types on relaxed memory models: A case study. In Computer-Aided Verification (CAV), LNCS 4144, pages 489-502. Springer, 2006.
[4] S. Burckhardt, R. Alur, and M. Martin. CheckFence: Checking consistency of concurrent data types on relaxed memory models. In Programming Language Design and Implementation (PLDI), pages 12-21, 2007.
[5] Compaq Computer Corporation. Alpha Architecture Reference Manual, 4th edition, January 2002.
[6] D. Dill, S. Park, and A. Nowatzyk. Formal specification of abstract memory models. In Symposium on Research on Integrated Systems, pages 38-52. MIT Press, 1993.
[7] X. Fang, J. Lee, and S. Midkiff. Automatic fence insertion for shared memory multiprocessing. In International Conference on Supercomputing (ICS), pages 285-294, 2003.
[8] C. Flanagan and P. Godefroid. Dynamic partial-order reduction for model checking software. In POPL 05: Principles of Programming Languages. ACM Press, 2005.
[9] B. Frey. PowerPC Architecture Book v2.02. International Business Machines Corporation, 2005.
[10] P. Godefroid. Partial-Order Methods for the Verification of Concurrent Systems: An Approach to the State-Explosion Problem. LNCS 1032. Springer-Verlag, 1996.
[11] P. Godefroid. Model checking for programming languages using Verisoft. In POPL 97: Principles of Programming Languages, pages 174-186, 1997.
[12] G. Gopalakrishnan, Y. Yang, and H. Sivaraj. QB or not QB: An efficient execution verification tool for memory orderings. In Computer-Aided Verification (CAV), LNCS 3114, pages 401-413, 2004.
[13] M. Hill. Multiprocessors should support simple memory-consistency models. IEEE Computer, 31(8):28-34, 1998.
[14] T. Huynh and A. Roychoudhury. A memory model sensitive checker for C\#. In Formal Methods (FM), LNCS 4085, pages 476-491. Springer, 2006.
[15] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, November 2006.
[16] Intel Corporation. Intel 64 Architecture Memory Ordering White Paper, August 2007.
[17] International Business Machines Corporation. z/Architecture Principles of Operation, first edition, December 2000.
[18] L. Lamport. A new solution of dijkstra's concurrent programming problem. Communications of the ACM, 17(8):453-455, 1974.
[19] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comp., C-28(9):690691, 1979.
[20] J. Manson, W. Pugh, and S. Adve. The Java memory model. In Principles of Programming Languages (POPL), pages 378-391, 2005.
[21] V. Morrison. Understand the impact of low-lock techniques in multithreaded apps. MSDN Magazine, 20(10), October 2005.
[22] M. Musuvathi and S. Qadeer. Iterative context bounding for systematic testing of multithreaded programs. In Programming Language Design and Implementation (PLDI), pages 446-455, 2007.
[23] S. Park and D. L. Dill. An executable specification, analyzer and verifier for RMO (relaxed memory order). In Symposium on Parallel Algorithms and Architectures (SPAA), pages 34-41, 1995.
[24] D. Shasha and M. Snir. Efficient and correct execution of parallel programs that share memory. ACM Trans. Program. Lang. Syst., 10(2):282-312, 1988.
[25] D. Weaver and T. Germond, editors. The SPARC Architecture Manual Version 9. PTR Prentice Hall, 1994.
[26] Y. Yang, G. Gopalakrishnan, and G. Lindstrom. Memory-modelsensitive data race analysis. In International Conference on Formal Engineering Methods (ICFEM), LNCS 3308, pages 30-45. Springer, 2004.
[27] Y. Yang, G. Gopalakrishnan, and G. Lindstrom. Rigorous concurrency analysis of multithreaded programs. In PODC Workshop on Concurrency and Synchronization in Java Programs (CSJP), 2004.
[28] Y. Yang, G. Gopalakrishnan, G. Lindstrom, and K. Slind. Nemos: A framework for axiomatic and executable specifications of memory consistency models. In International Parallel and Distributed Processing Symposium (IPDPS), 2004.

## A. Operational Memory Models

In the following, we present our operational models for $S C$ and TSO. Operational memory models describe what traces are valid by specifying an automaton which emits events as it transitions. Traces are valid with respect to the operational model if and only if there exists an accepting run of the automaton such that the emitted events match the events in the trace. We then prove (Thm. 16 and 17) that the axiomatic definitions of $S C$ and TSO (Def. 2 and 3) are equivalent to the operational models.

We now describe this concept more formally. We call our labeled transition systems trace automata (note that they are not required to be finite), and work over a fixed set Evt of events (we defined this set previously). A trace automaton over Evt is a tuple $A=(S, T$, initial, accept, guard, apply) where $S$ is a set of states, $T \supseteq E v t$ is a set of transitions (transitions in $T \backslash E v t$ are called internal), initial : $S \rightarrow$ bool and accept $: S \rightarrow$ bool are predicates that characterize the initial and accepting states, guard maps each transition $t \in T$ to a boolean function $\operatorname{guard}\langle t\rangle: S \rightarrow$ bool (describing the guard of $t$ ), and apply maps each transition $t \in T$ to a function apply $\langle t\rangle: S \rightarrow S$ (describing the effect of $t$ on the state). We then say a sequence $w=w_{0} \ldots w_{n} \in T^{*}$ is an accepted run by $A$ if there exist states $s_{0}, \ldots, s_{n+1} \in S$ such that all of the following hold:

- $\operatorname{initial}\left(s_{0}\right)=$ true
- $\operatorname{accept}\left(s_{n+1}\right)=$ true
- $\operatorname{guard}\left\langle w_{i}\right\rangle\left(s_{i}\right)=$ true for all $0 \leq i \leq n$
- $\operatorname{apply}\left\langle w_{i}\right\rangle\left(s_{i}\right)=s_{i+1}$ for all $0 \leq i \leq n$

We define $\mathcal{L}(A) \subseteq T^{*}$ to be the set of all accepted runs of $A$. For a sequence $w \in \bar{T}^{*}$, let $(w \uparrow E v t)$ be the subsequence of $w$ that contains only the elements that are in Evt. We then define the set of memory traces of an event automaton $A$ as $\mathcal{T}(A)=\{w \uparrow$ Evt $\mid$ $w \in \mathcal{L}(A)\}$.

We now define the trace automata $A_{S C}$ and $A_{\text {TSO }}$. The automaton $A_{S C}$ has no internal transitions; the automaton $A_{T S O}$ has internal transitions $s t^{c}(p, i, a, c)$ that correspond to the time at which the store $\operatorname{st}(p, i, a, c)$ commits globally (which may happen at a later time than the issuing of the store).

DEFInition 15. Define the automaton $A_{S C}$ as shown in Fig. 5, and define the automaton $A_{\text {TSO }}$ as shown in Fig. 6.

We now prove that the axiomatic definitions of $S C$ and TSO (Def. 2 and 3) are equivalent to the operational models.
THEOREM 16. $\mathcal{T}\left(A_{S C}\right)=\mathcal{T}^{S C}$.

## PROOF.

We prove equality by showing mutual containment, starting with $\mathcal{T}\left(A_{S C}\right) \subseteq \mathcal{T}^{S C}$. Let $E \in \mathcal{T}\left(A_{S C}\right)$. By definition, this implies that there exists an accepting run $w=w_{0} \ldots w_{n} \in E v t^{*}$ of $A_{S C}$, with corresponding state sequence $s_{0}, \ldots, s_{n+1} \in S$, such that $E=\left\{w_{0}, \ldots, w_{n}\right\}$. We now need to show that $E$ is indeed a trace (Def. 1) and that it is sequentially consistent (Def. 2).

First, we show that $E$ is indeed a trace, that is, satisfies conditions (E1), (E2) and (E3) of Def. 1. To do so, we first look at how the $i[p]$ component of the state is updated during the run. A quick look at the pseudocode in Fig. 5 reveals that $i[p]$ starts out as 0 and gets incremented by each $w_{k}$ for which $p\left(w_{k}\right)=p$, and for each such $w_{k}$ we have $i\left(w_{k}\right)=\left(s_{k} \cdot \mathrm{i}[\mathrm{p}]\right)+1$. This implies that for all $p, i$ is an index function on $\left\{w_{k} \mid p\left(w_{k}\right)=p\right\}$ and thus on $E(p)$ (note that the events $w_{k}$ are pairwise unequal because they either go to different processors, or receive a different processor index). (E1) is thus satisfied. Similarly, we can look at the events that modify the $m[a]$ component of the state. We see
that m [a] starts out as 0 and gets incremented by each $w_{k}$ such that $a\left(w_{k}\right)=a$ and $o\left(w_{k}\right) \in\{s t, i l\}$, and for each such $w_{k}$ we have $c\left(w_{k}\right)=\left(s_{k} \cdot \mathrm{~m}[\mathrm{a}]+1\right)$. This implies that for all $a, c$ is an index function on $\left\{w_{k} \mid\left(a\left(w_{k}\right)=a\right) \wedge\left(o\left(w_{k}\right) \in\{s t, i l\}\right)\right\}$ and thus on $W(E, a)$. (E2) is thus satisfied. Finally, we can see that for each $w_{k}$ satisfying $o\left(w_{k}\right)=l d$, the coherence index $c\left(w_{k}\right)$ matches the current $m[a]$, and is thus either equal to 0 (the initial value of $m[a]$ ) or to the value last written to $m[a]$ by a previous store or interlocked event (and thus equal to the coherence index of that event). (E3) is thus satisfied and $E$ is indeed a trace.

Next, we need to show that $E$ is sequentially consistent according to Definition 2, that is, we need to show that $\rightarrow_{h b}$ is acyclic on $E$. Because $i[p]$ and $m[a]$ grow monotonically, we know that for any $w_{i}, w_{j}$, it must be the case that $w_{i} \rightarrow_{p} w_{j} \Rightarrow i<j$ and $w_{i} \rightarrow_{c} w_{j} \Rightarrow i<j$. Therefore, $w_{i} \rightarrow_{h b} w_{j} \Rightarrow i<j$. Thus, $\rightarrow_{h b}$ is acyclic on $\left\{w_{1}, \ldots, w_{n}\right\}=E$. This concludes the proof of $\mathcal{T}\left(A_{S C}\right) \subseteq \mathcal{T}^{S C}$.

We now proceed to the second half of the proof and show $\mathcal{T}^{S C} \subseteq \mathcal{T}\left(A_{S C}\right)$. Given any $E \in \mathcal{T}^{S C}$, let $w=w_{0} \ldots w_{n} \in E v t^{*}$ be a sequence such that $E=\left\{w_{0}, \ldots, w_{n}\right\}$ and such that $i<j$ whenever $w_{i} \rightarrow_{h b} w_{j}$ (such a sequence always exists because $E$ is finite and $\rightarrow_{h b}$ is acyclic by ( SC 1 )). We then claim that $w$ is an accepted run by $A_{S C}$. To show that this is indeed true, consider the state sequence $s_{0}, s_{1}, \ldots, s_{n}$ such that initial ${ }_{S C}\left(s_{0}\right)=$ true and $\operatorname{apply}_{S C}\left\langle w_{i}\right\rangle\left(s_{i}\right)=s_{i+1}$. Then it remains to show that for all $i$, $\operatorname{guard}\left\langle w_{i}\right\rangle\left(s_{i}\right)=$ true. We do this by examining the conditions that appear in these guards individually.

- The condition $i[p]==i-1$ (which appears in all guards of events in $E(p)$ ) is guaranteed to be satisfied because $i[p]$ starts as zero, is incremented by each $w_{i} \in E(p)$, is not modified by any events not in $E(p)$, and (E1) guarantees that $i$ is an index function on $E(p)$.
- The condition $m[a]==c-1$ (which appears in the guards of events in $W(E, a)$ ) is guaranteed to be satisfied because $\mathrm{m}[\mathrm{a}]$ starts as zero, is incremented by each $w_{l} \in W(E, a)$, is not modified by any events not in $W(E, a)$, and (E2) guarantees that $c$ is an index function on $W(E, a)$.
- The condition $m[a]==c$ (which appears in the guards of events in $L(E, a)$ ) is guaranteed for the following reason. Let $w_{x}=$ $\operatorname{ld}(p, i, a, c) \in E$ the load event in question. Note that $\mathrm{m}[\mathrm{a}]$ is only modified by events in $W(E, a)$. Now, by (E3), either $c=0$, or there exists an event $\operatorname{st}\left(p^{\prime}, i^{\prime}, a, c\right) \in E$. If $c=0$, then for all $w_{y} \in W(E, a)$, we have $w_{x} \rightarrow_{c} w_{y}$, which implies $x<y$. Thus m [a] still has the initial value 0 in state $s_{x}$, and the condition is satisfied. On the other hand, if there is an event $w_{y}=\operatorname{st}\left(p^{\prime}, i^{\prime}, a, c\right) \in E$, then by the same argument, $y<x$. Moreover, by the same argument again, all other writes $w_{z} \in W(E, a)$ must satisfy either $z<y$ or $x<z$, which implies that the condition $\mathrm{m}[\mathrm{a}]==\mathrm{c}$ is satisfied in state $s_{i}$.

THEOREM 17. $\mathcal{T}\left(A_{\text {TSO }}\right)=\mathcal{T}^{T S O}$.

## PROOF.

First, let us define the functions issue, commit : Evt $\rightarrow T_{\text {TSO }}$ as follows:

$$
\begin{aligned}
\operatorname{issue}(o(p, i, a, c)) & =o(p, i, a, c) \\
\operatorname{commit}(o(p, i, a, c)) & = \begin{cases}o(p, i, a, c) & \text { if } o \neq s t \\
s t^{c}(p, i, a, c) & \text { if } o=s t\end{cases}
\end{aligned}
$$

These functions help us reason about issue and commit transitions corresponding to events.


Figure 5. The trace automaton $A_{S C}$.


Figure 6. The trace automaton $A_{\text {TSO }}$.

We prove equality by showing mutual containment, starting with $\mathcal{T}\left(A_{T S O}\right) \subseteq \mathcal{T}^{\text {TSO }}$. Let $E \in \mathcal{T}\left(A_{T S O}\right)$. By definition, this implies that there exists an accepting run $w=w_{0} \ldots w_{n} \in T_{T S O}^{*}$ of $A_{\text {TSO }}$, with corresponding state sequence $s_{0}, \ldots, s_{n+1} \in S$, such that $E=\left\{w_{0}, \ldots, w_{n}\right\} \cap E v t$. We now need to show that $E$ is indeed a trace (Def. 1) and that it is totally-store-ordered (Def. 2).

First, we show that $E$ is indeed a trace, that is, satisfies conditions (E1), (E2) and (E3) of Def. 1. To do so, we first look at how the $i[p]$ component of the state is updated during the run. Just as in the proof of Thm. 16, we conclude that (E1) is satisfied (note that the internal transitions $s t^{c}(p, i, a, c)$ do neither read nor update $\left.\mathrm{i}[\mathrm{p}]\right)$. Next, we observe that the $\operatorname{st}(p, i, a, c)$ and $s t^{c}(p, i, a, c)$ events occur in matching pairs, with st preceding $s t^{c}$. Too see why, consider the variable $\mathrm{B}[\mathrm{p}]$ for each $p$. It starts out empty and ends empty (because of the accepting condition). Because $\mathrm{B}[\mathrm{p}]$ is a FIFO queue, we can match the add and pop calls, which gives the claimed match. Now, just as in the proof of Thm. 16, we figure that $c$ is an index function on $\left\{w_{k} \mid\left(a\left(w_{k}\right)=a\right) \wedge\left(o\left(w_{k}\right) \in\left\{s t^{c}\right.\right.\right.$, il $\left.\left.\}\right)\right\}$, and because of the matching this implies that $c$ is an index function
on $\left\{w_{k} \mid\left(a\left(w_{k}\right)=a\right) \wedge\left(o\left(w_{k}\right) \in\{s t, i l\}\right)\right\}$ which implies (E2). Finally, we can see that for each $w_{k}$ satisfying $o\left(w_{k}\right)=l d$, the coherence index $c\left(w_{k}\right)$ matches either the current $\mathrm{m}[\mathrm{a}]$ or the current $\mathrm{ml}[\mathrm{a}]$. It is thus either equal to 0 (the initial value of $\mathrm{m}[a]$ and $\mathrm{ml}[\mathrm{a}]$.) or to the value last written to $\mathrm{m}[\mathrm{a}$ ] or ml [a] by a previous store or interlocked event (and thus equal to the coherence index of that event). (E3) is thus satisfied and $E$ is indeed a trace.

Next, we need to show that $E$ satisfies conditions (TSO1) and (TSO2). Let $W=\left\{w_{0}, \ldots, w_{n}\right\}$, and define the total order $<$ on $W$ by $w_{i}<w_{j} \Leftrightarrow i<j$. Because (as remarked earlier) $s t(p, i, a, c)$ and $s t^{c}(p, i, a, c)$ events occur in matching pairs, we know $\operatorname{commit}(E) \subseteq W$ and for all $e \in E$, commit $(e) \geq e$. We now prove that $\rightarrow_{r h b}$ is acyclic on $E$ by showing the following claim:

$$
\text { for all } e, e^{\prime} \in E:\left(e \rightarrow_{r h b} e^{\prime}\right) \Rightarrow\left(\operatorname{commit}(e)<\operatorname{commit}\left(e^{\prime}\right)\right)
$$

We prove it by doing a case distinction on $e \rightarrow_{r h b} e^{\prime}$.

- $\left[e \rightarrow_{p} e^{\prime}\right]$. Then $e, e^{\prime} \in E(p)$ for some $p$. Because i [p] grows monotonically and is incremented by the events in $E(p)$, we know $e<e^{\prime}$. Now, do another case distinction.
- $[o(e) \neq s t]$ In this case $\operatorname{commit}(e)=e<e^{\prime} \leq$ commit $\left(e^{\prime}\right)$.
- $\left[o(e)=o\left(e^{\prime}\right)=s t\right] e$ and $e^{\prime}$ are pushed into the FIFO buffer $\mathrm{B}[\mathrm{p}]$ and popped by commit $(e)$ and $\operatorname{commit}\left(e^{\prime}\right)$, so their relative order is preserved: $e<e^{\prime} \Rightarrow \operatorname{commit}(e)<$ commit ( $e^{\prime}$ ).
- $\left[o(e)=s t\right.$ and $\left.o\left(e^{\prime}\right)=1 d\right]$ This case is not possible by definition of $\rightarrow_{r h b}$.
- $\left[o(e)=\right.$ st and $o\left(e^{\prime}\right)=$ il $]$ The guard condition of $e^{\prime}$ requires that the buffer $\mathrm{B}[\mathrm{p}]$ be empty, therefore $e<$ $\operatorname{commit}(e)<e^{\prime}=\operatorname{commit}\left(e^{\prime}\right)$.
- $\left[e \rightarrow_{c} e^{\prime}\right]$. Then $a(e)=a\left(e^{\prime}\right)$ for some $a \in \operatorname{Adr}$. We do a further case distinction.
- $\left[e, e^{\prime} \in W(E, a)\right]$ Because $\mathrm{m}[\mathrm{a}]$ grows monotonically and is incremented by the events in $\operatorname{commit}(W(E, a))$, we can see that $e \rightarrow_{c} e^{\prime} \Rightarrow \operatorname{commit}(e)<\operatorname{commit}\left(e^{\prime}\right)$.
- $\left[o(e)=l d\right.$ and $\left.e^{\prime} \in W(E, a)\right]$ Because $c(e)<c\left(e^{\prime}\right)$, and because $\mathrm{m}[\mathrm{a}$ ] grows monotonically, it can not be the case that commit $\left(e^{\prime}\right)$ happens before $e$. Thus commit $(e)=e<$ $\operatorname{commit}\left(e^{\prime}\right)$.
- $\left[o(e)=1 d\right.$ and $\left.o\left(e^{\prime}\right)=1 d\right]$ This case is not possible by definition of $\rightarrow{ }_{c}$.
- $\left[e \in W(E, a)\right.$ and $o\left(e^{\prime}\right)=l d$ and $\left.c\left(e^{\prime}\right)=c(e)\right]$. In the execution of $e^{\prime}$ it can not be the case that $\mathrm{ml}[\mathrm{p}, \mathrm{a}]==$ $c\left(e^{\prime}\right)$ because $e \notin E\left(p\left(e^{\prime}\right)\right)$ by definition of $\rightarrow_{r h b}$, and no other store could have written that value to $\mathrm{ml}[\mathrm{p}, \mathrm{a}]$ (by $\mathrm{E} 2)$. Thus it must be the case that $\mathrm{m}[\mathrm{a}]==c\left(e^{\prime}\right)$ which implies commit $(e)<e^{\prime}=\operatorname{commit}\left(e^{\prime}\right)$.
- $\left[e \in W(E, a)\right.$ and $o\left(e^{\prime}\right)=l d$ and $\left.\operatorname{cind}\left(e^{\prime}\right)>c(e)\right]$ In this case there must exist a $e^{\prime \prime} \in W(E, a)$ such that $e \rightarrow_{c} e^{\prime \prime}$ and $e^{\prime \prime} \rightarrow_{c} e^{\prime}$ and $\operatorname{cind}\left(e^{\prime \prime}\right)=c\left(e^{\prime}\right)$, and we can apply the respective subcases.

This concludes the proof of (TSO1). It remains to show that (TSO2) holds. Suppose $e \rightarrow_{p} e^{\prime}$. Then as before, $e, e^{\prime} \in E(p)$ for some $p$, and $a(e)=a\left(e^{\prime}\right)=a$ for some $a \in$ Adr. Furthermore, $e \rightarrow_{p} e^{\prime}$ implies $e<e^{\prime}$ because $\mathrm{i}[\mathrm{p}]$ is updated monotonically. Assume $e^{\prime} \rightarrow_{c} e$. The following case distinction shows that a contradiction results, which proves (TSO2) and concludes the proof of $\mathcal{T}\left(A_{\text {TSO }}\right) \subseteq \mathcal{T}^{\text {TSO }}$.

- $\left[e \in S(E)\right.$ and $\left.e^{\prime} \in L(E)\right]$. First, observe that $\mathrm{ml}[\mathrm{p}, \mathrm{a}]$ grows monotonically because the stores commit in the order issued. Now, because the execution of $e^{\prime}$ reads $\mathrm{ml}[\mathrm{p}, \mathrm{a}]$, the assumption $e^{\prime} \rightarrow_{c} e$ implies $e^{\prime}<e$ which is a contradiction.
- $\left[e \in L(E)\right.$ and $e^{\prime} \in S(E)$ and $\left.c(e)=c\left(e^{\prime}\right)\right]$. Then the execution of $e$ must satisfy either $\mathrm{ml}[\mathrm{p}, \mathrm{a}]==c(e)$ or $\mathrm{m}[\mathrm{a}]==$ $c(e)$. In either case, it follows that $e^{\prime}<e$ which is a contradiction.
- $\left[e \in L(E)\right.$ and $\left.e^{\prime} \in S(E)\right)$ and $\left.c\left(e^{\prime}\right)<c(e)\right]$ In this case, there must exist a $e^{\prime \prime} \in W(E, a)$ such that $e^{\prime} \rightarrow_{c} e^{\prime \prime}$ and $e^{\prime \prime} \rightarrow_{c} e$ and $\operatorname{cind}\left(e^{\prime \prime}\right)=c(e)$. If $e^{\prime} \notin E(p), \mathrm{a} \rightarrow_{r h b^{-}}$ cycle results and (TSO1) is violated, which is a contradiction. Otherwise, it must be the case that $e^{\prime} \rightarrow_{p} e^{\prime \prime}$ (because stores commit in the order issued) and we can apply the previous case.
- [Other cases] In all other cases, a $\rightarrow_{r h b}$-cycle results and (TSO1) is violated, which is a contradiction.

We now proceed to the second half of the proof and show $\mathcal{T}^{T S O} \subseteq \mathcal{T}\left(A_{T S O}\right)$. Now, suppose we are given a trace $E \in \mathcal{T}^{T S O}$. Then we define the transition set $E^{\prime}=i s s u e(E) \cup \operatorname{commit}(E)$ (note that this is not a disjoint union because issue $(e)=\operatorname{commit}(e)=$ $e$ for events $e \notin S(E)$ ). Now, let $\rightarrow$ be the least binary relation on $E^{\prime}$ such that the following conditions are satisfied for all $e_{1}, e_{2} \in E$ :
(R1) If $e_{1} \rightarrow_{r h b} e_{2}$ then $\operatorname{commit}\left(e_{1}\right) \rightarrow \operatorname{commit}\left(e_{2}\right)$
(R2) If $e_{1} \rightarrow_{p} e_{2}$ then issue $\left(e_{1}\right) \rightarrow$ issue $\left(e_{2}\right)$
(R3) If $\operatorname{commit}\left(e_{1}\right) \neq e_{1}$ then $e_{1} \rightarrow \operatorname{commit}\left(e_{1}\right)$
We now claim that $\rightarrow$ is acyclic on $E^{\prime}$, by showing that a cycle leads to a contradiction. Let $e_{1} \rightarrow \ldots \rightarrow e_{k-1} \rightarrow e_{k}=e_{1}$ be a minimal cycle in $E^{\prime}$. Then $k \geq 2$ because none of the rules (R1),(R2),(R3) introduce self-loops. Now, distinguish the following cases to conclude that $\rightarrow$ is acyclic on $E^{\prime}$ :

- $\left[o\left(e_{i}\right) \neq\right.$ st for all $\left.i\right]$. This implies that we can find $e_{i}^{\prime} \in E$ such that $e_{i}=\operatorname{commit}\left(e_{i}^{\prime}\right)$. Now, $e_{i} \rightarrow e_{i+1}$ implies $e_{i}^{\prime} \rightarrow_{\text {rhb }} e_{i+1}^{\prime}$ (because the $\rightarrow$ edge was either produced by (R1), or it was produced by (R2) in which case $e_{i}^{\prime} \rightarrow_{p} e_{i+1}^{\prime}$ implies $e_{i}^{\prime} \rightarrow_{r h b}$ $e_{i+1}^{\prime}$ because $\left.o\left(e_{i}^{\prime}\right) \neq s t\right)$. Thus the $e_{i}^{\prime}$ form a $\rightarrow_{r h b}$ cycle on $E$ which contradicts (TSO1).
- $\left[o\left(e_{i}\right)=s t\right.$ for some i $]$. Without loss of generality, we assume that the number of $i$ such that $o\left(e_{i}\right)=$ st is minimal, and that $o\left(e_{1}\right)=s t$. Now, it must be the rule (R2) that produces the edge $e_{k-1} \rightarrow e_{1}$ (because neither (R1) nor (R3) match $\left.o p\left(e_{1}\right)=s t\right)$. Thus, there exists a $x \in E$ such that $\operatorname{issue}(x)=e_{k-1}$ and $x \rightarrow_{p} e_{1}$. This means it can not be (R2) that produces $e_{1} \rightarrow e_{2}$ (otherwise we can shorten the cycle, because $\rightarrow_{p}$ is transitive). Clearly (R1) does not match either, so $e_{1} \rightarrow e_{2}$ must be produced by (R3). Therefore $e_{2}=\operatorname{commit}\left(e_{1}\right)$. Now, $x \rightarrow_{p} e_{1}$ implies $x \rightarrow_{r h b} e_{1}$, which implies commit $(x) \rightarrow \operatorname{commit}\left(e_{1}\right)=e_{2}$. Now, if it is the case that $\operatorname{issue}(x)=\operatorname{commit}(x)$, then $e_{k-1}=\operatorname{commit}(x)$ and thus $e_{k-1} \rightarrow e_{2}$ so we can shorten the cycle which is a contradiction. On the other hand, if $\operatorname{issue}(x) \neq \operatorname{commit}(x)$, then $x=\operatorname{issue}(x)=e_{k-1}$ and $x \neq \operatorname{commit}(x)$, so we apply (R3) to get $e_{k-1} \rightarrow \operatorname{commit}(x) \rightarrow \operatorname{commit}\left(e_{1}\right)=e_{2}$. Thus, we have found a cycle with one less store operation which contradicts minimality.
Now, let $w=w_{0} \ldots w_{n} \in T_{\text {TSO }}^{*}$ be a sequence of pairwise distinct transitions such that $\left\{w_{0}, \ldots, w_{n}\right\}=E^{\prime}$ and such that $i<j$ whenever $w_{i} \rightarrow w_{j}$ (such a sequence always exists because $E^{\prime}$ is finite and $\rightarrow$ is acyclic). We then claim that $w$ is an accepted run by $A_{\text {TSO }}$. To show that this is indeed true, consider the state sequence $s_{0}, s_{1}, \ldots, s_{n}$ such that initial ${ }_{T S O}\left(s_{0}\right)=$ true and $\operatorname{apply}_{\text {TSO }}\left\langle w_{i}\right\rangle\left(s_{i}\right)=s_{i+1}$. Then it remains to show that (1) for all $i, \operatorname{guard}_{T S O}\left\langle w_{i}\right\rangle\left(s_{i}\right)=$ true, and (2) $\operatorname{accept}_{T S O}\left(s_{n}\right)=$ true. We do this by examining the conditions that appear in these expressions individually.
- The condition $i[p]==i-1$ (which appears in the guards of events in $E(p)$ ) is guaranteed to be satisfied because $i[p]$ starts as zero and is incremented by each $w_{i} \in E(p)$. Now, (R2) guarantees that the latter are applied in order.
- The condition $m[a]==c-1$ (which appears in the guards of events in $W(E, a)$ ) is guaranteed to be satisfied because $\mathrm{m}[\mathrm{a}]$ starts as zero and is incremented by each $w_{l} \in W(E, a)$. Now, (R1) guarantees that the latter are applied in order.
- The condition $\mathrm{B}[\mathrm{p}]$. length ()$==0$ appearing in accept TSO is satisfied in the final state for all $p$ because (1) the queues are initially empty, ( 2 ) the only transitions that increment/decrement the queue size are $s t(p, i, a, c)$ and $s t^{c}(p, i, a, c)$, respectively, and (3) by construction, $E^{\prime}=\operatorname{issue}(E) \cup \operatorname{commit}(E)$ and
we can thus pairwise match the transitions $\operatorname{st}(p, i, a, c)$ and $s t^{c}(p, i, a, c)$ within $E^{\prime}$, and by (R3) we know that for each such pair, $s t(p, i, a, c)$ precedes $s t^{c}(p, i, a, c)$.
- The condition $B[p]$. length ()$==0$ appearing in the guards of interlocked events by processor $p$ is satisfied because as before, the only transitions that increment/decrement the queue size are $s t$ and $s t^{c}$ events by $p$. But thos are guaranteed to precede the interlocked event, by (R1).
- The condition $B[p]$. peek ()$==s t(p, i, a, c)$ appearing in the guard of $s t^{c}(p, i, a, c)$ is satisfied because (R3) guarantees that $\operatorname{st}(p, i, a, c)$ has been added to the FIFO buffer, and (R1) guarantees that we pop the elements in order.
- The condition $c==\max \{m[a], m l[p, a]\}$ which appears in the guards of load events is guaranteed for the following reason. Let $l=\operatorname{ld}(p, i, a, c) \in E$ be the load event in question. Note that $\mathrm{m}[\mathrm{a}]$ and $\mathrm{ml}[\mathrm{p}, \mathrm{a}]$ are only modified by events in $W(E, a)$. Now, by (E3), either $c=0$, or there exists an event $\operatorname{st}\left(p^{\prime}, i^{\prime}, a, c\right) \in E$. If $c=0$, then for all $s \in W(E, a)$, we have $l \rightarrow_{r h b} s$, which implies that $l$ precedes commit $(s)$ (by (R1)), and that $l$ precedes issue (s) if $p(s)=p$. Thus $\mathrm{m}[\mathrm{a}]$ and $\mathrm{ml}[\mathrm{p}, \mathrm{a}]$ still have the initial value 0 , and the condition is satisfied. On the other hand, if there is an event $s=\operatorname{st}\left(p^{\prime}, i^{\prime}, a, c\right) \in$ $E$, then distinguish the following two cases.
- $\left[p \neq p^{\prime}\right]$ Then $s \rightarrow_{r h b} l$, and all other writes $s^{\prime} \in W(E, a)$ must satisfy either $s^{\prime} \rightarrow_{c} s$ (meaning that $\operatorname{commit}\left(s^{\prime}\right)$ precedes commit $(s)$ by (R1)), or $l \rightarrow_{c} s^{\prime}$ (meaning that commit $\left(s^{\prime}\right)$ trails commit(l) by (R1)). Thus $\mathrm{m}[\mathrm{a}]=c$ when $l$ executes. Now, it remains to show that $m l[p, a]<=$ $c$. This follows trivially if $\mathrm{ml}[\mathrm{p}, \mathrm{a}]$ has its initial value 0 ; otherwise, suppose there is a store $s^{\prime}=\operatorname{st}\left(p, i^{\prime \prime}, a, c^{\prime}\right)$ with $c^{\prime}>c$ that writes to $\mathrm{ml}[\mathrm{p}, \mathrm{a}]$ before $l$. But then $l \rightarrow_{c} s^{\prime}$ which means not $s^{\prime} \rightarrow_{p} l$ by (TSO2), thus $l \rightarrow_{p} s^{\prime}$ and thus issue ( $l$ ) precedes issue ( $s^{\prime}$ ) which is a contradiction.
- $\left[p=p^{\prime}\right]$ Then $s \rightarrow_{r h b} l$, and all other writes $s^{\prime} \in W(E, a) \cap$ $E(p)$ must satisfy either $s^{\prime} \rightarrow_{c} s$ (implying $s^{\prime} \rightarrow_{p} s$ by (TSO2) and thus issue ( $s^{\prime}$ ) precedes issue( $s$ ) by (R2)), or $l \rightarrow_{c} s^{\prime}$ (implying $l \rightarrow_{p} s^{\prime}$ by (TSO2) and thus issue ( $l$ ) precedes issue ( $s^{\prime}$ ) by (R2)). Thus $\mathrm{ml}[\mathrm{p}, \mathrm{a}]=c$ when $l$ executes. Now, it remains to show that $\mathrm{m}[\mathrm{a}]<=c$. This follows trivially if $\mathrm{m}[a]$ has its initial value 0 ; otherwise, suppose there is a store $s^{\prime}=\operatorname{st}\left(p^{\prime \prime}, i^{\prime \prime}, a, c^{\prime}\right)$ with $c^{\prime}>c$ that writes to $\mathrm{m}[\mathrm{a}]$ before $l$. But then $l \rightarrow_{c} s^{\prime}$ which means $l \rightarrow_{r h b} s^{\prime}$ which in turn implies that commit $(l)$ precedes commit $\left(s^{\prime}\right)$ by (R1), which is a contradiction.


[^0]:    ${ }^{1}$ We do not need to include memory fence operations because a full fence is semantically equivalent to an interlocked operation to a location that is not accessed anywhere else.

