
Microsoft Research Lab – Asia

Forerunner: The First Many-Future Speculative Execution Technique

November 15, 2021

SOSP is a top-tier biennial conference in the systems research area. At the recent SOSP 2021, researchers from MSR Asia, together with collaborators from Zhejiang University, presented the paper “Forerunner: Constraint-Based Speculative Transaction Execution for Ethereum.” The technology is intellectually intriguing because it represents the first many-future speculative execution technique, which generalizes how pre-executions can be leveraged to accelerate the actual execution.


Link to paper: https://www.microsoft.com/en-us/research/publication/forerunner-constraint-based-speculative-transaction-execution-for-ethereum/

This paper explains how MSR researchers developed Forerunner to accelerate the transaction execution of the Ethereum blockchain. They implemented the technique as a transaction accelerator in an Ethereum node and conducted the evaluation on the blockchain’s live traffic. Results show that such a generalized construction can be very profitable in a real system. This accelerated execution capability can be used to increase Ethereum’s throughput in its foundational layer. The code and data sets are publicly available, and the results are reproducible as verified in the artifact evaluation of SOSP.

Next, we will present a recap of the basics of speculative execution and then introduce Forerunner’s core ideas.

Basics: Speculative Execution

Speculative execution has been successfully applied to many systems. The idea can be employed whenever we have a piece of code that we know will be executed later, but whose input, which determines the execution results, will only be fully known at execution time. By predicting the input, we can pre-execute the code on spare or idle resources. If the prediction later proves correct, the actual execution can be skipped and the pre-execution results returned, accelerating the process. If the prediction turns out to be wrong, however, the pre-execution is wasted and the check itself incurs a slowdown. Traditionally, speculative execution has mostly been applied in single-future environments, where simply betting on the most likely future is a feasible strategy.
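The predict–pre-execute–check cycle above can be sketched in a few lines of Python. This is a toy illustration only; `slow_compute`, `predict_input`, and the equality check are stand-ins, not Forerunner's actual mechanism:

```python
def slow_compute(x):
    # Stand-in for the expensive code whose input is not yet known.
    return sum(i * x for i in range(1_000))

def speculative_run(predict_input, get_actual_input):
    predicted = predict_input()        # guess the input ahead of time
    cached = slow_compute(predicted)   # pre-execute on idle resources
    actual = get_actual_input()        # the real input arrives later
    if actual == predicted:            # cheap check: was the guess right?
        return cached                  # hit: skip the actual execution
    return slow_compute(actual)        # miss: pay full cost, plus the check
```

On a hit, only the cheap equality check runs on the critical path; on a miss, the pre-execution is wasted work.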


Figure 1: Speculative Execution Basics

Problem: Many-Future Speculative Execution

In this work, we consider the problem of many-future speculative execution, motivated by the opportunities and challenges of transaction processing on the Ethereum blockchain. Ethereum is a programmable blockchain: its transactions invoke Turing-complete smart contracts, whose non-trivial execution is a bottleneck. Ethereum operates in a Dissemination-Consensus-Execution model, or in brief, a DiCE model. In this model, transactions are first disseminated through broadcasting, then accepted into a chain of blocks by a decentralized consensus process, and only after that executed sequentially by all Ethereum nodes. The dissemination-to-execution window creates a natural opportunity for speculative execution; however, because dissemination and consensus are decentralized, there are many possible orders in which a transaction can be accepted into the blockchain, leading to many different possible future executions. Usually, none of these futures has a dominant probability. As long as we make only a single prediction per transaction, the accuracy cannot be high no matter which future we bet on. This many-future situation is challenging for the traditional single-future speculative execution technique.
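To see why a single prediction caps accuracy, consider a toy probability distribution over possible futures. The numbers below are made up for illustration, not measured data; the point is that a one-guess strategy can never beat the probability of the likeliest future, while covering several futures at once can:

```python
# Hypothetical distribution over five possible acceptance orders of a
# transaction; no single future dominates.
futures = {"order_A": 0.25, "order_B": 0.20, "order_C": 0.20,
           "order_D": 0.20, "order_E": 0.15}

# Best a single-prediction strategy can do: always bet on the likeliest future.
best_single_accuracy = max(futures.values())

# Covering the top few futures at once does far better.
top3_coverage = sum(sorted(futures.values(), reverse=True)[:3])
```

Here the best single guess is right only 25% of the time, while covering three futures already reaches 65%.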

Figure 2: Many-Future Challenge in Ethereum

Intuition: Cross-Future Acceleration

Intuitively, we can address the many-future challenge by finding a way for one, or very few, pre-executions to cover a large fraction of the space of possible futures. This essentially requires cross-future acceleration: leveraging the pre-execution in one predicted future to accelerate the execution not only in that specific future, but also in other futures. In this sense, traditional speculative execution provides only same-future acceleration. Of course, we cannot expect the speedup achieved in a cross-future fashion to be as high as in the same-future case. However, it is reasonable to expect that the greater the similarity between the predicted future and the actual one, the higher the acceleration.

Once we have multiple pre-executions, each in a distinct speculative future, we can imagine another kind of cross-future acceleration, which is, in some sense, the dual of the one-to-many idea just described: the ability to make use of as many pre-executions as are available to better accelerate the one future that actually happens. This many-to-one idea is about breaking the pre-executions apart and reassembling the parts to approximate the actual execution as closely as possible, using the increased similarity to achieve higher acceleration.
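The many-to-one idea can be sketched in Python under a deliberately unrealistic simplification: an execution splits into independent segments keyed by their inputs, so a segment result remembered from *any* pre-executed future can be reused. The segment model and `memo` table are illustrative inventions, not the paper's design:

```python
def run_segment(seg_id, seg_input):
    # Stand-in for the real work done by one segment of an execution.
    return seg_input * seg_input + seg_id

def execute(seg_inputs, memo):
    # memo maps (seg_id, seg_input) -> result remembered from pre-executions.
    results, reused = [], 0
    for seg_id, seg_input in enumerate(seg_inputs):
        key = (seg_id, seg_input)
        if key in memo:
            results.append(memo[key])      # reuse a memoized segment
            reused += 1
        else:
            results.append(run_segment(seg_id, seg_input))
    return results, reused

# Two pre-executed futures, each matching the actual future on some segments:
memo = {}
for future in ([2, 5, 9], [2, 6, 7]):      # speculative segment inputs
    for seg_id, seg_input in enumerate(future):
        memo[(seg_id, seg_input)] = run_segment(seg_id, seg_input)

actual = [2, 6, 9]                         # neither future matches in full...
results, reused = execute(actual, memo)    # ...yet all segments are reused
```

Neither pre-executed future predicts the actual one, but recombining their parts covers it entirely, which is the combinatorial benefit the next paragraph describes.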

Ideally, this should allow us to combinatorially compose a large number of candidate executions out of a few actual pre-executions, increasing the likelihood that we hit the actual future, or something close to it.

Figure 3: “One-to-Many” and “Many-to-One” Cross-Future Acceleration

Insight: Defining CD-Equivalence to Enable Cross-Future Acceleration

To achieve cross-future acceleration, we would need to identify the inherent similarities connecting the futures in such a way that allows the computation done in one future to be used to accelerate the execution in another. Our key insight is to view each possible future execution as an instruction trace and employ a structural equivalence of the traces to achieve cross-future acceleration.

The structural equivalence we propose is called CD-Equivalence. Two executions are CD-Equivalent if and only if their traces have identical Control flow and Data flow; that is, the two executions follow exactly the same sequence of instructions, and the cross-instruction data dependencies are also exactly the same. CD-Equivalence is an equivalence relation, which partitions all possible future executions into equivalence classes. By definition, all the executions in a class follow the same CD-Equivalent instruction trace, meaning that if we could somehow turn any of these traces into an executable program, it would produce the same results as the original transaction code in every future of the same class.

A very important and useful property of such a trace program is that, being specific to one CD-Equivalence class, it is essentially a fixed sequence of unrolled and inlined instructions with fully determined data dependencies, which makes it highly optimizable. This means we can record the trace of a pre-execution in a speculated future and aggressively specialize the trace into a much shorter and more efficient fast path program for a whole class of futures. To make sure a fast path is only used in futures of the right class, a set of CD-Equivalence constraints is checked before its execution; in essence, an equivalence class is uniquely characterized by such a set of constraints. This constraint-based fast path scheme is our core mechanism for achieving cross-future acceleration, onto which many more innovations can be added to significantly improve efficiency.
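The constraint-guarded fast path can be illustrated with a deliberately simple Python sketch of a toy transfer. The trace and constraint representations here are invented for illustration; the real system operates on EVM instruction traces, not Python functions:

```python
def transfer_original(state, sender, receiver, amount):
    # Original code, with full control flow.
    if state[sender] >= amount:
        state[sender] -= amount
        state[receiver] = state.get(receiver, 0) + amount
        return "ok"
    return "insufficient"

def make_fast_path(sender, receiver, amount):
    # Constraints that pin down the CD-Equivalence class of a recorded
    # trace in which the `balance >= amount` branch was taken.
    def constraints_hold(state):
        return state[sender] >= amount       # branch outcome must match trace

    def fast_path(state):
        # Straight-line, branch-free specialization of the recorded trace.
        state[sender] -= amount
        state[receiver] = state.get(receiver, 0) + amount
        return "ok"

    return constraints_hold, fast_path

def transfer_accelerated(state, sender, receiver, amount,
                         constraints_hold, fast_path):
    if constraints_hold(state):              # cheap check before the fast path
        return fast_path(state)
    # Wrong class: fall back to the original code.
    return transfer_original(state, sender, receiver, amount)
```

Note that any future satisfying the constraint reuses the same fast path, even if the concrete balances differ from those seen during pre-execution; that is what makes the scheme cross-future rather than single-future.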

Another very important reason we chose CD-Equivalence comes from a key observation about Ethereum’s real workload: it is common for many different futures of an Ethereum transaction to have CD-Equivalent executions. A concrete example in our paper illustrates this insight. This observation means that the chance for the actual execution to be CD-Equivalent to one of the pre-executions—and thus be accelerated—is high.

Figure 4: CD-Equivalence

Key Techniques: Multi-Trace Specialization with Memoization

Our solution is a novel multi-trace speculative program specialization technique, enhanced by a new form of memoization. It traces pre-executions and specializes the traces into an accelerated program, or AP, which consists of constraint-checking code followed by a fast path program. It then adds shortcuts to the AP code that can further skip segments of the AP execution based on values remembered from the pre-executions.

The following are several highlights in the design of Forerunner:

  1. The specialized AP code is much shorter than the original trace. On average, its length is less than one tenth of the original trace.
  2. A key innovation in the specialization process is to retarget specialized AP code to a hugely simplified register-only and control-flow-free VM, which has much higher execution efficiency than the original VM that needs to support the full semantics of the original code. This means that the much shorter AP code also runs in a more efficient VM.
  3. The AP execution is roll-back free. AP won’t make any externally visible changes until it is sure that the future can be accelerated by its fast paths, meaning that upon speculation failure, we can immediately fall back to the original code without reverting anything.
  4. We have an innovative way of performing merged constraint checking for APs that cover multiple equivalence classes with multiple fast paths. This ensures that the constraint checking cost does not grow with the number of classes covered.
  5. The shortcuts added are very flexible and powerful. They can skip different segments of an AP execution independently, based on values remembered from different pre-executed futures. This is how we achieve many-to-one cross-future acceleration. In our evaluation, we observe that more than 80 percent of AP code can be skipped by the shortcuts. In the best-case scenario of a perfect prediction, shortcuts can skip all computation instructions of an AP, rendering it almost as efficient as the traditional approach. Our approach can thus be viewed as a generalization of traditional speculative execution.
  6. Last but not least, the specialization and memoization process is designed to run very fast, so that it can generate APs in time, before the actual execution needs to happen.
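The register-only, control-flow-free VM and the value-guarded shortcuts can be made concrete with a minimal Python sketch. The instruction and shortcut formats below are invented for illustration and are not Forerunner's actual encodings:

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def run_ap(code, regs, shortcuts=()):
    # code: flat list of (dest, op, src1, src2) — no branches, no loops.
    # shortcuts: (start, end, guard_reg, remembered_in, results) — if the
    # guard register still holds the value seen during pre-execution, write
    # the remembered results and skip instructions start..end.
    pc, skipped = 0, 0
    while pc < len(code):
        jumped = False
        for start, end, guard, remembered_in, results in shortcuts:
            if pc == start and regs.get(guard) == remembered_in:
                regs.update(results)          # reuse memoized segment outputs
                skipped += end - start
                pc, jumped = end, True
                break
        if jumped:
            continue
        dest, op, a, b = code[pc]             # straight-line interpretation
        regs[dest] = OPS[op](regs[a], regs[b])
        pc += 1
    return regs, skipped

# A 4-instruction straight-line AP; one shortcut covers instructions 0..3
# when r0 still matches the pre-executed value 6.
code = [("r1", "mul", "r0", "r0"),
        ("r2", "add", "r1", "r0"),
        ("r3", "sub", "r2", "r1"),
        ("r4", "mul", "r3", "r2")]
shortcut = (0, 4, "r0", 6, {"r1": 36, "r2": 42, "r3": 6, "r4": 252})
regs, skipped = run_ap(code, {"r0": 6}, shortcuts=[shortcut])
```

When the guard matches, the whole segment is skipped (the perfect-prediction case); when it does not, the straight-line code still runs with no branch handling or control-flow machinery.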
Figure 5: Technical Design of Forerunner

Implementation and Evaluation on Ethereum

To validate our approach, we implemented Forerunner as a transaction accelerator for an Ethereum node and measured the speedups achieved. The implementation works reliably in a real Ethereum node and correctly handles all real-world Ethereum transactions and smart contracts. The highlight of our evaluation is that it was conducted in live Ethereum nodes on the blockchain’s live traffic as it happened in real time. This exposes our approach to real-world code and data, real-world uncertainty about futures, and a real-world time budget for pre-execution and other pre-processing tasks, which is essential for understanding the true effectiveness of a speculative execution technique. To achieve any speedup, our technique had to finish everything, from prediction to specialization and memoization, correctly and quickly enough in a live system.

Our main evaluation ran continuously for 10 days, processing 13 million transactions. The average speedup achieved over all these transactions was 6.1. Notably, 5% of the transactions were never heard by our nodes during the dissemination phase, so there was no chance to pre-execute them in the live evaluation. Excluding these unheard transactions, the effective average speedup achieved by our approach was 8.4. In contrast, the traditional approach achieves a speedup of only 2 on the same workload, as its single-future method fails approximately 50% of the time due to the large number of many-future transactions. Our approach, on the other hand, can accelerate almost all of these many-future transactions: in total, 98.41% of the transactions were accelerated. This result shows that our more sophisticated approach is crucial to breaking through the speedup limit imposed by many-future transactions. We believe that further engineering efforts may lead to even higher speedups.
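As a back-of-the-envelope check on how the two averages relate (assuming, as a simplification, that every transaction has the same baseline execution cost; the paper's exact averaging may differ), a time-weighted average in which the 5% unheard transactions run at speedup 1 is consistent with the reported figures:

```python
# Time-weighted overall speedup = total baseline time / total accelerated time.
heard, unheard = 0.95, 0.05     # fractions of transactions
effective_speedup = 8.4         # speedup on heard (pre-executed) transactions

overall = 1.0 / (heard / effective_speedup + unheard / 1.0)
# overall comes out near 6.1, consistent with the reported average
```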


Conclusion and Future Work

By addressing the many-future challenge, Forerunner enables speculative execution to be used as a profitable building block in the construction of high-throughput blockchain systems. Before this vision can be fully realized, future work needs to carefully study: a) the tradeoffs between the costs incurred and the speedups achieved; b) further optimizations to achieve Forerunner’s full performance potential at as low a cost as possible; c) the problems that arise when integrating with concrete public or consortium blockchains. We hope the insights behind Forerunner can inspire new techniques that are useful in scenarios beyond blockchain.