Power and Reliability in Extreme Scale Computing


March 19, 2013


Amin Ansari


University of Illinois at Urbana-Champaign


From the time of its inception, semiconductor technology has witnessed unhindered growth in transistor integration levels. However, in forthcoming CMOS technology generations, this aggressive scaling poses critical reliability issues due to increasing power density and process variation. To address these challenges, this talk will present three reliable, energy-efficient solutions: one for the network-on-chip, one for on-chip caches, and one for the processor pipeline.

Dynamic voltage scaling is commonly used to reduce power consumption. However, the supply voltage cannot be reduced below a certain threshold without addressing the failures that appear. First, I present Tangle, which monitors the error rate observed in the network and, based on its value across different network routes, selectively increases or decreases the voltage of individual voltage domains. With Tangle, the voltages of the different domains continuously adapt to the most energy-efficient, error-free conditions.

Next, I present Archipelago, a highly flexible cache design that can efficiently tolerate the large number of SRAM failures at low voltage by reconfiguring its internal organization. Archipelago partitions the cache into multiple autonomous islands of various sizes, each of which can operate correctly without borrowing redundancy from the others. An adapted version of the minimum clique covering algorithm is used to minimize the amount of cache capacity lost when operating in the low-voltage region.

With proper solutions in place for the network-on-chip and caches, I then present Necromancer, a robust, heterogeneous core-coupling execution scheme that protects the general core area against failures. Although a faulty core cannot be trusted, for most defects the execution traces on a defective core coarsely resemble those of fault-free executions. Consequently, Necromancer exploits a functionally dead core to improve system throughput by having it supply hints about high-level program behavior.
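To make the idea of an error-rate-driven voltage controller concrete, here is a minimal sketch of one control epoch in the spirit of the Tangle description. The abstract does not say how Tangle attributes errors on a route to individual domains, how large its voltage steps are, or what error threshold it uses, so everything here is a hypothetical illustration: the `Route` type, the `adjust_voltages` function, the millivolt step, and the threshold are all assumptions. In particular, the sketch simply blames every domain on an error-prone route and lowers every other domain.

```python
from dataclasses import dataclass


@dataclass
class Route:
    """Hypothetical record of one network route's statistics in one epoch.

    Illustrative only; not Tangle's actual data structure.
    """
    domains: list[str]  # voltage domains the route traverses
    errors: int = 0     # errors detected on this route during the epoch
    packets: int = 0    # packets sent on this route during the epoch


def adjust_voltages(voltages, routes, v_min_mv, v_max_mv, step_mv, threshold):
    """One epoch of a hypothetical error-rate-driven voltage controller.

    voltages: dict mapping domain name -> supply voltage in millivolts.
    Domains crossed by any route whose error rate exceeds `threshold` are
    stepped up; all other domains are stepped down to save energy.
    New values are clamped to [v_min_mv, v_max_mv].
    """
    suspect = set()
    for route in routes:
        rate = route.errors / route.packets if route.packets else 0.0
        if rate > threshold:
            # Simplifying assumption: blame every domain on an error-prone route.
            suspect.update(route.domains)

    new_voltages = {}
    for domain, v in voltages.items():
        if domain in suspect:
            new_voltages[domain] = min(v + step_mv, v_max_mv)
        else:
            new_voltages[domain] = max(v - step_mv, v_min_mv)
    return new_voltages


if __name__ == "__main__":
    voltages = {"A": 900, "B": 900, "C": 900}
    routes = [
        Route(["A", "B"], errors=5, packets=1000),  # 0.5% errors: too high
        Route(["C"], errors=0, packets=1000),       # error-free
    ]
    print(adjust_voltages(voltages, routes, 700, 1000, 25, 0.001))
    # {'A': 925, 'B': 925, 'C': 875}
```

Voltages are kept as integer millivolts so the arithmetic stays exact. Repeating this epoch lets each domain settle near the lowest voltage at which its routes stay below the error threshold. A practical controller would likely step down more cautiously than it steps up, to avoid oscillating around that point.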


Amin Ansari

Amin Ansari is currently a National Science Foundation Computing Innovation Fellow and Postdoctoral Research Associate in the Computer Science Department of the University of Illinois at Urbana-Champaign, working with Prof. Josep Torrellas. His research interests lie in computer architecture, with a focus on reliability and low-power design. He is working on microarchitectural solutions for on-chip caches, the processor pipeline, and the network-on-chip to tackle deep sub-micron technology challenges such as power density, process variation, manufacturing defects, and wearout. He received the Ph.D. degree in Computer Science and Engineering from the University of Michigan under Prof. Scott Mahlke in 2011, and the B.S. degree in computer engineering from Sharif University of Technology in 2007. Amin has published more than 20 papers in top-tier journals and international conferences such as IEEE Transactions on Computers, ISCA, HPCA, MICRO, and DSN. His academic achievements during his graduate studies at the University of Michigan were recognized by the 2010 College of Engineering Distinguished Achievement Award. He received the best paper award at the 27th IEEE International Conference on Computer Design in 2009.