Some Literature on Application-Level Error Exposure

January 22, 2011

Many modern resource-intensive applications can be considered “soft”: their computation is inherently approximate. A lossy image or audio compressor, for example, must deal with built-in uncertainty in its data. And if the algorithm makes a few “mistakes”—gets a few pixels wrong, for instance—the user is unlikely to notice (depending on the situation). Machine learning algorithms, vision applications, and signal processing can also exhibit this error-tolerant property.

Many recent research projects have taken advantage of this “soft” category of applications. The approaches to this problem, however, have been varied and have come from several different computer science and electrical engineering research communities. This is a compilation of different views on the issue. It’s unlikely to be complete, though, so please get in touch if you know of other relevant work!

Studies of Application-Level Error Tolerance

A few papers focus on exploring the tolerance of selected applications to transient faults. These studies run applications under a simulation infrastructure that injects faults and measure the resulting output quality (QoS). The papers then advocate for further exploration into mechanisms that exploit this tolerance.

Xuanhua Li and Donald Yeung from Maryland authored a series of papers in this vein (2006, 2007, 2008). de Kruijf (2009) focuses on the disparity between critical and non-critical instructions. Wong (2006) focuses on probabilistic inference applications in particular.

The consensus among these papers is that some parts of the application (some memory regions, some instructions) are much more tolerant to error than others. For instance, corruption in a jump target pointer is likely to be catastrophic, but faults in image pixel data are usually benign.
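To make the methodology concrete, here is a minimal sketch (in Python, and not any of these papers’ actual infrastructure) of this kind of fault-injection experiment: flip a single bit in “soft” data and in “critical” data, and compare the damage.

```python
# Toy fault injection: a single bit flip in pixel data vs. a pointer-like
# value. Illustration only, not any paper's simulator.
import random

def flip_random_bit(value, width):
    """Flip one uniformly random bit in a `width`-bit integer."""
    return value ^ (1 << random.randrange(width))

# "Soft" state: one corrupted 8-bit pixel barely moves a simple quality
# metric such as mean absolute error over the whole image.
pixels = [random.randrange(256) for _ in range(10000)]
corrupted = list(pixels)
i = random.randrange(len(corrupted))
corrupted[i] = flip_random_bit(corrupted[i], width=8)
mae = sum(abs(a - b) for a, b in zip(pixels, corrupted)) / len(pixels)
print("mean absolute error after one pixel bit flip:", mae)

# "Critical" state: the same flip in a 64-bit address or jump target
# usually changes it by an enormous amount, so using it is far more
# likely to crash the program than to merely degrade its output.
addr = 0x7FFEE3A1C040
print("corrupted address:", hex(flip_random_bit(addr, width=64)))
```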

Two of the papers mentioned above were published in a workshop called Silicon Errors in Logic: System Effects (SELSE), which is about as close to a home as this topic has.

Architectural Approximation Techniques

The architecture and VLSI communities have contributed a few techniques for saving energy (and sometimes performance) with circuit- and architecture-level techniques.

One paper (Tong 2000) examines adapting the floating-point mantissa width to suit the application. Because FP computations already incorporate imprecision in the form of rounding, coarsening that imprecision has a predictably mild effect on some applications. Similarly, Alvarez (2005) builds on FP operation memoization, a correctness-preserving energy-saving technique, to propose “fuzzy memoization,” which compromises some accuracy and saves even more energy. A paper by Phillip Stanley-Marbell (2009) at IBM exploits number representations to mitigate the semantic effect of bit flips; the work seeks to provide guaranteed bounds on each value’s deviation from its “correct” value.
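As a rough illustration of the mantissa-width idea (my own sketch, not Tong et al.’s method), one can emulate a narrower mantissa in software by zeroing the low bits of a double and observing how little a simple kernel’s result moves:

```python
# Emulate reduced floating-point precision by truncating the 52-bit
# mantissa of a Python float (an IEEE 754 double). Illustration only.
import math
import struct

def truncate_mantissa(x, kept_bits):
    """Keep only the top `kept_bits` bits of the mantissa of a double."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    mask = ~((1 << (52 - kept_bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    return struct.unpack("<d", struct.pack("<Q", bits & mask))[0]

# A simple "signal energy" kernel: with only 12 mantissa bits kept, the
# result deviates from the exact answer by a tiny relative error.
xs = [math.sin(0.01 * i) for i in range(1000)]
exact = sum(x * x for x in xs)
approx = sum(truncate_mantissa(x, 12) ** 2 for x in xs)
print(exact, approx, abs(exact - approx) / exact)
```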

A group at Illinois proposes “stochastic processors” (Narayanan 2010), which would include logic circuits (e.g., ALUs and FPUs) that are amenable to voltage overscaling, possibly alongside units with strict guarantees. Their group page contains a long list of related work. A paper in HPCA 2010 (Kahng 2010) is of particular interest: it proposes a design technique for processors that gracefully scale their error frequencies in the face of voltage overscaling. The technique relies on reducing the number of near-critical paths.

“Probabilistic CMOS” (Chakrapani 2006, Akgul 2006) is a similar concept from the VLSI community that advocates codesign of the technology, architecture, and application to produce approximate ASICs for particular “soft” applications.

Joe Bates at MIT’s Media Lab purports to have a design for a very-small-area, very-low-power FPU that exhibits transient faults with low absolute value. Details are slim, and there don’t seem to be any publications from the project yet.

Compiler Techniques

On the other end of the computer science spectrum, language and compiler researchers have explored software-only optimizations that trade away strict correctness guarantees. In particular, Martin Rinard’s group at MIT proposes unsound code transformations such as “loop perforation” (Agarwal 2009). A paper at Onward! explores patterns that are amenable to this kind of transformation (Rinard 2010), and another paper proposes “quality-of-service profiling” to help programmers identify code that can be safely relaxed.
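Loop perforation is simple enough to sketch in a few lines. The toy below is my own illustration, not the actual compiler transformation (which picks perforable loops automatically): skip a fraction of a loop’s iterations and accept a slightly degraded result in exchange for proportionally less work.

```python
# Toy loop perforation: execute only every `stride`-th iteration of a
# reduction loop. Illustration only.
def mean_brightness(pixels, stride=1):
    """Average pixel value, visiting only every `stride`-th pixel."""
    sampled = pixels[::stride]
    return sum(sampled) / len(sampled)

pixels = [(i * 37) % 256 for i in range(100000)]
print(mean_brightness(pixels))            # exact answer
print(mean_brightness(pixels, stride=4))  # ~4x less work, small error
```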

Green (Baek 2010) is a different technique that allows the programmer to write several implementations of a single function: a “precise” one and several approximate variants of varying fidelity. A runtime system then monitors application QoS online and adapts dynamically to meet a target QoS value. The main contribution here is an approach to dynamically and holistically controlling a whole application’s output fidelity.
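The flavor of the approach, though not Green’s actual API, looks something like the sketch below: the programmer supplies a precise function plus a cheaper approximation, and a small runtime periodically re-checks the approximation’s error against a quality target, falling back to the precise version if the target is violated.

```python
# A hypothetical Green-style setup (illustration only): use the cheap
# version of a function as long as its sampled error stays under a target.
import math

def dist_precise(x, y):
    return math.sqrt(x * x + y * y)

def dist_approx(x, y):
    # Cheap "alpha max plus beta min" magnitude approximation.
    a, b = max(abs(x), abs(y)), min(abs(x), abs(y))
    return 0.96 * a + 0.4 * b

class QoSRuntime:
    """Calls the approximate version, but every `sample_every` calls it also
    runs the precise version, measures the relative error, and disables the
    approximation if the error exceeds the target."""
    def __init__(self, precise, approx, target_error=0.05, sample_every=100):
        self.precise, self.approx = precise, approx
        self.target_error, self.sample_every = target_error, sample_every
        self.calls = 0
        self.use_approx = True

    def __call__(self, *args):
        self.calls += 1
        if self.use_approx and self.calls % self.sample_every == 0:
            exact, cheap = self.precise(*args), self.approx(*args)
            error = abs(exact - cheap) / (abs(exact) + 1e-12)
            self.use_approx = error <= self.target_error
            return exact
        return self.approx(*args) if self.use_approx else self.precise(*args)

dist = QoSRuntime(dist_precise, dist_approx, target_error=0.05)
print(sum(dist(i * 0.5, i * 0.25) for i in range(1000)))
```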

Language-Exposed Hardware Relaxations

Another category of approaches combines a particular architecture-level loss of accuracy with a programming construct for exploiting it. Relax (de Kruijf 2010) lets the programmer annotate regions of code for which hardware error recovery mechanisms should be turned off. The hardware still performs error detection, however, and the programmer can choose how to handle hardware faults. Flicker (Liu 2009) is distinct in its focus on soft memory rather than logic: it lets the programmer allocate some data in a failure-prone region of memory. The DRAM behind this address space then reduces its refresh rate, saving power but introducing occasional bit flips. Finally, Stanley-Marbell (2006) proposes a parallel architecture that uses language-level error bound expressions to map messages to higher- or lower-reliability communication channels.
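For flavor, here is a purely software model (my own illustration; neither Relax’s nor Flicker’s real interface) of the Flicker idea: the programmer explicitly places error-tolerant data in a storage pool whose reads may occasionally return flipped bits, standing in for low-refresh DRAM.

```python
# Software stand-in for a failure-prone memory region. Illustration only:
# only error-tolerant data (pixels, samples) should be allocated here.
import random

class ApproxStore:
    """Model of low-refresh DRAM: reads occasionally see a flipped bit."""
    def __init__(self, flip_probability=1e-3):
        self.flip_probability = flip_probability
        self.data = []

    def alloc(self, values):
        """Place a list of 8-bit values in the unreliable pool."""
        start = len(self.data)
        self.data.extend(values)
        return start

    def read(self, index):
        value = self.data[index]
        if random.random() < self.flip_probability:
            value ^= 1 << random.randrange(8)   # model an 8-bit cell
        return value

approx_mem = ApproxStore()
base = approx_mem.alloc([random.randrange(256) for _ in range(10000)])
avg = sum(approx_mem.read(base + i) for i in range(10000)) / 10000
print("average brightness despite occasional flips:", avg)
```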

Citations

The following citations are also available as a BibTeX file.