Copyrights to these papers may be held by the publishers. The download files are preprints. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
L. Tang, V. Kumar, M. Ngaw, S. Singh, D. Nadkarni, L. Tammala, K. Mai and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2025)
Towards an Algorithm-based Approach for Soft Error Tolerance using Interval Arithmetic
Comment: Best Paper Award
Preprint (8.9 MB)
Bibtex
Soft errors pose a critical reliability concern for modern electronics in both space and terrestrial applications. Traditional hardware redundancy approaches, such as triple modular redundancy or dual modular redundancy, introduce significant overheads, especially in large modern SoCs, which must balance power, performance, area, and reliability requirements. We propose a new approach to hardware redundancy based on the ideas behind algorithm-based fault tolerance (ABFT). Rather than replicating entire logic modules in the design, we propose using floating-point interval arithmetic to realize redundancy in computational datapaths. Error detection is then performed by leveraging the guarantees provided by interval arithmetic and forward error analysis of the specific algorithm. We demonstrate the technique by protecting a hardware FFT datapath and a systolic array. To evaluate the approach, a silicon test chip is fabricated in a 28 nm process, and post-PnR/simulation results demonstrate area savings of up to 2-3 times.
Keywords: Triple-modular redundancy, Interval arithmetic, Fault tolerance, Radiation-hardening
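The detection idea in the abstract can be illustrated in software with a minimal sketch. This is not the paper's hardware design. It assumes one possible reading of the approach: a multiply-accumulate (as in one systolic-array cell) is carried out on outward-rounded intervals, and a result is flagged when its endpoints are inverted or when its width exceeds a bound of the kind a forward error analysis would supply. The names `Interval`, `interval_dot`, `width_bound`, `is_consistent`, and `flip_bit` are hypothetical, the factor in `width_bound` is a deliberately loose assumption, and `math.nextafter` requires Python 3.9 or later.

```python
import math
import struct
import sys

EPS = sys.float_info.epsilon  # 2**-52 for IEEE 754 double precision


class Interval:
    """Closed interval [lo, hi] with endpoints widened one ulp per operation."""

    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi

    def __add__(self, other):
        # Round-to-nearest is within half an ulp of the exact result,
        # so stepping one ulp outward keeps the enclosure valid.
        return Interval(math.nextafter(self.lo + other.lo, -math.inf),
                        math.nextafter(self.hi + other.hi, math.inf))

    def __mul__(self, other):
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(math.nextafter(min(p), -math.inf),
                        math.nextafter(max(p), math.inf))


def interval_dot(xs, ys):
    """Dot product accumulated in interval arithmetic."""
    acc = Interval(0.0)
    for x, y in zip(xs, ys):
        acc = acc + Interval(x) * Interval(y)
    return acc


def width_bound(xs, ys):
    """Loose fault-free width bound (assumed constant, not from the paper)."""
    n = len(xs)
    scale = sum(abs(x * y) for x, y in zip(xs, ys))
    return 4 * (n + 1) * EPS * scale + 4 * (n + 1) * sys.float_info.min


def is_consistent(result, bound):
    """True if endpoints are ordered and the width is within the bound."""
    return result.lo <= result.hi and (result.hi - result.lo) <= bound


def flip_bit(x, bit):
    """Flip one bit of a double to emulate a soft error."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y


if __name__ == "__main__":
    xs = [1.5, -2.25, 3.0, 0.125]
    ys = [4.0, 0.5, -1.0, 8.0]
    result = interval_dot(xs, ys)
    bound = width_bound(xs, ys)
    print("fault-free consistent:", is_consistent(result, bound))
    faulty = Interval(result.lo, flip_bit(result.hi, 40))
    print("after bit flip consistent:", is_consistent(faulty, bound))
```

In this sketch the two endpoints act as the redundant copies: a bit flip that moves one endpoint by more than the bound, or past the other endpoint, is flagged. Flips in the lowest mantissa bits stay inside the bound and go undetected, which is also the regime where their numerical effect is comparable to ordinary rounding error.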