2009-11: Soft-Error Resilience for Future-Generation High-Performance Computing Systems

June 27th, 2013

This project aims to develop a soft-error resilience strategy for future-generation high-performance computing (HPC) systems. Soft errors are becoming the predominant source of interruptions in large-scale HPC systems. A double-error detection (DED) event, which normally occurs once every 1-2 million hours of operation in a memory module protected by single-error correction (SEC) error-correcting code (ECC), translates to roughly one such event every 10-20 hours in a system with 100,000 modules. Moreover, vendors have warned that silent data corruption (SDC), i.e., undetected bit flips, is becoming a problem as well. This project targets two solutions for alleviating soft errors in large-scale HPC systems: (1) checkpoint storage virtualization to significantly improve checkpoint/restart times, and (2) software dual-modular redundancy (DMR) to eliminate rollback/recovery in HPC. The checkpoint storage virtualization aggregates a variety of back-end resources, such as flash, memory, or both, and uses them in conjunction with traditional parallel file systems. Applications can use it seamlessly through the standard file system interface with high read/write throughput. The core concept of the DMR technology relies on software-level replication of computational processes using the state-machine replication approach and on process cloning technology for fast recovery.
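The 10-20 hour figure quoted above follows from dividing the per-module interval between DED events by the number of modules. The short Python sketch below reproduces that arithmetic. It assumes that modules fail independently and at a constant rate, which the text implies but does not state; the function name `system_mtbf_hours` is purely illustrative.

```python
def system_mtbf_hours(module_mtbf_hours: float, module_count: int) -> float:
    """Mean time between DED events for the whole system.

    Assumption (not stated in the text): modules fail independently at a
    constant rate, so per-module failure rates simply add up and the
    system-level interval is the per-module interval divided by the count.
    """
    return module_mtbf_hours / module_count


if __name__ == "__main__":
    modules = 100_000  # module count taken from the text
    # Per-module interval between DED events: 1-2 million hours (from the text)
    for module_mtbf in (1_000_000, 2_000_000):
        hours = system_mtbf_hours(module_mtbf, modules)
        print(f"module interval {module_mtbf:>9,} h -> system interval {hours:.0f} h")
```

Under these assumptions, the system sees a DED event about once every 10 hours at the low end of the per-module range and once every 20 hours at the high end.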

Solutions

Participating Institutions

Funding Sources

Important Publications

