2009-11: Soft-Error Resilience for Future-Generation High-Performance Computing Systems
This project aims to develop a soft-error resilience strategy for future-generation high-performance computing (HPC) systems. Soft errors are becoming the predominant source of interruptions in large-scale HPC systems. A double-error detection (DED) event normally occurs once every 1-2 million hours of operation in a memory module protected by a single-error correction (SEC) error-correcting code (ECC). In a system with 100,000 such modules, this translates to one DED event every 10-20 hours. Moreover, vendors have warned that silent data corruption (SDC), i.e., undetected bit flips, is becoming a problem as well. This project targets two solutions for alleviating soft errors in large-scale HPC systems: (1) checkpoint storage virtualization to significantly improve checkpoint/restart times, and (2) software dual-modular redundancy (DMR) to eliminate rollback/recovery in HPC. The checkpoint storage virtualization aggregates a variety of back-end resources, such as flash, memory, or both, and uses them in conjunction with traditional parallel file systems. Applications can use it seamlessly through the standard file system interface with high read/write throughput. The core concept of the DMR technology relies on software-level replication of computational processes using the state-machine replication approach and on process cloning technology for fast recovery.
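The scaling from per-module to system-wide DED intervals can be checked with a short calculation. The sketch below is illustrative only: it assumes modules fail independently at a constant rate, so the system-wide mean time between events is the per-module interval divided by the module count. The function name `system_interval_hours` is hypothetical and not part of the project's software.

```python
def system_interval_hours(module_interval_hours: float, num_modules: int) -> float:
    """Mean time between DED events across the whole system.

    Assumption (not stated in the project description): modules fail
    independently at a constant rate, so per-module rates simply add up.
    """
    if module_interval_hours <= 0 or num_modules <= 0:
        raise ValueError("interval and module count must be positive")
    return module_interval_hours / num_modules


if __name__ == "__main__":
    modules = 100_000
    # Per-module DED interval of 1-2 million hours, as quoted in the text.
    for per_module in (1_000_000, 2_000_000):
        hours = system_interval_hours(per_module, modules)
        print(f"{per_module:,} h per module -> one event every {hours:.0f} h system-wide")
```

Under these assumptions the two bounds give 10 and 20 hours, matching the range quoted above.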
Solutions
Participating Institutions
Funding Sources
- Laboratory Directed Research and Development, Oak Ridge National Laboratory
Important Publications
- Chao Wang, Sudharshan S. Vazhkudai, Xiaosong Ma, Fei Meng, Youngjae Kim, and Christian Engelmann. NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines. In Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2012, pages 957-968, Shanghai, China, May 21-25, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4675-9. Acceptance rate 21% (118/569).
- Swen Böhm and Christian Engelmann. File I/O for MPI Applications in Redundant Execution Scenarios. In Proceedings of the 20th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2012, pages 112-119, Garching, Germany, February 15-17, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4633-9. ISSN 1066-6192.
- David Fiala, Kurt Ferreira, Frank Mueller, and Christian Engelmann. A Tunable, Software-based DRAM Error Detection and Correction Library for HPC. In Lecture Notes in Computer Science: Proceedings of the 17th European Conference on Parallel and Distributed Computing (Euro-Par) 2011 Workshops, Part II: 4th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 251-261, Bordeaux, France, August 29 – September 2, 2011. Springer Verlag, Berlin, Germany. ISBN 978-3-642-29740-3. Acceptance rate 60.0% (12/20).
- Christian Engelmann and Swen Böhm. Redundant Execution of HPC Applications with MR-MPI. In Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2011, pages 31-38, Innsbruck, Austria, February 15-17, 2011. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-864-9.
- Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim, Christian Engelmann, and Galen Shipman. Functional Partitioning to Optimize End-to-End Performance on Many-Core Architectures. In Proceedings of the 23rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2010, pages 1-12, New Orleans, LA, USA, November 13-19, 2010. ACM Press, New York, NY, USA. ISBN 978-1-4244-7559-9. Acceptance rate 19.8% (50/253).