redMPI – A Redundant MPI
As systems scale up in component count and nanometer process technology shrinks, soft errors are becoming the predominant source of interruptions in large-scale high-performance computing (HPC) systems. A double-error detection (DED) event normally occurs in a memory module with single-error correction (SEC) error-correcting code (ECC) once every 1-2 million hours of operation. In a system with 100,000 such modules, this translates to one DED event every 10-20 hours (1-2 million hours divided by 100,000 modules). Moreover, vendors have warned that silent data corruption (SDC), i.e., undetected bit flips, is becoming a problem as well. In general, software redundancy is able to transparently mask reported errors, such as detected hard and soft errors, without recovery. It is also able to detect silent errors, like SDC, through comparison and to recover from them using voting if more than two replicas exist.
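The difference between detection by comparison and correction by voting can be illustrated with a minimal sketch. The function name `compare_and_vote` and the representation of replica results as plain values are assumptions made for illustration; this is not RedMPI code.

```python
from collections import Counter

def compare_and_vote(replica_values):
    """Compare the values produced by all replicas of one process.

    Returns (agreed, winner): `agreed` is True when every replica matches;
    `winner` is the strict-majority value, or None when no majority exists
    (for example, two replicas that disagree).
    """
    counts = Counter(replica_values)
    value, votes = counts.most_common(1)[0]
    agreed = len(counts) == 1
    winner = value if votes > len(replica_values) / 2 else None
    return agreed, winner

# Two replicas: a mismatch is detected but cannot be resolved.
print(compare_and_vote([b"\x2a", b"\x2b"]))
# Three replicas: the corrupted copy is outvoted.
print(compare_and_vote([b"\x2a", b"\x2b", b"\x2a"]))
```

With two replicas a disagreement only signals that something went wrong. A third replica is what makes it possible to pick the value held by the majority.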
RedMPI is a prototype that enables transparent redundant execution of Message Passing Interface (MPI) applications. It is based on two earlier prototypes: MR-MPI, developed by Oak Ridge National Laboratory, and rMPI, developed by Sandia National Laboratories. RedMPI sits between the MPI library and the MPI application, utilizing the MPI performance tool interface, PMPI, to intercept MPI calls from the application and to hide all redundancy-related mechanisms. A redundantly executed application runs with r*m MPI processes, where r is the number of MPI ranks visible to the application and m is the replication degree. RedMPI supports partial replication, e.g., a degree of 2.5 instead of 2 or 3, for tunable resilience. It also supports a variety of message-based replication protocols offering different levels of consistency. Results indicate that the most efficient consistency protocol can successfully protect HPC applications even from high SDC rates with runtime overheads between 0% and 30%, compared to unprotected applications without redundancy.
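The process count r*m, including a fractional degree such as 2.5, can be made concrete with a short sketch. The layout below is purely hypothetical: the function names and the rule that the first ranks receive the extra replica are assumptions for illustration, not the mapping RedMPI actually uses.

```python
def replica_counts(r, degree):
    """Number of replicas for each of the r application-visible ranks.

    Assumption: with a fractional degree, the first ranks get one extra
    replica so that the total comes to round(r * degree) processes.
    """
    base = int(degree)
    extra = round((degree - base) * r)
    return [base + 1 if rank < extra else base for rank in range(r)]

def physical_layout(r, degree):
    """List of (virtual_rank, replica_index), indexed by physical rank."""
    return [(rank, k)
            for rank, n in enumerate(replica_counts(r, degree))
            for k in range(n)]

layout = physical_layout(4, 2.5)
print(len(layout))  # total number of MPI processes launched
print(layout)
```

For r = 4 and a degree of 2.5, this sketch gives two ranks three replicas each and two ranks two replicas each, for 10 processes in total, while the application still sees only 4 ranks.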
RedMPI can also be used as a fault injection tool by disabling the online error correction and keeping replicas isolated. A failure-free execution can be compared to the redundant execution with an injected fault, using the online error detection mechanism to track the propagation of corrupt messages. Depending on the application properties, a single bit flip can corrupt all MPI processes of an application within a short period of time, or may be corrected by the application’s computational structure, such as by an iterative algorithm.
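The idea of injecting a single bit flip and detecting the resulting divergence by comparing message hashes can be sketched as follows. The helper names `flip_bit` and `digest` are hypothetical, and SHA-256 is only a stand-in, since the text above does not state which hash function RedMPI uses.

```python
import hashlib

def flip_bit(payload, bit_index):
    """Return a copy of payload with exactly one bit inverted."""
    corrupted = bytearray(payload)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

def digest(payload):
    # SHA-256 is a stand-in; the actual hash used by RedMPI is not stated here.
    return hashlib.sha256(payload).digest()

clean = bytes(range(16))       # message from the failure-free replica
faulty = flip_bit(clean, 42)   # same message with one injected bit flip

print(digest(clean) == digest(clean))   # identical messages match
print(digest(clean) == digest(faulty))  # a corrupted message is flagged
```

Comparing a short digest instead of the full payload keeps the check cheap. Tracking which messages mismatch over time is what would reveal how far a single flipped bit spreads through the application.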
Message+hash replication protocol
Injecting bit-flips into the NAS LU benchmark
- David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012, pages 78:1-78:12, Salt Lake City, UT, USA, November 10-16, 2012. ACM Press, New York, NY, USA. ISBN 978-1-4673-0804-5. Acceptance rate 21.2% (100/472).
- James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS) 2012, pages 615-626, Macau, SAR, China, June 18-21, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4685-8. ISSN 1063-6927. Acceptance rate 13% (71/515).
- Swen Böhm and Christian Engelmann. File I/O for MPI Applications in Redundant Execution Scenarios. In Proceedings of the 20th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2012, pages 112-119, Garching, Germany, February 15-17, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4633-9. ISSN 1066-6192.
- Christian Engelmann and Swen Böhm. Redundant Execution of HPC Applications with MR-MPI. In Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2011, pages 31-38, Innsbruck, Austria, February 15-17, 2011. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-864-9.
- Christian Engelmann, Hong H. Ong, and Stephen L. Scott. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems. In Proceedings of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009, pages 189-194, Innsbruck, Austria, February 16-18, 2009. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-784-0.