
redMPI: A Redundant Message Passing Interface Implementation

Summary: RedMPI enables transparent redundant execution of Message Passing Interface (MPI) applications to protect against silent data corruption, i.e., undetected bit flips. It can also be used as a fault injection tool: with online error correction disabled and replicas kept isolated, the online error detection mechanism compares an error-free execution with an erroneous one to track the propagation of corrupt messages.

As systems scale up in component count and nanometer process technology shrinks, soft errors are becoming the predominant source of interruptions in large-scale high-performance computing (HPC) systems. Double-error detection (DED) events, which normally occur once every 1-2 million hours of operation in a memory module protected by single-error correction (SEC) error-correcting code (ECC), translate into one such event every 10-20 hours in a system with 100,000 modules (1-2 million hours divided by 100,000 modules). Moreover, vendors have warned that silent data corruption (SDC), i.e., undetected bit flips, is becoming a problem as well. In general, software redundancy can transparently mask reported errors, such as detected hard and soft errors, without recovery. It can also detect silent errors, such as SDC, by comparing replicas, and it can correct them by voting if more than two replicas exist.
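The comparison and voting idea can be illustrated with a small sketch. This is not RedMPI code: the function names and the choice of SHA-256 are assumptions made for illustration, and messages are modeled as plain byte strings rather than MPI buffers.

```python
import hashlib
from collections import Counter
from typing import Optional, Sequence


def digest(message: bytes) -> bytes:
    """Hash a message so that replicas can be compared cheaply."""
    return hashlib.sha256(message).digest()


def detect_mismatch(replica_messages: Sequence[bytes]) -> bool:
    """Return True if the replicas disagree (possible silent corruption)."""
    return len({digest(m) for m in replica_messages}) > 1


def vote(replica_messages: Sequence[bytes]) -> Optional[bytes]:
    """Return the message held by a strict majority of replicas, else None."""
    if not replica_messages:
        raise ValueError("at least one replica message is required")
    counts = Counter(digest(m) for m in replica_messages)
    winner, count = counts.most_common(1)[0]
    if count * 2 <= len(replica_messages):
        return None  # no strict majority: detected but not correctable
    return next(m for m in replica_messages if digest(m) == winner)
```

With two replicas a mismatch can only be detected, since neither copy holds a majority. With three replicas a single corrupted copy is outvoted and the clean message is returned.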

RedMPI is a prototype that enables transparent redundant execution of Message Passing Interface (MPI) applications. It is based on two earlier prototypes: MR-MPI, developed by Oak Ridge National Laboratory, and rMPI, developed by Sandia National Laboratories. RedMPI sits between the MPI library and the MPI application, using the MPI profiling interface, PMPI, to intercept MPI calls from the application and to hide all redundancy-related mechanisms (Figure 1). A redundantly executed application runs with r*m MPI processes, where r is the number of MPI ranks visible to the application and m is the replication degree. RedMPI supports partial replication, e.g., a degree of 2.5 instead of 2 or 3, for tunable resilience. It also supports a variety of message-based replication protocols with different consistency guarantees (Figure 2). Results indicate that the most efficient consistency protocol can successfully protect HPC applications even from high SDC rates, with runtime overheads between 0% and 30% compared to unprotected applications without redundancy.
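To make the r*m process count and partial replication concrete, the following sketch maps application-visible ranks to physical MPI processes. The function `replica_layout` and its placement rule are hypothetical: it assumes the fractional part of the degree is realized by giving the lowest-numbered ranks one extra replica, which may differ from how RedMPI actually assigns replicas.

```python
import math
from typing import Dict, List


def replica_layout(ranks: int, degree: float) -> Dict[int, List[int]]:
    """Map each application-visible rank to the physical ranks of its replicas.

    Assumption: the fractional part of `degree` is realized by giving the
    lowest-numbered ranks one extra replica.
    """
    if ranks < 1 or degree < 1:
        raise ValueError("need at least one rank and a degree of at least 1")
    base = math.floor(degree)
    extra = round((degree - base) * ranks)  # ranks that get base + 1 replicas
    layout: Dict[int, List[int]] = {}
    next_physical = 0
    for rank in range(ranks):
        copies = base + (1 if rank < extra else 0)
        layout[rank] = list(range(next_physical, next_physical + copies))
        next_physical += copies
    return layout
```

For example, `replica_layout(4, 2.5)` gives two ranks three replicas each and two ranks two replicas each, for 10 physical processes in total, matching r*m = 4 * 2.5.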

RedMPI can also be used as a fault injection tool by disabling the online error correction and keeping replicas isolated (Figure 3). A failure-free execution can then be compared to a redundant execution with an injected fault, using the online error detection mechanism to track the propagation of corrupt messages. Depending on the application's properties, a single bit flip can corrupt all MPI processes of an application within a short period of time, or it may be corrected by the application’s computational structure, such as an iterative algorithm.
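The fault injection idea can be sketched as follows. This is an illustration under simplifying assumptions, not RedMPI's mechanism: each replica's outgoing messages are modeled as a list of byte strings, and the helper names `flip_bit` and `diverging_messages` are hypothetical.

```python
from typing import List


def flip_bit(message: bytes, bit_index: int) -> bytes:
    """Return a copy of `message` with one bit inverted."""
    if not 0 <= bit_index < len(message) * 8:
        raise IndexError("bit_index out of range")
    corrupted = bytearray(message)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def diverging_messages(clean: List[bytes], faulty: List[bytes]) -> List[int]:
    """Indices of messages where the faulty replica differs from the clean one."""
    return [i for i, (a, b) in enumerate(zip(clean, faulty)) if a != b]
```

Comparing the message trace of a clean replica with that of a replica that received an injected flip shows which messages were affected. In a real run the set of diverging messages would grow or shrink over time depending on whether the application propagates or absorbs the error.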


Figure 1: redMPI Architecture

Figure 2: Message+hash replication protocol

Figure 3: Injecting bit-flips into the NAS LU benchmark

Research Projects

Funding Sources

Participating Institutions

Peer-reviewed Conference Publications

  1. David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012, pages 78:1-78:12, Salt Lake City, UT, USA, November 10-16, 2012. ACM Press, New York, NY, USA. ISBN 978-1-4673-0804-5. DOI 10.1109/SC.2012.49. Acceptance rate 21.2% (100/472).
  2. James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS) 2012, pages 615-626, Macau, SAR, China, June 18-21, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4685-8. ISSN 1063-6927. DOI 10.1109/ICDCS.2012.56. Acceptance rate 13.8% (71/515).
  3. Swen Böhm and Christian Engelmann. File I/O for MPI Applications in Redundant Execution Scenarios. In Proceedings of the 20th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2012, pages 112-119, Garching, Germany, February 15-17, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4633-9. ISSN 1066-6192. DOI 10.1109/PDP.2012.22.
  4. Christian Engelmann and Swen Böhm. Redundant Execution of HPC Applications with MR-MPI. In Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2011, pages 31-38, Innsbruck, Austria, February 15-17, 2011. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-864-9. DOI 10.2316/P.2011.719-031.
  5. Christian Engelmann, Hong H. Ong, and Stephen L. Scott. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems. In Proceedings of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009, pages 189-194, Innsbruck, Austria, February 16-18, 2009. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-784-0.

Peer-reviewed Workshop Publications

  1. David Fiala, Kurt Ferreira, Frank Mueller, and Christian Engelmann. A Tunable, Software-based DRAM Error Detection and Correction Library for HPC. In Lecture Notes in Computer Science: Proceedings of the 17th European Conference on Parallel and Distributed Computing (Euro-Par) 2011 Workshops, Part II: 4th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 251-261, Bordeaux, France, August 29 – September 2, 2011. Springer Verlag, Berlin, Germany. ISBN 978-3-642-29740-3. DOI 10.1007/978-3-642-29740-3_29. Acceptance rate 60.0% (12/20).

Peer-reviewed Conference Posters

  1. David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, and Kurt Ferreira. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. Poster at the 24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011, Seattle, WA, USA, November 12-18, 2011.
  2. David Fiala, Kurt Ferreira, Frank Mueller, and Christian Engelmann. A Tunable, Software-based DRAM Error Detection and Correction Library for HPC. Poster at the 24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011, Seattle, WA, USA, November 12-18, 2011.

Technical Reports

  1. David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. Technical Report, ORNL/TM-2012/227, Oak Ridge National Laboratory, Oak Ridge, TN, USA, June 1, 2012.

Talks and Lectures

  1. Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the United States Naval Academy, Annapolis, MD, USA, February 18, 2016.
  2. Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the 19th Workshop on Distributed Supercomputing (SOS) 2015, Park City, UT, USA, March 2-5, 2015.
  3. Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the Technical University of Dresden, Dresden, Germany, September 3, 2013.
  4. Christian Engelmann. Resilience for Permanent, Transient, and Undetected Errors. Invited talk at the 16th Workshop on Distributed Supercomputing (SOS) 2012, Santa Barbara, CA, USA, March 12-15, 2012.
  5. Christian Engelmann. Resilient Software for ExaScale Computing. Invited talk at the Birds of a Feather Session on Resilient Software for ExaScale Computing at the 24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011, Seattle, WA, USA, November 17, 2011.
  6. Christian Engelmann. Resilience and Hardware/Software Co-design for Extreme-Scale Supercomputing. Seminar at the Barcelona Supercomputing Center, Barcelona, Spain, July 27, 2011.
  7. Christian Engelmann. Modular Redundancy for Soft-Error Resilience in Large-Scale HPC Systems. Invited talk at the Dagstuhl Seminar on Fault Tolerance in High-Performance Computing and Grids, Schloss Dagstuhl, Wadern, Germany, May 3-8, 2009.
