2015-…: Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

September 1st, 2017

Extreme-scale high-performance computing (HPC) significantly advances discovery in fundamental scientific processes by enabling multiscale simulations that range from the very small, on quantum and atomic scales, to the very large, on planetary and cosmological scales. Computing at scales of hundreds of petaflops, exaflops (quintillions, or billion billions, of operations per second), and beyond will also lend a competitive advantage to the US energy and industrial sectors by providing the computing power for rapid design and prototyping and big data analysis.

To build and effectively operate extreme-scale HPC systems, the US Department of Energy cites several key challenges, including resilience: efficient and correct operation despite faults or defects in system components that can cause errors. These innovative systems require equally innovative components designed to communicate and compute at unprecedented rates, scales, and levels of complexity, which increases the probability of hardware and software faults.

This research project offers a structured hardware and software design approach for improving resilience in extreme-scale HPC systems so that scientific applications running on these systems generate accurate solutions in a timely and efficient manner. Frequently used in computer engineering, design patterns identify problems and provide generalized solutions through reusable templates.

Using a novel resilience design pattern concept, this project identifies and evaluates repeatedly occurring resilience problems and coordinates solutions throughout hardware and software components in HPC systems. This effort will create comprehensive methods and metrics by which system vendors and computing centers can establish mechanisms and interfaces to coordinate flexible fault management across hardware and software components and optimize the cost-benefit trade-offs among performance, resilience, and power consumption. Reusable programming templates of these patterns will offer resilience portability across different HPC system architectures and permit design space exploration and adaptation to different design trade-offs. For more information, please visit ornlwiki.atlassian.net/wiki/display/RDP.

Current resilience design pattern specification: Version 1.1

Important Publications

  1. Saurabh Hukerikar and Christian Engelmann. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale. Journal of Supercomputing Frontiers and Innovations (JSFI), volume 4, number 3, 2017. South Ural State University Chelyabinsk, Russia. ISSN 2409-6008. To appear.
  2. Saurabh Hukerikar and Christian Engelmann. Pattern-based Modeling of High-Performance Computing Resilience. In Lecture Notes in Computer Science: Proceedings of the 23rd European Conference on Parallel and Distributed Computing (Euro-Par) 2017 Workshops: 10th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, Santiago de Compostela, Spain, August 29, 2017. Springer Verlag, Berlin, Germany. Acceptance rate 66.7% (4/6). To appear.
  3. Saurabh Hukerikar and Christian Engelmann. A Pattern Language for High-Performance Computing Resilience. In Proceedings of the 22nd European Conference on Pattern Languages of Programs (EuroPLoP) 2017, Kloster Irsee, Germany, July 12-16, 2017. ACM Press, New York, NY, USA. To appear.
  4. Saurabh Hukerikar, Rizwan Ashraf, and Christian Engelmann. Towards New Metrics for High-Performance Computing Resilience. In Proceedings of the 26th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) 2017: 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2017, pages 23-30, Washington, D.C., June 26-30, 2017. ACM Press, New York, NY, USA. ISBN 978-1-4503-5001-3. Acceptance rate 83.3% (5/6).
  5. Saurabh Hukerikar and Christian Engelmann. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale (Version 1.1). Technical Report, ORNL/TM-2016/767, Oak Ridge National Laboratory, Oak Ridge, TN, USA, December 2016.
  6. Saurabh Hukerikar and Christian Engelmann. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale (Version 1.0). Technical Report, ORNL/TM-2016/687, Oak Ridge National Laboratory, Oak Ridge, TN, USA, October 2016.
  7. Saurabh Hukerikar and Christian Engelmann. Language Support for Reliable Memory Regions. In Lecture Notes in Computer Science: Proceedings of the 29th International Workshop on Languages and Compilers for Parallel Computing, pages 73-87, Rochester, NY, USA, September 28-30, 2016. Springer Verlag, Berlin, Germany. ISBN 978-3-319-52708-6. ISSN 0302-9743. Acceptance rate 76.9% (20/26).
  8. Saurabh Hukerikar and Christian Engelmann. Havens: Explicit Reliable Memory Regions for HPC Applications. In Proceedings of the 20th IEEE High Performance Extreme Computing Conference (HPEC) 2016, pages 1-6, Waltham, MA, USA, September 13-15, 2016. IEEE Computer Society, Los Alamitos, CA, USA.