2015-…: Catalog: Characterizing Faults, Errors, and Failures in Extreme-Scale Systems

June 15th, 2017

US Department of Energy (DOE) leadership computing facilities are in the process of deploying extreme-scale high-performance computing (HPC) systems with the long-range goal of building exascale systems that perform more than a quintillion (a billion billion) operations per second. More powerful computers mean researchers can simulate biological, chemical, and other physical interactions with an unprecedented amount of realism. However, as HPC systems become more complex, system integrators, component manufacturers, and computing facilities must prepare for unique computing challenges, and are already doing so. Of particular concern are unfamiliar or more frequent faults in both hardware technologies and software applications that can lead to computational errors or system failures.

This project will help DOE computing facilities protect extreme-scale systems by characterizing potential faults and creating models that predict their propagation and impact. The Collaboration of Oak Ridge, Argonne and Lawrence Livermore National Laboratories (CORAL) is a private/public partnership that will stand up three extreme-scale systems in 2017/2018, each operating at about 150 to 200 petaflops, or nearly 10 times more power than the 27-petaflop Titan at Oak Ridge National Laboratory (currently the fastest system in the United States) and about a tenth of exascale power.

By monitoring hardware and software performance on current DOE systems, such as Titan, and applying the data to fault analysis and vulnerability studies, this effort will capture observed and inferred fault conditions and extrapolate this knowledge to CORAL and other extreme-scale systems. Using these analyses, the project team will create assessment tools, including a fault taxonomy and catalog as well as fault models, to give computing facilities a clear picture of the fault characteristics in DOE computing environments and to inform technical and operational decisions that improve resilience. The catalog, models, and software resulting from this project will be made publicly available. For more information, please visit ornlwiki.atlassian.net/wiki/display/CFEFIES.

Funding Sources

Participating Institutions

Important Publications

  1. Saurabh Gupta, Devesh Tiwari, Tirthak Patel, and Christian Engelmann. Reliability of HPC systems: Long-term Measurement, Analysis, and Implications. In Proceedings of the 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017, Denver, CO, USA, November 12-17, 2017. IEEE Computer Society, Los Alamitos, CA, USA. Acceptance rate 18.7% (61/327).
  2. Bin Nie, Devesh Tiwari, Ji Xue, Saurabh Gupta, Christian Engelmann, and Evgenia Smirni. Exploring and Exploiting the Interplay among Temperature, Power, and GPU Errors. In Proceedings of the 25th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS) 2017, Banff, AB, Canada, September 20-22, 2017. IEEE Computer Society, Los Alamitos, CA, USA.
  3. Kun Tang, Devesh Tiwari, Saurabh Gupta, Ping Huang, QiQi Lu, Christian Engelmann, and Xubin He. Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy. In Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 311-322, Toulouse, France, June 28 – July 1, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 2158-3927. Acceptance rate 22.4% (58/259).
  4. Leonardo Bautista-Gomez, Ana Gainaru, Swann Perarnau, Devesh Tiwari, Saurabh Gupta, Franck Cappello, Christian Engelmann, and Marc Snir. Reducing Waste in Extreme Scale Systems Through Introspective Analysis. In Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016, pages 212-221, Chicago, IL, USA, May 23-27, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1530-2075. Acceptance rate 23.0% (114/496).
  5. Devesh Tiwari, Saurabh Gupta, and Christian Engelmann. Lightweight, Actionable Analytical Tools Based on Statistical Learning for Efficient System Operations. White paper submitted to the U.S. Department of Energy's Workshop on Modeling & Simulation of Systems & Applications (ModSim) 2016, August 10-12, 2016.