2015-…: Catalog: Characterizing Faults, Errors, and Failures in Extreme-Scale Systems

December 11th, 2018

US Department of Energy (DOE) leadership computing facilities are in the process of deploying extreme-scale high-performance computing (HPC) systems with the long-range goal of building exascale systems that perform more than a quintillion (a billion billion) operations per second. More powerful computers mean researchers can simulate biological, chemical, and other physical interactions with an unprecedented amount of realism. However, as HPC systems become more complex, system integrators, component manufacturers, and computing facilities must prepare for unique computing challenges. Of particular concern are unfamiliar or more frequent faults in both hardware technologies and software applications that can lead to computational errors or system failures.

This project will help DOE computing facilities protect extreme-scale systems by characterizing potential faults and creating models that predict their propagation and impact. The Collaboration of Oak Ridge, Argonne and Lawrence Livermore National Laboratories (CORAL) is a private/public partnership that will stand up three extreme-scale systems in 2017/2018, each operating at about 150 to 200 petaflops, or nearly 10 times more power than the 27-petaflop Titan at Oak Ridge National Laboratory (currently the fastest system in the United States) and about a tenth of exascale power.

By monitoring hardware and software performance on current DOE systems, such as Titan, and applying the data to fault analysis and vulnerability studies, this effort will capture observed and inferred fault conditions and extrapolate this knowledge to CORAL and other extreme-scale systems. Using these analyses, the project team will create assessment tools, including a fault taxonomy and catalog as well as fault models, to give computing facilities a clear picture of the fault characteristics in DOE computing environments and to inform technical and operational decisions that improve resilience. The catalog, models, and software resulting from this project will be made publicly available. For more information, please visit ornlwiki.atlassian.net/wiki/display/CFEFIES.

In the News

2018-11-19: HPCwire: What’s New in HPC Research: Thrill for Big Data, Scaling Resilience and More
2018-08-05: inside HPC: Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems

Funding Sources

Participating Institutions

Peer-reviewed Conference Publications

  1. Mohit Kumar, Saurabh Gupta, Tirthak Patel, Michael Wilder, Weisong Shi, Song Fu, Christian Engelmann, and Devesh Tiwari. Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System. In Proceedings of the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018, pages 107-114, Luxembourg City, Luxembourg, June 25-28, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-5596-2. ISSN 2158-3927. DOI 10.1109/DSN.2018.00023. Acceptance rate 27.2% (62/228).
  2. Bin Nie, Ji Xue, Saurabh Gupta, Tirthak Patel, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. Machine Learning Models for GPU Error Prediction in a Large Scale HPC System. In Proceedings of the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018, pages 95-106, Luxembourg City, Luxembourg, June 25-28, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-5596-2. ISSN 2158-3927. DOI 10.1109/DSN.2018.00022. Acceptance rate 27.2% (62/228).
  3. Saurabh Gupta, Tirthak Patel, Christian Engelmann, and Devesh Tiwari. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications. In Proceedings of the 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017, pages 44:1-44:12, Denver, CO, USA, November 12-17, 2017. ACM Press, New York, NY, USA. ISBN 978-1-4503-5114-0. DOI 10.1145/3126908.3126937. Acceptance rate 18.7% (61/327).
  4. Bin Nie, Ji Xue, Saurabh Gupta, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities. In Proceedings of the 25th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS) 2017, pages 22-31, Banff, AB, Canada, September 20-22, 2017. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-2764-8. ISSN 2375-0227. DOI 10.1109/MASCOTS.2017.12. Acceptance rate 30.95% (26/84).

Peer-reviewed Workshop Publications

  1. Yawei Hui, Byung Hoon (Hoony) Park, and Christian Engelmann. A Comprehensive Informative Metric for Analyzing HPC System Status using the LogSCAN Platform. In Proceedings of the 31st International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2018: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2018, pages 29-38, Dallas, TX, USA, November 16, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-0222-1. DOI 10.1109/FTXS.2018.00007. Acceptance rate 45.0% (9/20).
  2. Rizwan Ashraf and Christian Engelmann. Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer. In Proceedings of the 31st International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2018: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2018, pages 39-48, Dallas, TX, USA, November 16, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-0222-1. DOI 10.1109/FTXS.2018.00008. Acceptance rate 45.0% (9/20).
  3. Byung Hoon (Hoony) Park, Yawei Hui, Swen Boehm, Rizwan Ashraf, Christian Engelmann, and Christopher Layton. A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log. In Proceedings of the 19th IEEE International Conference on Cluster Computing (Cluster) 2018: 5th Workshop on Monitoring and Analysis for High Performance Systems Plus Applications (HPCMASPA) 2018, pages 571-579, Belfast, UK, September 10, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-8319-4. ISSN 2168-9253. DOI 10.1109/CLUSTER.2018.00073.
  4. Byung Hoon (Hoony) Park, Saurabh Hukerikar, Christian Engelmann, and Ryan Adamson. Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale. In Proceedings of the 18th IEEE International Conference on Cluster Computing (Cluster) 2017: 4th Workshop on Monitoring and Analysis for High Performance Systems Plus Applications (HPCMASPA) 2017, pages 758-765, Honolulu, HI, USA, September 5, 2017. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-2327-5. ISSN 2168-9253. DOI 10.1109/CLUSTER.2017.113.

Peer-reviewed Conference Posters

  1. Yawei Hui, Rizwan Ashraf, Byung-Hoon Park, and Christian Engelmann. Real-Time Assessment of Supercomputer Status by a Comprehensive Informative Metric through Streaming Processing. Poster at the 6th IEEE International Conference on Big Data (BigData) 2018, Seattle, WA, USA, October 21, 2018.
  2. Yawei Hui, Byung Hoon (Hoony) Park, and Christian Engelmann. A Comprehensive Informative Metric for Summarizing HPC System Status. Poster at the 8th IEEE Symposium on Large Data Analysis and Visualization in conjunction with the 8th IEEE Vis 2018, Berlin, Germany, October 21, 2018.

White Papers

  1. Devesh Tiwari, Saurabh Gupta, and Christian Engelmann. Lightweight, Actionable Analytical Tools Based on Statistical Learning for Efficient System Operations. White paper submitted to the U.S. Department of Energy's Workshop on Modeling & Simulation of Systems & Applications (ModSim) 2016, August 10-12, 2016.

Talks and Lectures

  1. Christian Engelmann. Characterizing Faults, Errors, and Failures in Extreme-Scale Systems. Invited talk at the Platform for Advanced Scientific Computing (PASC) Conference 2018, Basel, Switzerland, July 2-4, 2018.
  2. Christian Engelmann. Characterizing Faults, Errors, and Failures in Extreme-Scale Systems. Invited talk at the 6th Accelerated Data Analytics and Computing (ADAC) Institute Workshop, Zurich, Switzerland, June 20-21, 2018.
  3. Christian Engelmann. A Catalog of Faults, Errors, and Failures in Extreme-Scale Systems. Invited talk at the SIAM Annual Meeting (AM) 2017, Pittsburgh, PA, USA, July 2017.
  4. Christian Engelmann. Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems. Invited talk at the International Supercomputing Conference (ISC) 2017, Frankfurt am Main, Germany, June 16-22, 2017.
  5. Christian Engelmann. A Catalog of Faults, Errors, and Failures in Extreme-Scale Systems. Invited talk at the 12th Scheduling for Large Scale Systems Workshop (SLSSW) 2017, Knoxville, TN, USA, May 24-26, 2017.
  6. Christian Engelmann. The Missing High-Performance Computing Fault Model. Invited talk at the 17th SIAM Conference on Parallel Processing for Scientific Computing (PP) 2016, Paris, France, April 12-15, 2016.
  7. Christian Engelmann. Toward A Fault Model And Resilience Design Patterns For Extreme Scale Systems. Keynote talk at the 8th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, held in conjunction with the 21st European Conference on Parallel and Distributed Computing (Euro-Par) 2015, Vienna, Austria, August 24-28, 2015.
