2012-14: Hardware/Software Resilience Co-Design Tools for Extreme-scale High-Performance Computing
The path to exascale computing poses several research challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Resilience, i.e., providing efficiency and correctness in the presence of faults, is one of the most important exascale computer science challenges as systems scale up in component count (100,000-1,000,000 nodes with 1,000-10,000 cores per node by 2020) and component reliability decreases (7 nm technology with near-threshold voltage operation by 2020). Several high-performance computing (HPC) resilience technologies have been developed. However, there are currently no tools, methods, and metrics to compare them and to identify the cost/benefit trade-off between the key system design factors: performance, resilience, and power consumption. This work focuses on developing a resilience co-design toolkit with definitions, metrics, and methods to evaluate the cost/benefit trade-off of resilience solutions, identify hardware/software resilience properties, and coordinate interfaces/responsibilities of individual hardware/software components. The primary goal of this project is to provide the tools and data needed by HPC vendors to decide on future architectures and to enable direct feedback to HPC vendors on emerging resilience threats.
- Laboratory Directed Research and Development, Oak Ridge National Laboratory
Symbols: Abstract, Publication, Presentation, BibTeX Citation, DOI Link
- Christian Engelmann and Thomas Naughton. A Network Contention Model for the Extreme-scale Simulator. In Proceedings of the 34th IASTED International Conference on Modelling, Identification and Control (MIC) 2015, Innsbruck, Austria, February 17-18, 2015. ACTA Press, Calgary, AB, Canada.
- Christian Engelmann and Thomas Naughton. Improving the Performance of the Extreme-scale Simulator. In Proceedings of the 18th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2014, pages 198-207, Toulouse, France, October 1-3, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4799-6143-6. ISSN 1550-6525.
- Thomas Naughton, Christian Engelmann, Geoffroy Vallée, and Swen Böhm. Supporting the Development of Resilient Message Passing Applications using Simulation. In Proceedings of the 22nd Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2014, pages 271-278, Turin, Italy, February 12-14, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1066-6192.
- Christian Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. Future Generation Computer Systems (FGCS), volume 30, number 0, pages 59-65, 2014. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0167-739X.
- Christian Engelmann and Thomas Naughton. Toward a Performance/Resilience Tool for Hardware/Software Co-Design of High-Performance Computing Systems. In Proceedings of the 42nd International Conference on Parallel Processing (ICPP) 2013: 4th International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), Lyon, France, October 2, 2013. IEEE Computer Society, Los Alamitos, CA, USA. To appear.