2012-…: Hardware/Software Resilience Co-Design Tools for Extreme-scale High-Performance Computing
The path to exascale computing poses several research challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Resilience, i.e., providing efficiency and correctness in the presence of faults, is one of the most important exascale computer science challenges as systems scale up in component count (100,000-1,000,000 nodes with 1,000-10,000 cores per node by 2020) and component reliability decreases (7 nm technology with near-threshold voltage operation by 2020). Several high-performance computing (HPC) resilience technologies have been developed. However, there are currently no tools, methods, or metrics to compare them and to identify the cost/benefit trade-off between the key system design factors: performance, resilience, and power consumption. This work focuses on developing a resilience co-design toolkit with definitions, metrics, and methods to evaluate the cost/benefit trade-off of resilience solutions, identify hardware/software resilience properties, and coordinate the interfaces and responsibilities of individual hardware/software components. The primary goal of this project is to provide the tools and data needed by HPC vendors to decide on future architectures and to enable direct feedback to HPC vendors on emerging resilience threats.
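To illustrate why increasing component count makes resilience a first-order concern, the following sketch (not part of the project's toolkit) applies the common simplifying assumption of independent, exponentially distributed node failures, under which the system mean time between failures (MTBF) shrinks linearly with node count. The 25-year node MTBF used here is a hypothetical figure for illustration.

```python
def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """System MTBF assuming independent exponential node failures:
    the aggregate failure rate is node_count times the per-node rate,
    so the system MTBF is the node MTBF divided by node_count."""
    return node_mtbf_hours / node_count

# Hypothetical per-node MTBF of 25 years, expressed in hours.
node_mtbf = 25 * 365 * 24  # 219,000 hours

# At the node counts projected above, failures become routine events.
for n in (100_000, 1_000_000):
    print(f"{n:>9} nodes -> system MTBF {system_mtbf_hours(node_mtbf, n):.2f} h")
```

Even this first-order estimate shows that a machine built from highly reliable parts can fail every few hours at scale, which is why cost/benefit analysis of resilience techniques (e.g., checkpoint/restart overhead versus expected failure rate) is central to co-design.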
- Laboratory Directed Research and Development, Oak Ridge National Laboratory
- Christian Engelmann and Thomas Naughton. Toward a Performance/Resilience Tool for Hardware/Software Co-Design of High-Performance Computing Systems. In Proceedings of the 42nd International Conference on Parallel Processing (ICPP) 2013: 4th International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), Lyon, France, October 2, 2013. IEEE Computer Society, Los Alamitos, CA, USA. To appear.