2012-…: Hardware/Software Resilience Co-Design Tools for Extreme-scale High-Performance Computing
The path to exascale computing poses several research challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Resilience, i.e., providing efficiency and correctness in the presence of faults, is one of the most important exascale computer science challenges as systems scale up in component count (100,000-1,000,000 nodes with 1,000-10,000 cores per node by 2020) and component reliability decreases (7 nm technology with near-threshold voltage operation by 2020). Several high-performance computing (HPC) resilience technologies have been developed. However, there are currently no tools, methods, or metrics to compare them and to identify the cost/benefit trade-offs between the key system design factors: performance, resilience, and power consumption. This work focuses on developing a resilience co-design toolkit with definitions, metrics, and methods to evaluate the cost/benefit trade-offs of resilience solutions, identify hardware/software resilience properties, and coordinate the interfaces and responsibilities of individual hardware/software components. The primary goal of this project is to provide HPC vendors with the tools and data needed to make decisions about future architectures and to give them direct feedback on emerging resilience threats.
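To illustrate the kind of cost/benefit metric such a toolkit might evaluate, the sketch below computes the classic first-order optimal checkpoint interval (Young's approximation) and the resulting checkpoint overhead. This is a well-known model from the checkpoint/restart literature, not the project's own metric; the function names and the example numbers (a 600 s checkpoint cost and a one-day system MTBF) are illustrative assumptions.

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation: interval = sqrt(2 * C * MTBF).

    checkpoint_cost_s: time to write one checkpoint (seconds)
    mtbf_s: system mean time between failures (seconds)
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

def checkpoint_overhead(interval_s: float, checkpoint_cost_s: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints
    (ignoring rework and restart costs)."""
    return checkpoint_cost_s / (interval_s + checkpoint_cost_s)

# Illustrative numbers: 600 s checkpoint cost, 24 h MTBF.
interval = optimal_checkpoint_interval(600.0, 86400.0)   # ~10,182 s (~2.8 h)
overhead = checkpoint_overhead(interval, 600.0)          # ~5.6% of wall-clock time
print(f"interval = {interval:.0f} s, overhead = {overhead:.1%}")
```

As the model shows, a shorter MTBF (more frequent failures at scale) forces more frequent checkpoints and thus higher overhead, which is exactly the kind of performance/resilience trade-off the toolkit aims to quantify.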
Funding source: Laboratory Directed Research and Development, Oak Ridge National Laboratory