This project will increase the ability of scientific applications to reach accurate solutions in a timely and efficient manner. Using a novel design pattern concept, it identifies and evaluates repeatedly occurring resilience problems and coordinates solutions throughout high-performance computing hardware and software. [US Department of Energy Early Career Research Program]
This project identifies, categorizes and models the fault, error and failure properties of US Department of Energy high-performance computing (HPC) systems. It develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current systems and extrapolate this knowledge to exascale HPC systems. [US Department of Energy Resilience for Extreme Scale Supercomputing Systems Program]
This project develops an operating system and runtime (OS/R) environment for extreme-scale scientific computing based on the Kitten OS and Palacios virtual machine monitor. It includes high-value, high-risk research exploring virtualization, analytics, networking, energy/power, scheduling/parallelism, architecture, resilience, programming models, and tools.
This project develops resilient Monte Carlo solvers with natural fault tolerance to hard and soft failures for efficiently executing next-generation computational science applications on exascale high-performance computing (HPC) systems. It extends initial work in Monte Carlo Synthetic Acceleration (MCSA) and evaluates the developed solvers in a simulated extreme-scale system with realistic synthetic faults.