Current Research

June 9th, 2016
2015-…: Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

This project will increase the ability of scientific applications to reach accurate solutions in a timely and efficient manner. Using a novel design pattern concept, it identifies and evaluates repeatedly occurring resilience problems and coordinates solutions throughout high-performance computing hardware and software. [US Department of Energy Early Career Research Program]

2015-…: Catalog: Characterizing Faults, Errors, and Failures in Extreme-Scale Systems

This project identifies, categorizes and models the fault, error and failure properties of US Department of Energy high-performance computing (HPC) systems. It develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current systems and extrapolate this knowledge to exascale HPC systems. [US Department of Energy Resilience for Extreme Scale Supercomputing Systems Program]

2013-…: Hobbes – OS and Runtime Support for Application Composition

An operating system and runtime (OS/R) environment for extreme-scale scientific computing, based on the Kitten OS and the Palacios virtual machine monitor. It includes high-value, high-risk research exploring virtualization, analytics, networking, energy/power, scheduling/parallelism, architecture, resilience, programming models, and tools.

2013-…: MCREX – Monte Carlo Resilient Exascale Solvers

This project develops resilient Monte Carlo solvers with natural fault tolerance to hard and soft failures for efficiently executing next-generation computational science applications on exascale high-performance computing (HPC) systems. It extends initial work in Monte Carlo Synthetic Acceleration (MCSA) and evaluates the developed solvers in a simulated extreme-scale system with realistic synthetic faults.
