Ongoing Research Activities
This project will increase the ability of scientific applications to reach accurate solutions in a timely and efficient manner. Using a novel design-pattern concept, it identifies and evaluates recurring resilience problems and coordinates solutions across high-performance computing hardware and software.
This project identifies, categorizes and models the fault, error and failure properties of US Department of Energy high-performance computing (HPC) systems. It develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current systems and extrapolate this knowledge to exascale HPC systems.
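The fault, error and failure terminology behind such a taxonomy follows a standard dependability chain: a dormant fault becomes an error when activated, and an undetected error can propagate to a failure. A minimal sketch of that chain, with hypothetical class and field names invented for illustration (the project's actual catalog is far richer), might look like:

```python
from dataclasses import dataclass
from enum import Enum

class FaultClass(Enum):
    TRANSIENT = "transient"        # e.g., a cosmic-ray-induced bit flip
    INTERMITTENT = "intermittent"
    PERMANENT = "permanent"

@dataclass
class Event:
    component: str
    fault: FaultClass
    activated: bool   # a fault becomes an error only when activated
    detected: bool    # an undetected error can propagate to a failure

def classify(e: Event) -> str:
    """Map an observed event onto the fault -> error -> failure chain."""
    if not e.activated:
        return "dormant fault"
    if e.detected:
        return "error (detected, potentially correctable)"
    return "failure (service deviates from specification)"

# A transient DRAM bit flip that is consumed and goes undetected:
print(classify(Event("DRAM", FaultClass.TRANSIENT, True, False)))
```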
Operating system and runtime (OS/R) environment for extreme-scale scientific computing based on the Kitten OS and the Palacios virtual machine monitor, including high-value, high-risk research exploring virtualization, analytics, networking, energy/power, scheduling/parallelism, architecture, resilience, programming models, and tools.
Resilient Monte Carlo solvers with natural fault tolerance to hard and soft failures for exascale high-performance computing (HPC) systems
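The natural fault tolerance of Monte Carlo methods comes from the fact that each sample contributes independently to a statistical aggregate: lost contributions can simply be dropped, leaving the estimate unbiased while only its variance grows. A minimal sketch (worker counts, failure probability and the pi-estimation example are illustrative assumptions, not the project's actual solvers):

```python
import random

def resilient_pi(workers: int = 100, samples_per_worker: int = 10_000,
                 fail_prob: float = 0.1, seed: int = 1) -> float:
    """Estimate pi by dart-throwing across many workers; failed workers'
    tallies are excluded rather than recomputed, so no rollback/recovery
    is needed and the surviving samples still converge to pi."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(workers):
        if rng.random() < fail_prob:   # worker lost to a hard or soft failure
            continue                   # drop its contribution, keep going
        for _ in range(samples_per_worker):
            x, y = rng.random(), rng.random()
            if x * x + y * y <= 1.0:
                hits += 1
        total += samples_per_worker
    return 4.0 * hits / total
```

Even with 10% of the workers lost, the estimate stays close to pi; only the statistical error bar widens because fewer samples survive.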
Past Research Activities
HPC resilience co-design toolkit evaluating the resilience/power/performance cost/benefit trade-off of resilience solutions, identifying hardware/software resilience properties, and coordinating interfaces/responsibilities of individual hardware/software components
HPC hardware/software co-design toolkit evaluating the performance of algorithms on future HPC architectures at extreme scale with up to 134,217,728 (2^27) processor cores [U.S. Department of Energy Extreme-scale Algorithms and Software Institute (EASI)]
HPC checkpoint storage virtualization to improve efficiency by aggregating a variety of resources, such as memory, SSDs and disks; MPI process-level software redundancy to eliminate rollback/recovery in HPC; software-based ECC to enhance soft error protection; and soft-error injection tools to study the vulnerability of science applications and of CMOS logic in processors and memory
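Software-based ECC adds redundant bits in software so that soft errors in unprotected memory can be detected and corrected on read. As an illustration of the principle only (not the project's actual encoding), a classic Hamming(7,4) single-error-correcting code can be sketched as:

```python
def hamming74_encode(nibble: int) -> list[int]:
    """Encode 4 data bits into a 7-bit Hamming codeword (single-error correcting)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    # Codeword positions 1..7; parity bits sit at positions 1, 2 and 4.
    c = [0, 0, 0, d[0], 0, d[1], d[2], d[3]]   # index 0 unused
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return c[1:]

def hamming74_decode(bits: list[int]) -> int:
    """Correct any single flipped bit and return the 4 data bits."""
    c = [0] + list(bits)
    syndrome = ((c[1] ^ c[3] ^ c[5] ^ c[7])
                | (c[2] ^ c[3] ^ c[6] ^ c[7]) << 1
                | (c[4] ^ c[5] ^ c[6] ^ c[7]) << 2)
    if syndrome:                    # the syndrome is the flipped bit's position
        c[syndrome] ^= 1
    return c[3] | c[5] << 1 | c[6] << 2 | c[7] << 3

word = hamming74_encode(0b1011)
word[4] ^= 1                        # inject a single soft error
assert hamming74_decode(word) == 0b1011
```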
2008-11: Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond
Scalable HPC system monitoring, reliability analysis of components and full systems, fault prediction, proactive fault tolerance using prediction-triggered migration (process and virtual machine), incremental checkpoint/restart, and holistic fault tolerance (checkpoint/restart + migration) [U.S. Department of Energy Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)]
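Incremental checkpoint/restart cuts checkpoint I/O by writing only the state that changed since the previous checkpoint and rebuilding full state at restart by replaying the deltas. A minimal sketch of that idea, using a plain dictionary as stand-in application state (the class and its interface are invented for illustration):

```python
import copy
import pickle

class IncrementalCheckpointer:
    """Sketch: store only keys whose values changed since the last
    checkpoint; restore by replaying the saved deltas in order."""
    def __init__(self):
        self.deltas = []          # serialized {key: value} diffs
        self._last = {}

    def checkpoint(self, state: dict) -> int:
        delta = {k: v for k, v in state.items() if self._last.get(k) != v}
        blob = pickle.dumps(delta)       # serialization stands in for stable-storage I/O
        self.deltas.append(blob)
        self._last = copy.deepcopy(state)
        return len(blob)                 # bytes "written" this checkpoint

    def restore(self) -> dict:
        state = {}
        for blob in self.deltas:
            state.update(pickle.loads(blob))
        return state

ckpt = IncrementalCheckpointer()
state = {"iter": 0, "grid": [0.0] * 1000}
full = ckpt.checkpoint(state)       # first checkpoint writes everything
state["iter"] = 1                   # only a small part of state changes
small = ckpt.checkpoint(state)      # second checkpoint writes just the delta
assert small < full
assert ckpt.restore() == state      # recovery replays the deltas
```

Real implementations typically track dirty memory pages via hardware or OS support rather than diffing values, and also handle deleted state, which this sketch omits.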
Light-weight simulation of future HPC architectures at extreme scale (~100,000,000 cores) to evaluate the scalability and fault tolerance of key science algorithms [U.S. Department of Energy Institute for Advanced Architecture and Algorithms]
Enhancing productivity for scientific application development, deployment and execution by offering a common view across diverse HPC hardware and software platforms
Virtual system environments for “plug-and-play” supercomputing through desktop-to-cluster-to-petaflop computer system-level virtualization based on recent advances in hypervisor technologies
HPC reliability, availability and serviceability solutions, such as scalable membership management for MPI and asymmetric active/standby (n+m) replication for head and service nodes [U.S. Department of Energy Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)]
High availability for services running on HPC head and service nodes, such as Torque and PVFS MDS, using symmetric active/active (state-machine) replication with 99.9997% service uptime
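The availability gain from active/active replication follows directly from the fact that the service is down only when every replica is down at once. A short sketch of that arithmetic (the 98.58% single-node availability below is an illustrative assumption, chosen so that three replicas land near the 99.9997% figure; it is not a measured value from the project):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def combined_availability(a_single: float, n: int) -> float:
    """Availability of n active/active replicas, assuming independent
    replica failures (a simplification): down only when all n are down."""
    return 1.0 - (1.0 - a_single) ** n

def downtime_per_year(avail: float) -> float:
    """Expected downtime in seconds per year at a given availability."""
    return (1.0 - avail) * SECONDS_PER_YEAR

a1 = 0.9858   # hypothetical single-node availability (~5 days down/year)
for n in (1, 2, 3):
    a = combined_availability(a1, n)
    print(f"{n} replica(s): {a:.5%}, {downtime_per_year(a):,.0f} s/yr down")
```

With these assumed numbers, three replicas reach roughly 99.9997% availability, i.e. on the order of a minute and a half of expected downtime per year.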
Light-weight simulation of future HPC architectures (~1,000,000 processors) to evaluate scalability/fault tolerance of a new generation of super-scalable, naturally fault-tolerant scientific algorithms [IBM CRADA]
Pluggable lightweight heterogeneous distributed virtual machine (PVM successor) with an adaptive reconfigurable runtime environment, parallel plug-in paradigms, high availability, and fault-tolerant MPI