May 27th, 2015

Ongoing Research Activities

2015-…: Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

This project will increase the ability of scientific applications to reach accurate solutions in a timely and efficient manner. Using a novel design pattern concept, it identifies and evaluates repeatedly occurring resilience problems and coordinates solutions throughout high-performance computing hardware and software.

2015-…: Characterizing Faults, Errors, and Failures in Extreme-Scale Systems

This project identifies, categorizes and models the fault, error and failure properties of US Department of Energy high-performance computing (HPC) systems. It develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current systems and extrapolate this knowledge to exascale HPC systems.

2013-…: Hobbes – OS and Runtime Support for Application Composition

Operating system and runtime (OS/R) environment for extreme-scale scientific computing based on the Kitten OS and Palacios virtual machine monitor, including high-value, high-risk research exploring virtualization, analytics, networking, energy/power, scheduling/parallelism, architecture, resilience, programming models, and tools

2013-…: MCREX – Monte Carlo Resilient Exascale Solvers

Resilient Monte Carlo solvers with natural fault tolerance to hard and soft failures for exascale high-performance computing (HPC) systems

Past Research Activities

2012-14: Hardware/Software Resilience Co-Design Tools for Extreme-scale High-Performance Computing

HPC resilience co-design toolkit evaluating the resilience/power/performance cost/benefit trade-off of resilience solutions, identifying hardware/software resilience properties, and coordinating interfaces/responsibilities of individual hardware/software components

2011-12: Extreme-scale Algorithms and Software Institute

HPC hardware/software co-design toolkit evaluating the performance of algorithms on future HPC architectures at extreme scale with up to 134,217,728 (2^27) processor cores [U.S. Department of Energy Extreme-scale Algorithms and Software Institute (EASI)]

2009-11: Soft-Error Resilience for Future-Generation High-Performance Computing Systems

HPC checkpoint storage virtualization to improve efficiency by aggregating a variety of resources, such as memory, SSDs and disks; MPI process-level software redundancy to eliminate rollback/recovery in HPC; software-based ECC to enhance soft error protection; and soft-error injection tools to study the vulnerability of science applications and of CMOS logic in processors and memory

2008-11: Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond

Scalable HPC system monitoring, reliability analysis of components and full systems, fault prediction, proactive fault tolerance using prediction-triggered migration (process and virtual machine), incremental checkpoint/restart, and holistic fault tolerance (checkpoint/restart + migration) [U.S. Department of Energy Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)]

2008-11: Scalable Algorithms for Petascale Systems with Multicore Architectures

Light-weight simulation of future HPC architectures at extreme scale (~100,000,000 cores) to evaluate the scalability and fault tolerance of key science algorithms [U.S. Department of Energy Institute for Advanced Architecture and Algorithms]

2006-09: Harness Workbench: Unified and Adaptive Access to Diverse HPC Platforms

Enhancing productivity for scientific application development, deployment and execution by offering a common view across diverse HPC hardware and software platforms

2006-08: Virtualized System Environments for Petascale Computing and Beyond

Virtual system environments for “plug-and-play” supercomputing through system-level virtualization that spans desktop, cluster, and petaflop computers, based on recent advances in hypervisor technologies

2004-07: MOLAR – Modular Linux and Adaptive Runtime Support for High-End Computing

HPC reliability, availability and serviceability solutions, such as scalable membership management for MPI and asymmetric active/standby (n+m) replication for head and service nodes [U.S. Department of Energy Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)]

2004-06: Reliability, Availability, and Serviceability (RAS) for Terascale Computing

High availability for services running on HPC head and service nodes, such as Torque and PVFS MDS, using symmetric active/active (state-machine) replication with 99.9997% service uptime
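To put the 99.9997% uptime figure in perspective, the short sketch below converts an availability percentage into expected downtime per year. It is illustrative arithmetic only: the function name and the 365-day year are assumptions made here, and the project's own measurement method is not described in this listing.

```python
# Illustrative arithmetic only; the helper name and the 365-day year are
# assumptions, not part of the project described above.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a non-leap year


def annual_downtime_seconds(availability_percent: float) -> float:
    """Return the expected downtime per year, in seconds, for a given uptime percentage."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


if __name__ == "__main__":
    downtime = annual_downtime_seconds(99.9997)
    print(f"{downtime:.1f} seconds of downtime per year")
```

Under these assumptions, 99.9997% uptime corresponds to roughly 95 seconds of service downtime per year.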

2002-04: Super-Scalable Algorithms for Next-Generation High-Performance Cellular Architectures

Light-weight simulation of future HPC architectures (~1,000,000 processors) to evaluate scalability/fault tolerance of a new generation of super-scalable, naturally fault-tolerant scientific algorithms [IBM CRADA]

2000-05: Harness – Heterogeneous Distributed Computing

Pluggable lightweight heterogeneous distributed virtual machine (PVM successor) with an adaptive reconfigurable runtime environment, parallel plug-in paradigms, high availability, and fault-tolerant MPI
