Past Research

September 11th, 2019
2018-2019: rOpenMP: A Resilient Parallel Programming Model for Heterogeneous Systems

The rOpenMP project performs research to enable fine-grained resilience for supercomputers with accelerators that is more efficient than traditional application-level checkpoint/restart. The approach centers on a novel quality-of-service concept and corresponding extensions to the OpenMP parallel programming model.

2015-2019: Catalog: Characterizing Faults, Errors, and Failures in Extreme-Scale Systems

This project identifies, categorizes and models the fault, error and failure properties of US Department of Energy high-performance computing (HPC) systems. It develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current systems and extrapolate this knowledge to exascale HPC systems. [US Department of Energy Resilience for Extreme Scale Supercomputing Systems Program]

2013-16: Hobbes – OS and Runtime Support for Application Composition

Operating system and runtime (OS/R) environment for extreme-scale scientific computing based on the Kitten OS and the Palacios virtual machine monitor, including high-value, high-risk research exploring virtualization, analytics, networking, energy/power, scheduling/parallelism, architecture, resilience, programming models, and tools

2013-16: MCREX – Monte Carlo Resilient Exascale Solvers

This project develops resilient Monte Carlo solvers with natural fault tolerance to hard and soft failures for efficiently executing next-generation computational science applications on exascale high-performance computing (HPC) systems. It extends initial work in Monte Carlo Synthetic Acceleration (MCSA) and evaluates the developed solvers in a simulated extreme-scale system with realistic synthetic faults.

2012-14: Hardware/Software Resilience Co-Design Tools for Extreme-scale High-Performance Computing

HPC resilience co-design toolkit that evaluates the cost/benefit trade-offs of resilience solutions in terms of resilience, power, and performance; identifies hardware/software resilience properties; and coordinates the interfaces and responsibilities of individual hardware and software components

2011-12: Extreme-scale Algorithms and Software Institute

HPC hardware/software co-design toolkit evaluating the performance of algorithms on future HPC architectures at extreme scale with up to 134,217,728 (2^27) processor cores [US Department of Energy Extreme-scale Algorithms and Software Institute (EASI)]

2009-11: Soft-Error Resilience for Future-Generation High-Performance Computing Systems

HPC checkpoint storage virtualization to improve efficiency by aggregating a variety of resources, such as memory, SSDs and disks; MPI process-level software redundancy to eliminate rollback/recovery in HPC; software-based ECC to enhance soft error protection; and soft-error injection tools to study the vulnerability of science applications and of CMOS logic in processors and memory

2008-11: Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond

Scalable HPC system monitoring, reliability analysis of components and full systems, fault prediction, proactive fault tolerance using prediction-triggered migration (process and virtual machine), incremental checkpoint/restart, and holistic fault tolerance (checkpoint/restart + migration) [US Department of Energy Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)]

2008-11: Scalable Algorithms for Petascale Systems with Multicore Architectures

Light-weight simulation of future HPC architectures at extreme scale (~100,000,000 cores) to evaluate the scalability and fault tolerance of key science algorithms [U.S. Department of Energy Institute for Advanced Architecture and Algorithms]

2006-09: Harness Workbench: Unified and Adaptive Access to Diverse HPC Platforms

Enhancing productivity for scientific application development, deployment and execution by offering a common view across diverse HPC hardware and software platforms

2006-08: Virtualized System Environments for Petascale Computing and Beyond

Virtual system environments for “plug-and-play” supercomputing through desktop-to-cluster-to-petaflop computer system-level virtualization based on recent advances in hypervisor technologies

2004-07: MOLAR – Modular Linux and Adaptive Runtime Support for High-End Computing

HPC reliability, availability and serviceability solutions, such as scalable membership management for MPI and asymmetric active/standby (n+m) replication for head and service nodes [US Department of Energy Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)]

2004-06: Reliability, Availability, and Serviceability (RAS) for Terascale Computing

High availability for services running on HPC head and service nodes, such as Torque and PVFS MDS, using symmetric active/active (state-machine) replication with 99.9997% service uptime
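To give a rough sense of what an uptime figure like 99.9997% implies, the standard availability arithmetic for independent replicas can be sketched as below. The per-node availability and replica counts are illustrative assumptions for this sketch, not measured values from the project.

```python
# Availability arithmetic for a replicated service: the service is up
# as long as at least one of n independent, identical replicas is up.
# The 99.5% per-node availability below is an assumed value for
# illustration, not a figure from the RAS project.

def replicated_availability(node_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent nodes is up."""
    return 1.0 - (1.0 - node_availability) ** replicas

HOURS_PER_YEAR = 24 * 365

single = 0.995                                # one head node
duo = replicated_availability(single, 2)      # two active replicas

print(f"single node : {single:.4%}, "
      f"{(1 - single) * HOURS_PER_YEAR:.1f} h/yr downtime")
print(f"two replicas: {duo:.4%}, "
      f"{(1 - duo) * HOURS_PER_YEAR:.2f} h/yr downtime")
```

Under these assumed numbers, a single 99.5%-available node is down roughly 44 hours a year, while two active replicas cut that to well under an hour; reaching "four nines and beyond" figures like the one quoted above requires either more replicas or higher per-node availability.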

2002-04: Super-Scalable Algorithms for Next-Generation High-Performance Cellular Architectures

Light-weight simulation of future HPC architectures (~1,000,000 processors) to evaluate scalability/fault tolerance of a new generation of super-scalable, naturally fault-tolerant scientific algorithms [IBM CRADA]

2000-05: Harness – Heterogeneous Distributed Computing

Pluggable lightweight heterogeneous distributed virtual machine (PVM successor) with an adaptive reconfigurable runtime environment, parallel plug-in paradigms, high availability, and fault-tolerant MPI
