Solutions – Christian Engelmann, Ph.D.

System Software for the Instrument-to-Edge-to-Center Computing Continuum

The INTERSECT Federated Architecture for the Laboratory of the Future: The open Interconnected Science Ecosystem (INTERSECT) architecture connects scientific instruments and robot-controlled laboratories with computing and data resources at the edge, in the cloud, or at the high-performance computing center to enable autonomous experiments, self-driving laboratories, smart manufacturing, and artificial intelligence driven design, discovery and evaluation. Its novel approach consists of science use case design patterns, a system of systems architecture, and a microservice architecture.

Lightweight Simulation of Future-Generation Extreme-Scale Supercomputers

xSim: The Extreme-scale Simulator: The Extreme-scale Simulator (xSim) runs native high-performance computing applications with millions of concurrent execution threads in a controlled environment, while observing performance and resilience in a simulated extreme-scale system for application-architecture co-design. It is able to simulate a supercomputer with 134,217,728 (2²⁷) processes using a 960-core Linux cluster. It also offers process failure and soft error injection to study propagation, detection, notification and mitigation.
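To give a sense of the scale involved, the figures above can be turned into an oversubscription ratio. This is a back-of-the-envelope sketch only: it assumes the simulated processes are spread evenly across the physical cores, which the description does not state, and the variable names are illustrative.

```python
# Figures taken from the description above.
simulated_processes = 2 ** 27   # 134,217,728 simulated processes
physical_cores = 960            # cores in the Linux cluster

# Assumption: simulated processes are distributed evenly over the cores.
processes_per_core = simulated_processes / physical_cores

print(f"{simulated_processes:,} simulated processes on {physical_cores} cores")
print(f"about {processes_per_core:,.0f} simulated processes per physical core")
```

Under that assumption, each physical core hosts on the order of 140,000 simulated processes, which is why a lightweight simulation approach is needed.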

High-Performance Computing Resilience

Resilience Design Patterns: Resilience design patterns offer a new, structured hardware/software design approach for improving resilience by identifying and evaluating repeatedly occurring resilience problems and coordinating corresponding solutions. They permit resilience to become an integral part of the high-performance computing hardware/software ecosystem through co-design, such that the burden for providing resilience is on the system by design and not on the operator or user as an afterthought.
Characterization of Faults, Errors, and Failures in Extreme-Scale Systems: Understanding high-performance computing system reliability characteristics requires analyzing system records. The amount of data and the complexity of the underlying problem are often beyond what humans can process and comprehend. This solution offers a taxonomy, a catalog and models that capture the observed and inferred fault, error and failure conditions in current supercomputers and extrapolates this knowledge to future systems.
redMPI: A Redundant Message Passing Interface Implementation: RedMPI enables transparent redundant execution of Message Passing Interface (MPI) applications to protect against silent data corruption, i.e., undetected bit flips. It can also be used as a fault injection tool: with the online error correction disabled and the replicas kept isolated, the online error detection mechanism compares error-free and erroneous executions to track the propagation of corrupt messages.
Proactive Fault Tolerance Framework: Proactive fault tolerance is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating parts of an application (task, process, or virtual machine) away from components that are about to fail. The proactive fault tolerance framework consists of a number of individual proof-of-concept prototypes, including process and virtual machine migration, scalable system monitoring, and online/offline system health analysis.
Hybrid Full/Incremental System-level Checkpointing: This operating system and runtime environment solution combines scalable group membership management, reuse of network connections, transparent coordinated checkpoint scheduling, a job pause feature, and full/incremental checkpointing. The job pause allows compute nodes to remain active and roll back parallel applications to the last checkpoint. The hybrid checkpointing alternates between full and incremental checkpoints, where an incremental checkpoint captures only the data that changed since the last checkpoint.
Symmetric Active/Active High Availability for HPC System Services: This solution provides high availability for high-performance computing system services with efficient redundancy strategies for head and service nodes. It relies on virtual synchrony, i.e., state-machine replication, utilizing a process group communication system for service group membership management and for reliable, totally ordered message delivery. The proof-of-concept prototypes offer 99.9997% availability for the Torque resource manager and the Parallel Virtual File System Metadata Server.
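The 99.9997% availability figure quoted for the symmetric active/active prototypes is easier to interpret as expected downtime. The following is a minimal sketch of that conversion, assuming the figure is a steady-state availability applied over a 365-day year; the description does not state the measurement period, and the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def annual_downtime_seconds(availability_percent: float) -> float:
    """Expected unavailable time per year for a given availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


downtime = annual_downtime_seconds(99.9997)
print(f"{downtime:.1f} seconds of expected downtime per year")
```

Under these assumptions, the quoted availability corresponds to roughly a minute and a half of service downtime per year.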