2013-16: Hobbes – OS and Runtime Support for Application Composition
This project intends to deliver an operating system and runtime (OS/R) environment for extreme-scale scientific computing. With application composition as the fundamental driving force, we will develop the OS/R interfaces and low-level system services required to support the isolation and sharing needed to design and implement applications, as well as performance and correctness tools. Our approach will also support complex simulation and analysis workflows. A workflow’s components will likely consist of a wide range of parallel codes with different OS/R requirements. One example is a relatively complicated multi-physics workflow that incorporates data from three different types of legacy codes (MPI only, PGAS languages, and MPI with threading) and requires components for analytics, visualization, uncertainty quantification, memory profiling, and performance analysis. Instead of a single unified OS/R to support every conceivable requirement, we propose a lightweight OS/R system with the flexibility to custom-build runtimes for any particular purpose. Each component executes in its own “enclave” with a specialized runtime and isolation properties. A global runtime system provides the software required to compose applications out of a collection of enclaves, join them through secure and low-latency communication, and schedule them to avoid contention and maximize resource utilization. The benefits gained from lightweight and customizable runtimes include predictable and consistent memory and network patterns, manageable resilience properties, and measurable power and energy characteristics. These benefits simplify algorithm design and development at large scale.
Project deliverables are: (1) an OS/R stack based on the Kitten OS and Palacios virtual machine monitor and (2) high-value, high-risk research that leverages the architecture of the base OS/R to explore issues of specific interest to exascale, e.g., virtualization, analytics, networking, energy/power, scheduling/parallelism, architecture, resilience, programming models, and tools. For more information, please visit xstack.sandia.gov/hobbes.
Funding Sources
- Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy
Participating Institutions
- Sandia National Laboratories
- Georgia Institute of Technology
- Indiana University
- Los Alamos National Laboratory
- Lawrence Berkeley National Laboratory
- North Carolina State University
- Northwestern University
- Oak Ridge National Laboratory
- University of Pittsburgh
- University of Arizona
- University of California Berkeley
- University of New Mexico
- University of Texas at El Paso
- University of Tennessee, Knoxville
Important Publications
- Amogh Katti, Giuseppe Di Fatta, Thomas Naughton, and Christian Engelmann. Epidemic Failure Detection and Consensus for Extreme Parallelism. International Journal of High Performance Computing Applications (IJHPCA), volume 32, number 5, pages 729-743, 2018. SAGE Publications. ISSN 1094-3420. DOI 10.1177/1094342017690910.
- Thomas Naughton, Christian Engelmann, Geoffroy Vallée, Ferrol Aderholdt, and Stephen L. Scott. A Cooperative Approach to Virtual Machine Based Fault Injection. In Lecture Notes in Computer Science: Proceedings of the 22nd European Conference on Parallel and Distributed Computing (Euro-Par) 2016 Workshops: 9th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 671-682, Grenoble, France, August 23, 2016. Springer Verlag, Berlin, Germany. ISBN 978-3-319-58943-5. ISSN 0302-9743. Acceptance rate 55.6% (5/9).
- David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Mini-Ckpts: Surviving OS Failures in Persistent Memory. In Proceedings of the 30th ACM International Conference on Supercomputing (ICS) 2016, pages 7:1-7:14, Istanbul, Turkey, June 1-3, 2016. ACM Press, New York, NY, USA. ISBN 978-1-4503-4361-9. Acceptance rate 24.2% (43/178).
- Amogh Katti, Giuseppe Di Fatta, Thomas Naughton, and Christian Engelmann. Scalable and Fault Tolerant Failure Detection and Consensus. In Proceedings of the 22nd European MPI Users' Group Meeting (EuroMPI) 2015, pages 13:1-13:9, Bordeaux, France, September 21-24, 2015. ACM Press, New York, NY, USA. ISBN 978-1-4503-3795-3.
- Thomas Naughton, Garry Smith, Christian Engelmann, Geoffroy Vallée, Ferrol Aderholdt, and Stephen L. Scott. What is the right balance for performance and isolation with virtualization in HPC? In Lecture Notes in Computer Science: Proceedings of the 20th European Conference on Parallel and Distributed Computing (Euro-Par) 2014 Workshops: 7th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 570-581, Porto, Portugal, August 25, 2014. Springer Verlag, Berlin, Germany. ISBN 978-3-319-14325-5. ISSN 0302-9743. Acceptance rate 60.0% (6/10).