2008-11: Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond
This project develops scalable technologies that provide high-level RAS for next-generation petascale scientific high-end computing (HEC) resources and beyond, as outlined by the U.S. Department of Energy (DOE) Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS) and the U.S. National Coordination Office for Networking and Information Technology Research and Development (NCO/NITRD) High-End Computing Revitalization Task Force (HECRTF) activities. Based on virtualized adaptation, reconfiguration, and preemptive measures, the ultimate goal is non-stop scientific computing on a 24×7 basis without interruption. The technical approach leverages system-level virtualization technology to enable transparent proactive and reactive fault tolerance mechanisms on extreme-scale HEC systems. This effort targets: (1) reliability analysis for identifying pre-fault indicators, predicting failures, and modeling and monitoring component and system reliability, (2) proactive fault tolerance technology based on preemptive migration away from components that are about to fail, (3) reactive fault tolerance enhancements, such as adapting checkpoint interval and placement to actual and predicted system health threats, and (4) holistic fault tolerance through a combination of adaptive proactive and reactive fault tolerance. For more information, please visit www.fastos.org/ras.
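As an illustration of target (3), the sketch below shows one way a checkpoint interval could be shortened when monitoring predicts a health threat. It is not the project's implementation: it assumes Young's well-known first-order approximation of the optimal interval, sqrt(2 × checkpoint cost × MTBF), and the `health_factor` parameter and both function names are hypothetical, introduced only for this example.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds.
    mtbf_s: mean time between failures of the system, in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def adapted_interval(checkpoint_cost_s: float,
                     nominal_mtbf_s: float,
                     health_factor: float) -> float:
    """Shrink the interval when system health degrades.

    health_factor is a hypothetical value in (0, 1]: 1.0 means healthy,
    smaller values mean monitoring predicts a shorter time to failure.
    The nominal MTBF is scaled by this factor before applying Young's formula.
    """
    if not 0.0 < health_factor <= 1.0:
        raise ValueError("health_factor must be in (0, 1]")
    return young_interval(checkpoint_cost_s, nominal_mtbf_s * health_factor)


if __name__ == "__main__":
    cost, mtbf = 50.0, 10_000.0
    for factor in (1.0, 0.5, 0.25):
        interval = adapted_interval(cost, mtbf, factor)
        print(f"health_factor={factor:.2f} -> checkpoint every {interval:.0f} s")
```

Under these assumptions, a lower predicted MTBF yields more frequent checkpoints, while a healthy system checkpoints less often and spends less time on checkpoint overhead.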
Solutions
Participating Institutions
Funding Sources
- Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy
Important Publications
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Journal of Parallel and Distributed Computing (JPDC), volume 72, number 2, pages 254-267, 2012. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0743-7315.
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Hybrid Checkpointing for MPI Jobs in HPC Environments. In Proceedings of the 16th IEEE International Conference on Parallel and Distributed Systems (ICPADS) 2010, pages 524-533, Shanghai, China, December 8-10, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4307-9. Acceptance rate 29.6% (77/188).
- Swen Böhm, Christian Engelmann, and Stephen L. Scott. Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments. In Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications (HPCC) 2010, pages 72-78, Melbourne, Australia, September 1-3, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4214-0. Acceptance rate 19.1% (58/304).
- Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, and Stephen L. Scott. Proactive Fault Tolerance Using Preemptive Migration. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, pages 252-257, Weimar, Germany, February 18-20, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3544-9. ISSN 1066-6192. Acceptance rate 42.0%.
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of the 21st IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008, pages 1-12, Austin, TX, USA, November 15-21, 2008. ACM Press, New York, NY, USA. ISBN 978-1-4244-2835-9. Acceptance rate 21.3% (59/277).