Hybrid Full/Incremental System-level Checkpointing
Checkpoint/restart has become a requirement for long-running parallel jobs in large-scale high-performance computing (HPC) systems due to a mean-time-to-failure (MTTF) in the order of hours. After a failure, checkpoint/restart mechanisms generally require a complete restart of a Message Passing Interface (MPI) job from the last saved checkpoint. A complete restart, however, is unnecessary since all but one compute node are typically still alive. Furthermore, a restart may result in lengthy job requeuing even though the original job had not exceeded its time quantum. Moreover, system-level checkpointing solutions capture full process images, even though only a subset of the process image changes between checkpoints.
The developed proof-of-concept prototype includes enhancements in support of scalable group communication for membership management, reuse of network connections, transparent coordinated checkpoint scheduling, a job pause feature, and full/incremental checkpointing. It is based on the Local Area Multicomputer MPI implementation (LAM/MPI) and the Berkeley Lab Checkpoint/Restart (BLCR) solution. The transparent mechanism for job pause allows live nodes to remain active and roll back to the last checkpoint, while failed nodes are dynamically replaced by spares before resuming from the last checkpoint. A minimal overhead of 5.6% is incurred in case migration takes place, while the regular checkpoint overhead remains unchanged. The hybrid checkpointing technique alternates between full and incremental checkpoints: At incremental checkpoints, only data changed since the last checkpoint is captured. This results in significantly reduced checkpoint sizes and overheads with only moderate increases in restart overhead. After accounting for cost and savings, benefits due to incremental checkpoints are an order of magnitude larger than overheads on restarts.
Membership stabilization after a failure
Incremental checkpoint file structure
Hybrid full/incremental checkpoint savings
- Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy
Symbols: Abstract, Publication, Presentation, BibTeX Citation, DOI Link
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Hybrid Checkpointing for MPI Jobs in HPC Environments. In Proceedings of the 16th IEEE International Conference on Parallel and Distributed Systems (ICPADS) 2010, pages 524-533, Shanghai, China, December 8-10, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4307-9. Acceptance rate 29.6% (77/188).
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2007, pages 1-10, Long Beach, CA, USA, March 26-30, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1. Acceptance rate 26% (109/419).
- Jyothish Varma, Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems. In Proceedings of the 20th ACM International Conference on Supercomputing (ICS) 2006, pages 219-228, Cairns, Australia, June 28-30, 2006. ACM Press, New York, NY, USA. ISBN 1-59593-282-8. Acceptance rate 26.2% (37/141).