Hybrid Full/Incremental System-level Checkpointing

June 27th, 2013

Checkpoint/restart has become a requirement for long-running parallel jobs in large-scale high-performance computing (HPC) systems because the mean-time-to-failure (MTTF) is on the order of hours. After a failure, checkpoint/restart mechanisms generally require a complete restart of a Message Passing Interface (MPI) job from the last saved checkpoint. A complete restart, however, is unnecessary since all but one compute node are typically still alive. Furthermore, a restart may result in lengthy job requeuing even though the original job had not exceeded its time quantum. Moreover, system-level checkpointing solutions capture full process images, even though only a subset of the process image changes between checkpoints.

The developed proof-of-concept prototype includes enhancements in support of scalable group communication for membership management, reuse of network connections, transparent coordinated checkpoint scheduling, a job pause feature, and full/incremental checkpointing. It is based on the Local Area Multicomputer MPI implementation (LAM/MPI) and the Berkeley Lab Checkpoint/Restart (BLCR) solution. The transparent job pause mechanism allows live nodes to remain active and roll back to the last checkpoint, while failed nodes are dynamically replaced by spares before the job resumes from the last checkpoint. An overhead of only 5.6% is incurred when migration takes place, while the regular checkpoint overhead remains unchanged. The hybrid checkpointing technique alternates between full and incremental checkpoints: at incremental checkpoints, only data that changed since the last checkpoint is captured. This results in significantly reduced checkpoint sizes and overheads with only moderate increases in restart overhead. After accounting for costs and savings, the benefits of incremental checkpoints are an order of magnitude larger than the added overheads on restarts.
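
The alternation between full and incremental checkpoints can be illustrated with a small user-level sketch. It is not the prototype's implementation: the names `HybridCheckpointer`, `full_every`, and `PAGE_SIZE` are hypothetical, the process image is modeled as a plain byte string, and changed pages are detected by comparing hashes, which is an assumption made only to keep the example self-contained. A restart replays the most recent full checkpoint and then every later incremental one, which is where the moderate increase in restart overhead comes from.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page granularity, for illustration only


def split_pages(memory: bytes, page_size: int = PAGE_SIZE):
    """Split a flat memory image into fixed-size pages."""
    return [memory[i:i + page_size] for i in range(0, len(memory), page_size)]


class HybridCheckpointer:
    """Toy model: every `full_every`-th checkpoint is full, the rest incremental."""

    def __init__(self, full_every: int = 4):
        self.full_every = full_every   # hypothetical policy parameter
        self.count = 0                 # checkpoints taken so far
        self.last_hashes = []          # page hashes at the previous checkpoint
        self.checkpoints = []          # (kind, {page_index: page_bytes}, num_pages)

    def checkpoint(self, memory: bytes):
        """Save a checkpoint; return its kind and the number of bytes saved."""
        pages = split_pages(memory)
        hashes = [hashlib.sha256(p).digest() for p in pages]
        if self.count % self.full_every == 0:
            kind = "full"
            saved = dict(enumerate(pages))  # capture the whole image
        else:
            kind = "incremental"
            # Capture only pages that are new or differ from the last checkpoint.
            saved = {
                i: p for i, p in enumerate(pages)
                if i >= len(self.last_hashes) or hashes[i] != self.last_hashes[i]
            }
        self.checkpoints.append((kind, saved, len(pages)))
        self.last_hashes = hashes
        self.count += 1
        return kind, sum(len(p) for p in saved.values())

    def restore(self) -> bytes:
        """Rebuild the latest image: last full checkpoint plus later increments."""
        if not self.checkpoints:
            raise ValueError("no checkpoint has been taken yet")
        start = max(i for i, (kind, _, _) in enumerate(self.checkpoints)
                    if kind == "full")
        image, num_pages = {}, 0
        for _, saved, num_pages in self.checkpoints[start:]:
            image.update(saved)
            # Drop pages beyond the image size recorded at this checkpoint.
            image = {i: p for i, p in image.items() if i < num_pages}
        return b"".join(image[i] for i in range(num_pages))
```

In this sketch, an incremental checkpoint taken after a single byte is modified saves one page rather than the whole image, while `restore()` has to read the full checkpoint and each subsequent increment.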

Membership stabilization after a failure

Incremental checkpoint file structure

Hybrid full/incremental checkpoint savings

