Symmetric Active/Active High Availability for HPC System Services

June 27th, 2013

This work paves the way for high availability in high-performance computing (HPC) by focusing on efficient redundancy strategies for head and service nodes. These nodes are single points of failure and control for an entire HPC system: if one of them fails, the system is inaccessible and unmanageable until it is manually repaired. The approach relies on virtual synchrony, i.e., state-machine replication, utilizing a process group communication system for service group membership management and reliable, totally ordered message delivery. This replication method may be implemented internally, i.e., by modifying the service to be replicated to support redundant instances, or externally, i.e., by wrapping around an unmodified service, replicating input to multiple instances, and unifying output from these instances. Internal replication usually offers better performance, while external replication is typically easier to implement.
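The external variant described above can be illustrated with a minimal Python sketch. It is not the code of any system discussed here: `CounterService`, `Sequencer`, and `ExternalReplicator` are hypothetical names, the sequencer is a single-process toy standing in for a real process group communication system, and the wrapped service is assumed to be deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class CounterService:
    """Hypothetical stand-in for an unmodified, deterministic service."""
    jobs: list = field(default_factory=list)

    def handle(self, request: str) -> str:
        # The reply depends only on the sequence of requests seen so far.
        self.jobs.append(request)
        return f"job {len(self.jobs)}: {request}"


class Sequencer:
    """Toy substitute for a group communication system (assumption).

    It hands out one global sequence number per message, so every
    replica is given the same messages in the same order.
    """

    def __init__(self):
        self.next_seq = 0

    def order(self, message):
        seq = self.next_seq
        self.next_seq += 1
        return seq, message


class ExternalReplicator:
    """Wraps unmodified service instances without touching their code."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.sequencer = Sequencer()

    def submit(self, request):
        # 1. Put the input into a single total order.
        _seq, message = self.sequencer.order(request)
        # 2. Replicate the input to every service instance.
        outputs = [replica.handle(message) for replica in self.replicas]
        # 3. Unify the output: deterministic replicas must agree.
        if len(set(outputs)) != 1:
            raise RuntimeError("replicas diverged")
        return outputs[0]

    def remove_replica(self, index):
        # Models a membership change after a node failure.
        del self.replicas[index]
```

Because every surviving instance has processed the same ordered input, removing a failed instance loses no state; the remaining instances keep answering with the same output.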

One of the most important HPC system services running on the head node is the job and resource manager. If it goes down, all currently running jobs lose the service they report back to and have to be restarted once the head node is up and running again. JOSHUA is a generic solution that provides a virtually synchronous environment for continuous availability without any interruption of service and without any loss of state. Replication is performed externally via the Portable Batch System (PBS) service interface without the need to modify any service code. Because the PBS service interface is widely supported, this makes JOSHUA portable to most HPC job and resource managers. The test results and the availability analysis of our proof-of-concept prototype implementation show that JOSHUA can provide continuous availability with acceptable performance.

The metadata service (MDS) of a networked parallel file system is another critical single point of failure. An interruption of service typically results in the failure of currently running applications that use its file system. A loss of state requires repairing the entire file system, which could take days on large-scale systems, and may cause loss of data. The developed proof-of-concept prototype for the MDS of the Parallel Virtual File System (PVFS) offers symmetric active/active replication using virtual synchrony with an internal replication implementation. In addition to providing high availability, this solution takes advantage of the internal replication implementation by load balancing MDS read requests, improving performance over the non-replicated MDS. The results show that MDS high availability can be achieved with acceptable performance.
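The read load balancing mentioned above works because reads do not change state, so any single replica can answer them, while writes must reach every replica in the same order. The following Python sketch illustrates that split. It is not the PVFS prototype: `MetadataReplica` and `ReplicatedMDS` are hypothetical names, round-robin is an assumed balancing policy, and a plain loop stands in for the totally ordered delivery that a group communication system would provide.

```python
import itertools


class MetadataReplica:
    """Hypothetical in-memory metadata table held by one MDS instance."""

    def __init__(self):
        self.table = {}
        self.reads_served = 0

    def apply_write(self, path, attrs):
        self.table[path] = dict(attrs)

    def read(self, path):
        self.reads_served += 1
        return self.table.get(path)


class ReplicatedMDS:
    """Writes go to all replicas; reads are spread across them."""

    def __init__(self, n):
        self.replicas = [MetadataReplica() for _ in range(n)]
        self._next = itertools.cycle(range(n))

    def write(self, path, attrs):
        # Assumption: this loop stands in for reliable, totally ordered
        # delivery, so every replica applies the same writes in order.
        for replica in self.replicas:
            replica.apply_write(path, attrs)

    def read(self, path):
        # Reads are state-free, so a single replica can serve each one.
        # Round-robin is an assumed policy, chosen for simplicity.
        return self.replicas[next(self._next)].read(path)
```

In this sketch each write costs work on every replica, while each read costs work on only one, which is consistent with read throughput improving as replicas are added.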

Assuming a mean time to failure (MTTF) of 5,000 hours for a single head or service node, the presented solutions improve service availability from 99.285% (2 nines) for a single node to 99.995% (4 nines) in a two-node system, and to 99.99996% (6 nines) with three nodes. Replication beyond three nodes may not provide further availability benefits, only performance increases due to load balancing, because beyond three nodes catastrophic incidents such as earthquakes, hurricanes, and tornadoes have a higher impact on availability than component failures.
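The arithmetic behind these figures can be checked with a short Python sketch. It uses the standard formulas for a single node, A = MTTF / (MTTF + MTTR), and for n independent redundant nodes, A_n = 1 - (1 - A)^n. The mean time to recover (MTTR) is not stated above; the value of 36 hours used here is an assumption, chosen because it reproduces the quoted single-node availability of 99.285%. The function names are illustrative.

```python
import math


def unavailability(mttf_hours, mttr_hours, nodes=1):
    """Probability that all `nodes` independent nodes are down at once."""
    single = mttr_hours / (mttf_hours + mttr_hours)
    return single ** nodes


def availability(mttf_hours, mttr_hours, nodes=1):
    """Service is available as long as at least one node is up."""
    return 1.0 - unavailability(mttf_hours, mttr_hours, nodes)


def nines(mttf_hours, mttr_hours, nodes=1):
    """Number of leading nines in the availability figure."""
    return math.floor(-math.log10(unavailability(mttf_hours, mttr_hours, nodes)))


MTTF = 5000  # hours, as stated in the text
MTTR = 36    # hours, assumed; not given in the text

for n in (1, 2, 3):
    a = availability(MTTF, MTTR, n)
    print(f"{n} node(s): {a * 100:.5f}% ({nines(MTTF, MTTR, n)} nines)")
```

Under that assumption the sketch yields about 99.285%, 99.995%, and 99.99996% for one, two, and three nodes, matching the figures above. The model treats node failures as independent, which is exactly the assumption that correlated events such as natural disasters break.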


External replication of the PBS service


Internal replication of the PVFS MDS service


PVFS MDS service read performance


PVFS MDS service write performance


Service availability improvement

Important Publications


  1. Xubin (Ben) He, Li Ou, Christian Engelmann, Xin Chen, and Stephen L. Scott. Symmetric Active/Active Metadata Service for High Availability Parallel File Systems. Journal of Parallel and Distributed Computing (JPDC), volume 69, number 12, pages 961-973, 2009. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0743-7315.
  2. Christian Engelmann. Symmetric Active/Active High Availability for High-Performance Computing System Services. PhD thesis, Department of Computer Science, University of Reading, UK, 2008. Thesis research performed at Oak Ridge National Laboratory. Advisor: Prof. Vassil N. Alexandrov (University of Reading).
  3. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active Replication for Dependent Services. In Proceedings of the 3rd International Conference on Availability, Reliability and Security (ARES) 2008, pages 260-267, Barcelona, Spain, March 4-7, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3102-1. Acceptance rate 21.1% (40/190).
  4. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Transparent Symmetric Active/Active Replication for Service-Level High Availability. In Proceedings of the 7th IEEE International Symposium on Cluster Computing and the Grid (CCGrid) 2007: 7th International Workshop on Global and Peer-to-Peer Computing (GP2PC) 2007, pages 755-760, Rio de Janeiro, Brazil, May 14-17, 2007. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2833-3.
  5. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active High Availability for High-Performance Computing System Services. Journal of Computers (JCP), volume 1, number 8, pages 43-54, 2006. Academy Publisher, Oulu, Finland. ISSN 1796-203X.
  6. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Active/Active Replication for Highly Available HPC System Services. In Proceedings of the 1st International Conference on Availability, Reliability and Security (ARES) 2006: 1st International Workshop on Frontiers in Availability, Reliability and Security (FARES) 2006, pages 639-645, Vienna, Austria, April 20-22, 2006. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2567-9.