2004-06: Reliability, Availability, and Serviceability (RAS) for Terascale Computing
The goal of this work was to produce a proof-of-concept solution that removes the numerous single points of failure in large systems while improving scalability and access to systems and data. Our research effort focused on efficient redundancy strategies for head and service nodes and on a distributed storage infrastructure. We developed replication mechanisms that provide symmetric active/active high availability for services running on head and service nodes, offering the highest level of availability without significantly impacting performance. The implemented prototypes for the batch job management system, TORQUE, and for the parallel virtual file system (PVFS) metadata server offer 99.9997% service uptime using just three redundant nodes. For distributed data storage, the developed FreeLoader solution is built on a storage substrate of contributed desktop disk space. We developed parallel I/O mechanisms to store data on and retrieve it from networked workstations, as well as caching mechanisms to keep recently used datasets available. FreeLoader can offer high retrieval rates for large datasets using novel striping strategies. It can also be used as a virtual cache that stores only the prefixes of datasets yet delivers each dataset in full, masking the cost of patching in the missing suffix.
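The 99.9997% uptime figure quoted above for three redundant nodes can be related to a standard parallel-redundancy estimate. The following is a minimal sketch, not the project's own model: it assumes that nodes fail independently, that the service stays up as long as at least one node is up, and that each node has a mean time to failure of 5000 hours and a mean time to repair of 72 hours. Those two values, and all function names, are illustrative assumptions that this summary does not state.

```python
HOURS_PER_YEAR = 8760.0


def node_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one node: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def service_availability(node_avail: float, nodes: int) -> float:
    """Availability of a service replicated on `nodes` independent nodes.

    The service is down only when every node is down at the same time,
    so unavailability is (1 - node_avail) ** nodes.
    """
    if nodes < 1:
        raise ValueError("at least one node is required")
    return 1.0 - (1.0 - node_avail) ** nodes


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year implied by an availability value."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Assumed, illustrative node characteristics (not from this summary).
    single = node_availability(mttf_hours=5000.0, mttr_hours=72.0)
    for n in range(1, 5):
        avail = service_availability(single, n)
        print(f"{n} node(s): {avail * 100:.4f}% available, "
              f"{downtime_hours_per_year(avail):.4f} h downtime per year")
```

Under these assumptions, three nodes give an availability that rounds to 99.9997%. Real deployments can fall short of the estimate when failures are correlated, for example through shared power or network infrastructure.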
Solutions
Participating Institutions
Funding Sources
- Laboratory Directed Research and Development, Oak Ridge National Laboratory
Important Publications
- Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active High Availability for High-Performance Computing System Services: Accomplishments and Limitations. In Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid) 2008: Workshop on Resiliency in High Performance Computing (Resilience) 2008, pages 813-818, Lyon, France, May 19-22, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3156-4.
- Li Ou, Christian Engelmann, Xubin (Ben) He, Xin Chen, and Stephen L. Scott. Symmetric Active/Active Metadata Service for Highly Available Cluster Storage Systems. In Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS) 2007, Cambridge, MA, USA, November 19-21, 2007. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-703-1.
- Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active High Availability for High-Performance Computing System Services. Journal of Computers (JCP), volume 1, number 8, pages 43-54, 2006. Academy Publisher, Oulu, Finland. ISSN 1796-203X. DOI 10.4304/jcp.1.8.43-54.