2004-06: Reliability, Availability, and Serviceability (RAS) for Terascale Computing

This project produces proof-of-concept solutions that enable the removal of the numerous single points of failure in large systems while improving scalability and access to systems and data. Our research effort focuses on efficient redundancy strategies for head and service nodes as well as on a distributed storage infrastructure.

We develop replication mechanisms that provide symmetric active/active high availability for services running on head and service nodes, offering the highest level of availability without significantly impacting performance. The targeted prototypes for the TORQUE batch job management system and the Parallel Virtual File System (PVFS) metadata server can offer 99.9997% service uptime using just three redundant nodes.
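As a rough illustration of how an uptime figure of this kind can arise, the sketch below computes the availability of a service replicated across independent nodes. The node MTTF of 5,000 hours and MTTR of 72 hours are assumptions chosen for illustration and are not stated on this page, as is the simplification that nodes fail independently and the service stays up while at least one replica is running.

```python
def node_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one node: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def service_availability(node_avail: float, nodes: int) -> float:
    """Availability of a service that survives while any replica is up.

    Assumes independent node failures, so the service is down only
    when all replicas are down at the same time.
    """
    return 1.0 - (1.0 - node_avail) ** nodes


# Assumed, illustrative values (not taken from this page).
ASSUMED_MTTF_HOURS = 5000.0
ASSUMED_MTTR_HOURS = 72.0

single = node_availability(ASSUMED_MTTF_HOURS, ASSUMED_MTTR_HOURS)
for n in (1, 2, 3):
    uptime_percent = service_availability(single, n) * 100
    print(f"{n} node(s): {uptime_percent:.4f}% uptime")
```

Under these assumed inputs, three replicas yield roughly 99.9997% uptime. Different MTTF/MTTR values or correlated failures would change the result.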

For distributed data storage, the proposed FreeLoader solution is built on a substrate of contributed desktop storage. We develop parallel I/O mechanisms for storing data on and retrieving data from networked workstations, as well as caching mechanisms that keep recently used datasets available. FreeLoader can offer high retrieval rates for large datasets using novel striping strategies. It can also be used as a virtual cache that stores only the prefixes of datasets yet delivers each dataset in full by masking the patching of the missing suffix.
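To make the idea of striping a dataset across contributed workstations concrete, here is a minimal sketch using plain round-robin placement. This is a generic textbook scheme, not FreeLoader's actual striping strategy, and the function names, node names, and chunk size are all hypothetical.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[int, bytes]  # (chunk index, chunk payload)


def stripe_round_robin(data: bytes, nodes: List[str], chunk_size: int) -> Dict[str, List[Chunk]]:
    """Split data into fixed-size chunks and assign them to nodes in turn."""
    placement: Dict[str, List[Chunk]] = {node: [] for node in nodes}
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        node = nodes[index % len(nodes)]
        placement[node].append((index, data[offset:offset + chunk_size]))
    return placement


def reassemble(placement: Dict[str, List[Chunk]]) -> bytes:
    """Rebuild the dataset by ordering all chunks by their index."""
    chunks = sorted(chunk for node_chunks in placement.values() for chunk in node_chunks)
    return b"".join(payload for _, payload in chunks)


# Hypothetical workstations contributing storage.
workstations = ["ws-a", "ws-b", "ws-c"]
dataset = bytes(range(10))

layout = stripe_round_robin(dataset, workstations, chunk_size=3)
for node, chunks in layout.items():
    print(node, [index for index, _ in chunks])
```

Because consecutive chunks land on different workstations, a reader can fetch several chunks in parallel, which is the general reason striping improves retrieval rates for large datasets.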

Prominent Solutions

Funding Sources

Participating Institutions

Peer-reviewed Journal Publications

  1. Xubin (Ben) He, Li Ou, Christian Engelmann, Xin Chen, and Stephen L. Scott. Symmetric Active/Active Metadata Service for High Availability Parallel File Systems. Journal of Parallel and Distributed Computing (JPDC), volume 69, number 12, pages 961-973, December 1, 2009. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0743-7315. DOI 10.1016/j.jpdc.2009.08.004.
  2. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active High Availability for High-Performance Computing System Services. Journal of Computers (JCP), volume 1, number 8, pages 43-54, December 1, 2006. Academy Publisher, Oulu, Finland. ISSN 1796-203X. DOI 10.4304/jcp.1.8.43-54.

Peer-reviewed Conference Publications

  1. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active Replication for Dependent Services. In Proceedings of the 3rd International Conference on Availability, Reliability and Security (ARES) 2008, pages 260-267, Barcelona, Spain, March 4-7, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3102-1. DOI 10.1109/ARES.2008.64. Acceptance rate 21.1% (40/190).
  2. Li Ou, Christian Engelmann, Xubin (Ben) He, Xin Chen, and Stephen L. Scott. Symmetric Active/Active Metadata Service for Highly Available Cluster Storage Systems. In Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS) 2007, Cambridge, MA, USA, November 19-21, 2007. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-703-1. Acceptance rate 49%.
  3. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. On Programming Models for Service-Level High Availability. In Proceedings of the 2nd International Conference on Availability, Reliability and Security (ARES) 2007, pages 999-1006, Vienna, Austria, April 10-13, 2007. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2775-2. DOI 10.1109/ARES.2007.109. Acceptance rate 28.3% (60/212).
  4. Kai Uhlemann, Christian Engelmann, and Stephen L. Scott. JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management. In Proceedings of the 8th IEEE International Conference on Cluster Computing (Cluster) 2006, pages 1-10, Barcelona, Spain, September 25-28, 2006. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 1-4244-0328-6. ISSN 1552-5244. DOI 10.1109/CLUSTR.2006.311855. Acceptance rate 33.1% (42/127).
  5. Jyothish Varma, Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems. In Proceedings of the 20th ACM International Conference on Supercomputing (ICS) 2006, pages 219-228, Cairns, Australia, June 28-30, 2006. ACM Press, New York, NY, USA. ISBN 1-59593-282-8. DOI 10.1145/1183401.1183433. Acceptance rate 26.2% (37/141).
  6. Daniel I. Okunbor, Christian Engelmann, and Stephen L. Scott. Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems. In Proceedings of the 2nd International Conference on Computer Science and Information Systems 2006, Athens, Greece, June 19-21, 2006.
  7. Kshitij Limaye, Chokchai (Box) Leangsuksun, Zeno Greenwood, Stephen L. Scott, Christian Engelmann, Richard M. Libby, and Kasidit Chanchio. Job-Site Level Fault Tolerance for Cluster and Grid Environments. In Proceedings of the 7th IEEE International Conference on Cluster Computing (Cluster) 2005, pages 1-9, Boston, MA, USA, September 26-30, 2005. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7803-9486-0. ISSN 1552-5244. DOI 10.1109/CLUSTR.2005.347043. Acceptance rate 39.6% (45/138).
  8. Hertong Song, Chokchai (Box) Leangsuksun, Raja Nassar, Yudan Liu, Christian Engelmann, and Stephen L. Scott. UML-based Beowulf Cluster Availability Modeling. In International Conference on Software Engineering Research and Practice (SERP) 2005, pages 161-167, Las Vegas, NV, USA, June 27-30, 2005. CSREA Press. ISBN 1-932415-49-1.

Peer-reviewed Workshop Publications

  1. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active High Availability for High-Performance Computing System Services: Accomplishments and Limitations. In Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid) 2008: Workshop on Resiliency in High Performance Computing (Resilience) 2008, pages 813-818, Lyon, France, May 19-22, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3156-4. DOI 10.1109/CCGRID.2008.78.
  2. Xin Chen, Benjamin Eckart, Xubin (Ben) He, Christian Engelmann, and Stephen L. Scott. An Online Controller Towards Self-Adaptive File System Availability and Performance. In Proceedings of the 5th High Availability and Performance Workshop (HAPCW) 2008, in conjunction with the 1st High-Performance Computer Science Week (HPCSW) 2008, Denver, CO, USA, April 3-4, 2008.
  3. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Transparent Symmetric Active/Active Replication for Service-Level High Availability. In Proceedings of the 7th IEEE International Symposium on Cluster Computing and the Grid (CCGrid) 2007: 7th International Workshop on Global and Peer-to-Peer Computing (GP2PC) 2007, pages 755-760, Rio de Janeiro, Brazil, May 14-17, 2007. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2833-3. DOI 10.1109/CCGRID.2007.116.
  4. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Towards High Availability for High-Performance Computing System Services: Accomplishments and Limitations. In Proceedings of the 4th High Availability and Performance Workshop (HAPCW) 2006, in conjunction with the 7th Los Alamos Computer Science Institute (LACSI) Symposium 2006, Santa Fe, NM, USA, October 17, 2006.
  5. Li Ou, Xin Chen, Xubin (Ben) He, Christian Engelmann, and Stephen L. Scott. Achieving Computational I/O Efficiency in a High Performance Cluster Using Multicore Processors. In Proceedings of the 4th High Availability and Performance Workshop (HAPCW) 2006, in conjunction with the 7th Los Alamos Computer Science Institute (LACSI) Symposium 2006, Santa Fe, NM, USA, October 17, 2006.
  6. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Active/Active Replication for Highly Available HPC System Services. In Proceedings of the 1st International Conference on Availability, Reliability and Security (ARES) 2006: 1st International Workshop on Frontiers in Availability, Reliability and Security (FARES) 2006, pages 639-645, Vienna, Austria, April 20-22, 2006. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2567-9. DOI 10.1109/ARES.2006.23.
  7. Christian Engelmann and Stephen L. Scott. Concepts for High Availability in Scientific High-End Computing. In Proceedings of the 3rd High Availability and Performance Workshop (HAPCW) 2005, in conjunction with the 6th Los Alamos Computer Science Institute (LACSI) Symposium 2005, Santa Fe, NM, USA, October 11, 2005.
  8. Christian Engelmann and Stephen L. Scott. High Availability for Ultra-Scale High-End Scientific Computing. In Proceedings of the 2nd International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2) 2005, in conjunction with the 19th ACM International Conference on Supercomputing (ICS) 2005, Cambridge, MA, USA, June 19, 2005.
  9. Chokchai (Box) Leangsuksun, Venkata K. Munganuru, Tong Liu, Stephen L. Scott, and Christian Engelmann. Asymmetric Active-Active High Availability for High-end Computing. In Proceedings of the 2nd International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2) 2005, in conjunction with the 19th ACM International Conference on Supercomputing (ICS) 2005, Cambridge, MA, USA, June 19, 2005.
  10. Xubin (Ben) He, Li Ou, Stephen L. Scott, and Christian Engelmann. A Highly Available Cluster Storage System using Scavenging. In Proceedings of the 2nd High Availability and Performance Workshop (HAPCW) 2004, in conjunction with the 5th Los Alamos Computer Science Institute (LACSI) Symposium 2004, Santa Fe, NM, USA, October 12, 2004.

Peer-reviewed Conference Posters

  1. Stephen L. Scott, Christian Engelmann, Hong H. Ong, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, Jyothish Varma, Xubin (Ben) He, Li Ou, and Xin Chen. Resiliency for High-Performance Computing Systems. Poster at the 1st High-Performance Computer Science Week (HPCSW) 2008, Denver, CO, USA, March 30 – April 5, 2008.

Talks and Lectures

  1. Christian Engelmann. System Resilience Research at ORNL in the Context of HPC. Invited talk at the Institut National de Recherche en Informatique et en Automatique (INRIA), Rennes, France, May 15, 2009.
  2. Christian Engelmann. High-Performance Computing Research at Oak Ridge National Laboratory. Invited talk at the Reading Annual Computational Science Workshop, Reading, United Kingdom, December 8, 2008.
  3. Christian Engelmann. Modular Redundancy in HPC Systems: Why, Where, When and How? Invited talk at the 1st HPC Resiliency Summit: Workshop on Resiliency for Petascale HPC 2008, in conjunction with the 1st Los Alamos Computer Science Symposium (LACSS) 2008, Santa Fe, NM, USA, October 15, 2008.
  4. Christian Engelmann. Resiliency for High-Performance Computing. Invited talk at the 2nd Collaborative and Grid Computing Technologies Workshop (CGCTW) 2008, Cancun, Mexico, April 10-12, 2008.
  5. Christian Engelmann. Advanced Fault Tolerance Solutions for High Performance Computing. Seminar at the Laboratoire d'Analyse et d'Architecture des Systèmes, Centre National de la Recherche Scientifique, Toulouse, France, February 11, 2008.
  6. Christian Engelmann. Service-Level High Availability in Parallel and Distributed Systems. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, October 10, 2007.
  7. Christian Engelmann. Advanced Fault Tolerance Solutions for High Performance Computing. Invited talk at the Workshop on Trends, Technologies and Collaborative Opportunities in High Performance and Grid Computing (WTTC) 2007, Khon Kaen, Thailand, June 8, 2007.
  8. Christian Engelmann. Advanced Fault Tolerance Solutions for High Performance Computing. Invited talk at the Workshop on Trends, Technologies and Collaborative Opportunities in High Performance and Grid Computing (WTTC) 2007, Bangkok, Thailand, June 4-5, 2007.
  9. Christian Engelmann. Towards High Availability for High-Performance Computing System Services: Accomplishments and Limitations. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, March 14, 2007.
  10. Christian Engelmann. High Availability for Ultra-Scale High-End Scientific Computing. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, June 9, 2006.
  11. Stephen L. Scott and Christian Engelmann. Advancing Reliability, Availability and Serviceability for High-Performance Computing. Seminar at the Institute of Graphics and Parallel Processing, Johannes Kepler University, Linz, Austria, April 19, 2006.
  12. Christian Engelmann. High Availability for Ultra-Scale High-End Scientific Computing. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, October 18, 2005.
  13. Christian Engelmann. High Availability for Ultra-Scale High-End Scientific Computing. Seminar at the Department of Mathematics and Computer Science, Fayetteville State University, Fayetteville, NC, USA, September 26, 2005.
  14. Christian Engelmann. High Availability for Ultra-Scale High-End Scientific Computing. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, May 13, 2005.
  15. Christian Engelmann. High Availability for Ultra-Scale High-End Scientific Computing. Seminar at the Center for Entrepreneurship and Information Technology, Louisiana Tech University, Ruston, LA, USA, April 15, 2005.

Co-advised Theses

  1. Matthias Weber. High Availability for the Lustre File System. Master’s thesis, Department of Computer Science, University of Reading, UK, March 14, 2007. Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the Department of Engineering I, Technical College for Engineering and Economics (FHTW) Berlin, Germany. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory).
  2. Kai Uhlemann. High Availability for High-End Scientific Computing. Master’s thesis, Department of Computer Science, University of Reading, UK, March 6, 2006. Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the Department of Engineering I, Technical College for Engineering and Economics (FHTW) Berlin, Germany. Advisors: Prof. Vassil N. Alexandrov (University of Reading); George A. (Al) Geist and Christian Engelmann (Oak Ridge National Laboratory).

Theses

  1. Christian Engelmann. Symmetric Active/Active High Availability for High-Performance Computing System Services. PhD thesis, Department of Computer Science, University of Reading, UK, December 8, 2008. Thesis research performed at Oak Ridge National Laboratory. Advisor: Prof. Vassil N. Alexandrov (University of Reading).