2004-06: Reliability, Availability, and Serviceability (RAS) for Terascale Computing

January 4th, 2019

The goal of this work was to produce a proof-of-concept solution that removes the numerous single points of failure in large systems while improving scalability and access to systems and data. Our research effort focused on efficient redundancy strategies for head and service nodes and on a distributed storage infrastructure. We developed replication mechanisms that provide symmetric active/active high availability for services running on head and service nodes, offering the highest level of availability without significantly impacting performance. The implemented prototypes for the batch job management system, TORQUE, and for the Parallel Virtual File System (PVFS) metadata server offer 99.9997% service uptime using just 3 redundant nodes. For distributed data storage, the developed FreeLoader solution is built on a storage substrate of contributed desktop workstations. We developed parallel I/O mechanisms to store data on and retrieve data from networked workstations, as well as caching mechanisms that keep recently used datasets. FreeLoader can offer high retrieval rates for large datasets using novel striping strategies. It may also be used as a virtual cache that stores only the prefixes of datasets yet delivers each dataset in full by masking the patching of the suffix.
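The relationship between node count and service uptime can be illustrated with a standard parallel-redundancy model. The sketch below is not the project's own calculation. It assumes that node failures are independent and that the service stays up as long as at least one replica is up. The per-node MTTF of 5,000 hours and MTTR of 72 hours are illustrative assumptions that this summary does not state, and all function names are hypothetical. Under those assumed values, the model yields about 99.9997% for three nodes.

```python
# Parallel-redundancy availability sketch.
# Assumptions (not from the summary): independent node failures,
# service is up while at least one replica is up,
# per-node MTTF = 5000 h and MTTR = 72 h (illustrative values only).

HOURS_PER_YEAR = 8760.0
MTTF_HOURS = 5000.0  # assumed mean time to failure of one node
MTTR_HOURS = 72.0    # assumed mean time to repair of one node


def node_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single node."""
    return mttf_hours / (mttf_hours + mttr_hours)


def service_availability(node_avail: float, n_nodes: int) -> float:
    """Availability of a service replicated on n_nodes nodes.

    The service is down only when every replica is down at once.
    """
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return 1.0 - (1.0 - node_avail) ** n_nodes


def annual_downtime_hours(avail: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = node_availability(MTTF_HOURS, MTTR_HOURS)
    for n in range(1, 5):
        avail = service_availability(single, n)
        print(f"{n} node(s): {100.0 * avail:.6f}% uptime, "
              f"{annual_downtime_hours(avail):.4f} h downtime/year")
```

Each added replica multiplies the probability of a total outage by the single-node unavailability, which is why a small number of nodes is enough to reach this level in the model. Correlated failures and failover overhead are not captured by it.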


Participating Institutions

Funding Sources

Important Publications

