The next generation of computational science applications requires numerical solvers that are capable of high performance on proposed exascale platforms. To meet this goal, solvers must be resilient to soft and hard failures, provide high concurrency on heterogeneous hardware configurations, and retain numerical accuracy and efficiency. In light of these requirements, a natural avenue of inquiry would be to adapt the current stable of numerically efficient solvers to this new high-performance computing (HPC) regime. An alternative approach, however, is to investigate different classes of algorithms that address resiliency naturally.
This project investigates new stochastic methods for solving linear systems, termed Monte Carlo Resilient Exascale (MCREX) solvers. The family of methods builds on the sequential Monte Carlo work of Halton (1962). While showing significant promise, this class of solvers has not made inroads into the broader computational science community. Our initially developed methods use Monte Carlo to accelerate a fixed-point iteration and are therefore called Monte Carlo Synthetic Acceleration (MCSA). Preliminary work has demonstrated that MCSA is at least as efficient as Jacobi-preconditioned Conjugate Gradient (PCG) on sparse, symmetric positive definite (SPD) systems. Because MCSA requires neither symmetry nor positive definiteness, these initial results suggest that very good efficiency could also be attained on non-symmetric systems, which would make MCSA well suited as a solver in non-linear Newton schemes. Furthermore, Monte Carlo methods address resilience in a natural way: soft errors can be treated as high-variance samples, and histories lost to processor failures can simply be discarded without affecting the quality of the solution.
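To make the idea of accelerating a fixed-point iteration with Monte Carlo concrete, the following is a minimal sketch and not the project's implementation. It assumes a Jacobi-scaled Richardson iteration as the fixed-point scheme and a forward random-walk estimator of the Neumann series for the correction. The function names `mc_neumann_solve` and `mcsa`, the dense NumPy storage, the fixed walk length `max_steps`, and all default parameters are hypothetical choices made for illustration.

```python
import numpy as np


def mc_neumann_solve(H, r, n_histories, max_steps, rng):
    """Estimate delta = sum_m H^m r, i.e. (I - H)^{-1} r, by forward random walks."""
    n = len(r)
    abs_H = np.abs(H)
    row_sums = abs_H.sum(axis=1)
    delta = np.zeros(n)
    for i in range(n):
        total = 0.0
        for _history in range(n_histories):
            state, weight = i, 1.0
            tally = r[state]  # m = 0 term of the Neumann series
            for _step in range(max_steps):
                if row_sums[state] == 0.0:
                    break  # empty row: the walk cannot contribute further
                # Jump to column j with probability |H_sj| / sum_j |H_sj|.
                nxt = rng.choice(n, p=abs_H[state] / row_sums[state])
                # Weight update H_sj / P_sj reduces to sign(H_sj) * row sum.
                weight *= np.sign(H[state, nxt]) * row_sums[state]
                state = nxt
                tally += weight * r[state]
            total += tally
        # Sample mean over histories; a lost history would only lower the count.
        delta[i] = total / n_histories
    return delta


def mcsa(A, b, tol=1e-8, max_iters=100, n_histories=30, max_steps=15, seed=0):
    """Sketch of MCSA: Richardson sweep plus a Monte Carlo residual correction."""
    rng = np.random.default_rng(seed)
    d = np.diag(A)
    A_s = A / d[:, None]  # assumed Jacobi scaling, D^{-1} A
    b_s = b / d
    H = np.eye(len(b)) - A_s  # iteration matrix of x <- H x + b_s
    x = np.zeros_like(b_s)
    for k in range(max_iters):
        x = H @ x + b_s  # fixed-point (Richardson) sweep
        r = b_s - A_s @ x  # residual of the scaled system
        if np.linalg.norm(r) <= tol * np.linalg.norm(b_s):
            return x, k + 1
        x = x + mc_neumann_solve(H, r, n_histories, max_steps, rng)
    return x, max_iters


if __name__ == "__main__":
    n = 8
    # Diagonally dominant, non-singular test matrix (hypothetical example).
    A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    x, iters = mcsa(A, b)
    print(iters, np.linalg.norm(A @ x - b))
```

The estimator never uses symmetry or positive definiteness of `A`; it only needs the Neumann series of `H` to converge, which holds in this example because the scaled matrix is strictly diagonally dominant. Since each correction is a sample mean, the stochastic error is proportional to the current residual and shrinks along with it from one iteration to the next.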
The developed MCREX solver is evaluated using the Extreme-scale Simulator (xSim). xSim is a performance/resilience investigation toolkit that permits running native HPC applications or proxy applications in a controlled environment with millions of concurrent execution threads, while observing application performance and resilience in a simulated extreme-scale system for hardware/software co-design.
- Resilient Extreme-Scale Solvers (RX-Solvers) Program, Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy
Peer-reviewed Journal Publications
- Christian Engelmann and Thomas Naughton. A New Deadlock Resolution Protocol and Message Matching Algorithm for the Extreme-scale Simulator. Concurrency and Computation: Practice and Experience, volume 28, number 12, pages 3369-3389, August 1, 2016. John Wiley & Sons, Inc. ISSN 1532-0634. DOI 10.1002/cpe.3805.
Peer-reviewed Conference Publications
- Christian Engelmann and Thomas Naughton. Supporting the Development of Soft-Error Resilient Message Passing Applications using Simulation. In Proceedings of the 13th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2016, Innsbruck, Austria, February 15-16, 2016. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-979-0. DOI 10.2316/P.2016.834-005.
- Christian Engelmann and Thomas Naughton. A Network Contention Model for the Extreme-scale Simulator. In Proceedings of the 34th IASTED International Conference on Modelling, Identification and Control (MIC) 2015, Innsbruck, Austria, February 17-18, 2015. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-975-2. DOI 10.2316/P.2015.826-043.
- Christian Engelmann and Thomas Naughton. Improving the Performance of the Extreme-scale Simulator. In Proceedings of the 18th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2014, pages 198-207, Toulouse, France, October 1-3, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4799-6143-6. ISSN 1550-6525. DOI 10.1109/DS-RT.2014.32. Best paper candidate.
- Thomas Naughton, Christian Engelmann, Geoffroy Vallée, and Swen Böhm. Supporting the Development of Resilient Message Passing Applications using Simulation. In Proceedings of the 22nd Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2014, pages 271-278, Turin, Italy, February 12-14, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1066-6192. DOI 10.1109/PDP.2014.74. Acceptance rate 32.6% (73/224).