xSim – The Extreme-scale Simulator
The path to exascale high-performance computing (HPC) poses several challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Investigating the performance of parallel applications at scale on future architectures and the performance impact of different architecture choices is an important component of HPC hardware/software co-design. Without having access to future architectures at scale, simulation approaches provide an alternative for estimating parallel application performance on potential architecture choices. As highly accurate simulations are extremely slow and less scalable, different solution paths exist to trade-off simulation accuracy in order to gain simulation performance and scalability.
The Extreme-scale Simulator (xSim) is a performance investigation toolkit that permits running native HPC applications or proxy applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. Using a lightweight parallel discrete event simulation (PDES), xSim executes a Message Passing Interface (MPI) application on a much smaller system in a highly oversubscribed fashion with a virtual wall clock time, such that performance data can be extracted based on a processor and a network model with an appropriate simulation scalability/accuracy trade-off. xSim is designed like a traditional performance tool, as an interposition library that sits between the MPI application and the MPI library, using the MPI profiling interface. It has been run up to 134,217,728 (2^27) communicating MPI ranks using a 960-core Linux cluster.
Scaling a Monte Carlo application to 2^24 MPI ranks
Resilience, i.e., providing efficiency and correctness in the presence of faults, is, next to performance and power consumption, one of the most important exascale computer science challenges as systems scale up in component count (up to 1,000,000 nodes, each with up to 10,000 cores by 2020) and component reliability decreases (7 nm technology with near-threshold voltage operation by 2020). Several HPC resilience technologies have been developed, such as checkpoint/restart, message logging, algorithm-based fault tolerance, containment domains, and redundancy. However, there are currently no tools, methods, and metrics to compare them and to identify the cost/benefit trade-off between the key system design factors: performance, resilience, and power consumption.
Ongoing work focuses on adding features to xSim to (1) permit the injection of different faults, errors, and failures into the simulation, (2) model various propagation, isolation, and detection properties of the simulated system, (3) support a variety of avoidance, masking, and recovery strategies, (4) model the power consumption of the entire simulated system, and (5) study the performance, resilience, and power consumption impact with different parameter sets for (1), (2), (3), and (4) using standardized metrics.
Newly developed features for xSix permit the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling using application-level checkpoint/restart. These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique. Another newly added feature offers user-level failure mitigation (ULFM) extensions at the simulated MPI layer to support algorithm-based fault tolerance (ABFT). The solution permits investigating performance under failure and failure handling of ABFT solutions using the fault-tolerant MPI extensions proposed by the MPI Fault Tolerance Working Group. The newly enhanced xSim is the very first performance tool that supports ULFM and ABFT.
Symbols: Abstract, Publication, Presentation, BibTeX Citation, DOI Link
- Thomas Naughton, Christian Engelmann, Geoffroy Vallée, and Swen Böhm. Supporting the Development of Resilient Message Passing Applications using Simulation. In Proceedings of the 22nd Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2014, Turin, Italy, February 12-14, 2014. IEEE Computer Society, Los Alamitos, CA, USA. To appear.
- Christian Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. Future Generation Computer Systems (FGCS), volume 30, number 0, pages 59-65, 2014. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0167-739X.
- Christian Engelmann and Thomas Naughton. Toward a Performance/Resilience Tool for Hardware/Software Co-Design of High-Performance Computing Systems. In Proceedings of the 42nd International Conference on Parallel Processing (ICPP) 2013: 4th International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), pages 962-971, Lyon, France, October 2, 2013. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-5117-3. ISSN 0190-3918.
- Christian Engelmann. Investigating Operating System Noise in Extreme-Scale High-Performance Computing Systems using Simulation. In Proceedings of the 11th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2013, Innsbruck, Austria, February 11-13, 2013. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-943-1.
- Ian S. Jones and Christian Engelmann. Simulation of Large-Scale HPC Architectures. In Proceedings of the 40th International Conference on Parallel Processing (ICPP) 2011: 2nd International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), pages 447-456, Taipei, Taiwan, September 13-19, 2011. IEEE Computer Society, Los Alamitos, CA, USA.
- Swen Böhm and Christian Engelmann. xSim: The Extreme-Scale Simulator. In Proceedings of the International Conference on High Performance Computing and Simulation (HPCS) 2011, pages 280-286, Istanbul, Turkey, July 4-8, 2011. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-61284-383-4. Acceptance rate 28.1% (48/171).
- Christian Engelmann and Frank Lauer. Facilitating Co-Design for Extreme-Scale Systems Through Lightweight Simulation. In Proceedings of the 12th IEEE International Conference on Cluster Computing (Cluster) 2010: 1st Workshop on Application/Architecture Co-design for Extreme-scale Computing (AACEC), pages 1-8, Hersonissos, Crete, Greece, September 20-24, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-8395-2.