xSim – The Extreme-scale Simulator
The path to exascale high-performance computing (HPC) poses several challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Investigating the performance and resilience of parallel applications at scale on future architectures, and the impact of different architecture choices on both, is an important component of HPC hardware/software co-design. Without access to future architectures at scale, simulation provides an alternative for estimating parallel application performance and resilience on potential architecture choices. As highly accurate simulations are extremely slow and scale poorly, different solution paths exist that trade off simulation accuracy to gain simulation performance and scalability.
The Extreme-scale Simulator (xSim) is a performance investigation toolkit that permits running native HPC applications or proxy applications in a controlled environment with millions of concurrent execution threads, while observing application performance and resilience in a simulated extreme-scale system for hardware/software co-design. Using a lightweight parallel discrete event simulation (PDES), xSim executes a Message Passing Interface (MPI) application on a much smaller system in a highly oversubscribed fashion with a virtual wall clock time, such that performance data can be extracted based on a processor and a network model with an appropriate simulation scalability/accuracy trade-off. xSim is designed like a traditional performance tool, as an interposition library that sits between the MPI application and the MPI library, using the MPI profiling interface. It has been run with up to 134,217,728 (2^27) communicating MPI ranks on a 960-core Linux cluster.
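The virtual wall clock idea can be illustrated with a minimal sketch. This is not xSim's implementation (xSim intercepts MPI calls of a native application); it is a toy model in Python in which every name (`NetworkModel`, `VirtualRank`, `send_recv`, `ring_exchange`) and every parameter value is a hypothetical chosen for illustration. Each simulated rank keeps its own virtual clock, a processor model scales natively measured compute time, and a network model determines when a message arrives at the receiver.

```python
from dataclasses import dataclass


@dataclass
class NetworkModel:
    """Assumed linear network model: latency plus size over bandwidth."""
    latency_s: float       # per-message latency in seconds (assumed value)
    bandwidth_Bps: float   # link bandwidth in bytes per second (assumed value)

    def transfer_time(self, nbytes: int) -> float:
        return self.latency_s + nbytes / self.bandwidth_Bps


class VirtualRank:
    """One simulated MPI rank with its own virtual wall clock."""

    def __init__(self, rank: int):
        self.rank = rank
        self.clock = 0.0

    def compute(self, native_seconds: float, cpu_scale: float) -> None:
        # Processor model: scale the natively measured time to the
        # simulated processor (cpu_scale < 1 means a faster simulated CPU).
        self.clock += native_seconds * cpu_scale


def send_recv(src: VirtualRank, dst: VirtualRank, nbytes: int,
              net: NetworkModel) -> None:
    # The message leaves at the sender's virtual time; the receiver cannot
    # continue before it arrives, so its clock moves to the later time.
    arrival = src.clock + net.transfer_time(nbytes)
    dst.clock = max(dst.clock, arrival)


def ring_exchange(num_ranks: int, nbytes: int, net: NetworkModel) -> float:
    """Pass a token once around a ring; return rank 0's virtual time."""
    ranks = [VirtualRank(r) for r in range(num_ranks)]
    for r in ranks:
        r.compute(native_seconds=1e-3, cpu_scale=0.5)
    for i in range(num_ranks):
        send_recv(ranks[i], ranks[(i + 1) % num_ranks], nbytes, net)
    return ranks[0].clock


if __name__ == "__main__":
    net = NetworkModel(latency_s=1e-6, bandwidth_Bps=1e9)
    print(ring_exchange(4, 1000, net))
```

Because the clocks are virtual, the number of simulated ranks is decoupled from the number of physical cores; the reported virtual time depends only on the processor and network models, not on how heavily the host system is oversubscribed.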
xSim also permits the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling within the simulation using application-level checkpoint/restart. These capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique. Another feature provides user-level failure mitigation (ULFM) extensions at the simulated MPI layer to support algorithm-based fault tolerance (ABFT). This permits investigating performance under failure and failure handling of ABFT solutions using the fault-tolerant MPI extensions proposed by the MPI Fault Tolerance Working Group. xSim is the very first performance tool that supports ULFM and ABFT.
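As a rough illustration of what observing performance under failure means in virtual time, the following sketch models an iterative application with periodic checkpoints and a single injected process failure. It is a toy model under stated assumptions, not xSim's mechanism: the function `run_with_failure` and all of its parameters are hypothetical, failure detection is treated as instantaneous, and at most one failure is injected.

```python
def run_with_failure(iterations, iter_time, ckpt_interval, ckpt_time,
                     failure_time=None, restart_time=0.0):
    """Return the virtual completion time of a checkpointed iterative run.

    A checkpoint is written after every `ckpt_interval` iterations (except
    after the final one). If `failure_time` is given, the process fails at
    that virtual time, pays `restart_time`, and resumes from the last
    completed checkpoint.
    """
    clock = 0.0        # virtual wall clock
    done = 0           # iterations completed so far
    last_ckpt = 0      # iteration count stored in the latest checkpoint
    pending = failure_time

    while done < iterations:
        # Next unit of work: one iteration, plus a checkpoint if one is due.
        writes_ckpt = ((done + 1) % ckpt_interval == 0
                       and done + 1 < iterations)
        cost = iter_time + (ckpt_time if writes_ckpt else 0.0)

        if pending is not None and clock + cost > pending:
            # The failure activates during this unit: the partial work is
            # lost and the run rolls back to the last checkpoint.
            clock = pending + restart_time
            done = last_ckpt
            pending = None
            continue

        clock += cost
        done += 1
        if writes_ckpt:
            last_ckpt = done

    return clock


if __name__ == "__main__":
    base = run_with_failure(10, 1.0, 5, 0.5)
    failed = run_with_failure(10, 1.0, 5, 0.5,
                              failure_time=7.2, restart_time=2.0)
    print(base, failed)
```

Comparing the two virtual completion times shows the cost of the failure: the lost work since the last checkpoint plus the restart overhead.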
xSim also permits the injection of bit flip faults at specific injection location(s) and fault activation time(s), while supporting a significant degree of configurability of the fault type. As radiation-induced bit flip faults are of particular concern in extreme-scale high-performance computing systems, xSim enables the development of soft-error resilient message passing applications by permitting the investigation of their correctness and performance under various fault conditions using this bit flip fault injection feature. xSim is the very first simulation-based MPI performance tool that supports the injection of both process failures and bit flip faults.
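To make the effect of a single bit flip concrete, the sketch below flips one chosen bit in the IEEE 754 binary64 representation of a value. It is only an illustration of the fault type, not xSim's injection interface; the helper names `flip_bit` and `inject_into_double` are hypothetical.

```python
import struct


def flip_bit(buffer: bytearray, byte_offset: int, bit: int) -> None:
    """Flip one bit (0 = least significant) of one byte, in place."""
    if not 0 <= bit < 8:
        raise ValueError("bit must be in the range 0..7")
    buffer[byte_offset] ^= 1 << bit


def inject_into_double(value: float, bit_index: int) -> float:
    """Flip bit `bit_index` (0 = least significant, 63 = sign bit) of the
    binary64 representation of `value` and return the corrupted value."""
    if not 0 <= bit_index < 64:
        raise ValueError("bit_index must be in the range 0..63")
    # Little-endian packing puts the least significant byte first, so
    # bit_index // 8 selects the byte and bit_index % 8 the bit within it.
    raw = bytearray(struct.pack("<d", value))
    flip_bit(raw, bit_index // 8, bit_index % 8)
    return struct.unpack("<d", bytes(raw))[0]


if __name__ == "__main__":
    for bit in (0, 52, 63):
        print(bit, inject_into_double(1.0, bit))
```

The impact depends strongly on the injection location: a flip in the low mantissa bits changes the value only marginally, while a flip in the exponent or sign bit changes its magnitude or sign, which is why the location and activation time of a fault matter when studying application correctness.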
Scaling a Monte Carlo code to 2^24 MPI ranks
- Mahesh Lagadapati, Frank Mueller, and Christian Engelmann. Benchmark Generation and Simulation at Extreme Scale. In Proceedings of the 20th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2016, London, UK, September 21-23, 2016. IEEE Computer Society, Los Alamitos, CA, USA. Acceptance rate 42.0% (21/50). Best paper candidate.
- Christian Engelmann and Thomas Naughton. A New Deadlock Resolution Protocol and Message Matching Algorithm for the Extreme-scale Simulator. Concurrency and Computation: Practice and Experience, volume 28, number 12, pages 3369-3389, 2016. John Wiley & Sons, Inc. ISSN 1532-0634.
- Christian Engelmann and Thomas Naughton. Supporting the Development of Soft-Error Resilient Message Passing Applications using Simulation. In Proceedings of the 13th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2016, Innsbruck, Austria, February 15-16, 2016. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-979-0.
- Amogh Katti, Giuseppe Di Fatta, Thomas Naughton, and Christian Engelmann. Scalable and Fault Tolerant Failure Detection and Consensus. In Proceedings of the 22nd European MPI Users' Group Meeting (EuroMPI) 2015, pages 13:1-13:9, Bordeaux, France, September 21-24, 2015. ACM Press, New York, NY, USA. ISBN 978-1-4503-3795-3.
- Christian Engelmann and Thomas Naughton. A Network Contention Model for the Extreme-scale Simulator. In Proceedings of the 34th IASTED International Conference on Modelling, Identification and Control (MIC) 2015, Innsbruck, Austria, February 17-18, 2015. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-975-2.
- Christian Engelmann and Thomas Naughton. Improving the Performance of the Extreme-scale Simulator. In Proceedings of the 18th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2014, pages 198-207, Toulouse, France, October 1-3, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4799-6143-6. ISSN 1550-6525. Best paper candidate.
- Thomas Naughton, Christian Engelmann, Geoffroy Vallée, and Swen Böhm. Supporting the Development of Resilient Message Passing Applications using Simulation. In Proceedings of the 22nd Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2014, pages 271-278, Turin, Italy, February 12-14, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1066-6192.
- Christian Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. Future Generation Computer Systems (FGCS), volume 30, number 0, pages 59-65, 2014. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0167-739X.
- Christian Engelmann and Thomas Naughton. Toward a Performance/Resilience Tool for Hardware/Software Co-Design of High-Performance Computing Systems. In Proceedings of the 42nd International Conference on Parallel Processing (ICPP) 2013: 4th International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), pages 962-971, Lyon, France, October 2, 2013. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-5117-3. ISSN 0190-3918.
- Christian Engelmann. Investigating Operating System Noise in Extreme-Scale High-Performance Computing Systems using Simulation. In Proceedings of the 11th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2013, Innsbruck, Austria, February 11-13, 2013. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-943-1.
- Ian S. Jones and Christian Engelmann. Simulation of Large-Scale HPC Architectures. In Proceedings of the 40th International Conference on Parallel Processing (ICPP) 2011: 2nd International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), pages 447-456, Taipei, Taiwan, September 13-19, 2011. IEEE Computer Society, Los Alamitos, CA, USA.
- Swen Böhm and Christian Engelmann. xSim: The Extreme-Scale Simulator. In Proceedings of the International Conference on High Performance Computing and Simulation (HPCS) 2011, pages 280-286, Istanbul, Turkey, July 4-8, 2011. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-61284-383-4. Acceptance rate 28.1% (48/171).
- Christian Engelmann and Frank Lauer. Facilitating Co-Design for Extreme-Scale Systems Through Lightweight Simulation. In Proceedings of the 12th IEEE International Conference on Cluster Computing (Cluster) 2010: 1st Workshop on Application/Architecture Co-design for Extreme-scale Computing (AACEC), pages 1-8, Hersonissos, Crete, Greece, September 20-24, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-8395-2.