xSim – The Extreme-scale Simulator
The path to exascale high-performance computing (HPC) poses several challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Investigating the performance and resilience of parallel applications at scale on future architectures, and the impact of different architecture choices on both, is an important component of HPC hardware/software co-design. Without access to future architectures at scale, simulation offers an alternative for estimating parallel application performance and resilience on candidate architectures. Because highly accurate simulations are extremely slow and scale poorly, different approaches trade off simulation accuracy to gain simulation performance and scalability.
The Extreme-scale Simulator (xSim) is a performance investigation toolkit that permits running native HPC applications or proxy applications in a controlled environment with millions of concurrent execution threads, while observing application performance and resilience in a simulated extreme-scale system for hardware/software co-design. Using a lightweight parallel discrete event simulation (PDES), xSim executes a Message Passing Interface (MPI) application on a much smaller system in a highly oversubscribed fashion with a virtual wall clock time. Performance data can then be extracted based on a processor model and a network model with an appropriate simulation scalability/accuracy trade-off. xSim is designed like a traditional performance tool: it is an interposition library that sits between the MPI application and the MPI library, using the MPI profiling interface. It has been run with up to 134,217,728 (2^27) communicating MPI ranks on a 960-core Linux cluster.
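The idea of a virtual wall clock driven by a processor model and a network model can be illustrated with a small toy program. This is only a sketch under stated assumptions, not xSim code: the class names, the `slowdown` scaling factor, the latency-plus-bandwidth network cost, and the ring exchange are all hypothetical choices made for illustration.

```python
# Toy virtual-time model; all names and parameters are illustrative, not xSim's API.
import heapq
from dataclasses import dataclass


@dataclass
class ProcessorModel:
    slowdown: float = 1.0  # assumed ratio: simulated core time / host core time

    def compute_time(self, native_seconds: float) -> float:
        return native_seconds * self.slowdown


@dataclass
class NetworkModel:
    latency_s: float = 1e-6     # assumed per-message latency in seconds
    bandwidth_bps: float = 1e9  # assumed bandwidth in bytes per second

    def transfer_time(self, nbytes: int) -> float:
        return self.latency_s + nbytes / self.bandwidth_bps


def ranks_per_core(total_ranks: int, host_cores: int) -> int:
    """Largest number of simulated ranks that must share one host core."""
    return -(-total_ranks // host_cores)  # ceiling division


def simulate_ring(num_ranks, nbytes, compute_s, cpu, net):
    """Pass one token once around a ring of simulated ranks.

    Returns the virtual time at which the token arrives back at rank 0.
    Each rank computes for `compute_s` native seconds, then forwards the token.
    """
    clocks = [0.0] * num_ranks  # one virtual clock per simulated rank
    events = [(0.0, 0)]         # (virtual arrival time, destination rank)
    hops = 0
    while events:
        arrival, rank = heapq.heappop(events)  # earliest event first
        # A receive completes at the later of the local clock and the arrival.
        clocks[rank] = max(clocks[rank], arrival)
        if hops == num_ranks:
            return clocks[rank]
        clocks[rank] += cpu.compute_time(compute_s)
        send_done = clocks[rank] + net.transfer_time(nbytes)
        heapq.heappush(events, (send_done, (rank + 1) % num_ranks))
        hops += 1
    return clocks[0]


if __name__ == "__main__":
    print(ranks_per_core(2**27, 960))
    print(simulate_ring(4, 1000, 1e-3, ProcessorModel(), NetworkModel()))
```

With the figures quoted above, `ranks_per_core(2**27, 960)` evaluates to 139811, which gives a sense of how heavily the host cores are oversubscribed. The reported time depends only on the models, not on how fast the host executes the loop, which is the point of keeping a virtual clock.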
xSim also permits the injection of MPI process failures, the propagation, detection, and notification of such failures within the simulation, and their handling within the simulation using application-level checkpoint/restart. These capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique. Another feature provides user-level failure mitigation (ULFM) extensions at the simulated MPI layer to support algorithm-based fault tolerance (ABFT). This permits investigating the performance under failure, and the failure handling, of ABFT solutions using the fault-tolerant MPI extensions proposed by the MPI Fault Tolerance Working Group. xSim is the first performance tool that supports ULFM and ABFT.
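The effect of an injected failure on an application that uses checkpoint/restart can be sketched in virtual time as follows. This is a simplified, hypothetical model rather than xSim's mechanism: the function name, its parameters, and the rule that a failure landing inside a checkpoint or restart window is deferred to the start of the next iteration are all assumptions made to keep the example short.

```python
# Toy model of checkpoint/restart under injected failures (illustrative only).

def run_with_checkpoint_restart(iterations, iter_s, ckpt_every, ckpt_s,
                                restart_s, failure_times):
    """Return (virtual completion time, completed iterations that were redone).

    iterations    -- number of iterations the application must finish
    iter_s        -- virtual seconds per iteration
    ckpt_every    -- write a checkpoint after this many iterations
    ckpt_s        -- virtual seconds to write one checkpoint
    restart_s     -- virtual seconds to restart after a failure
    failure_times -- virtual times at which a process failure is injected
    """
    clock = 0.0
    done = 0        # iterations completed so far
    saved = 0       # iterations covered by the last checkpoint
    redone = 0      # completed iterations lost to failures
    pending = sorted(failure_times)

    while done < iterations:
        step_end = clock + iter_s
        if pending and pending[0] < step_end:
            # Failure strikes before this iteration finishes: roll back.
            # Assumption: failures inside a checkpoint or restart window
            # are deferred to the start of the next iteration.
            fail_at = max(pending.pop(0), clock)
            redone += done - saved
            done = saved
            clock = fail_at + restart_s
            continue
        clock = step_end
        done += 1
        # Checkpoint periodically, but not after the final iteration.
        if done % ckpt_every == 0 and done < iterations:
            clock += ckpt_s
            saved = done
    return clock, redone


if __name__ == "__main__":
    print(run_with_checkpoint_restart(10, 1.0, 5, 0.5, 2.0, []))
    print(run_with_checkpoint_restart(10, 1.0, 5, 0.5, 2.0, [7.2]))
```

Comparing a failure-free run with a run that includes one injected failure shows the cost of the lost work plus the restart, which is the kind of quantity a simulated experiment of this type is meant to expose.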
xSim is currently being extended and used to study the effects of soft errors on a Monte Carlo solver developed by the MCREX project.
Figure: Scaling a Monte Carlo code to 2^24 MPI ranks
- Christian Engelmann and Thomas Naughton. Improving the Performance of the Extreme-scale Simulator. In Proceedings of the 18th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2014, pages 198-207, Toulouse, France, October 1-3, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4799-6143-6. ISSN 1550-6525.
- Thomas Naughton, Christian Engelmann, Geoffroy Vallée, and Swen Böhm. Supporting the Development of Resilient Message Passing Applications using Simulation. In Proceedings of the 22nd Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2014, pages 271-278, Turin, Italy, February 12-14, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1066-6192.
- Christian Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. Future Generation Computer Systems (FGCS), volume 30, number 0, pages 59-65, 2014. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0167-739X.
- Christian Engelmann and Thomas Naughton. Toward a Performance/Resilience Tool for Hardware/Software Co-Design of High-Performance Computing Systems. In Proceedings of the 42nd International Conference on Parallel Processing (ICPP) 2013: 4th International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), pages 962-971, Lyon, France, October 2, 2013. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-5117-3. ISSN 0190-3918.
- Christian Engelmann. Investigating Operating System Noise in Extreme-Scale High-Performance Computing Systems using Simulation. In Proceedings of the 11th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2013, Innsbruck, Austria, February 11-13, 2013. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-943-1.
- Ian S. Jones and Christian Engelmann. Simulation of Large-Scale HPC Architectures. In Proceedings of the 40th International Conference on Parallel Processing (ICPP) 2011: 2nd International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), pages 447-456, Taipei, Taiwan, September 13-19, 2011. IEEE Computer Society, Los Alamitos, CA, USA.
- Swen Böhm and Christian Engelmann. xSim: The Extreme-Scale Simulator. In Proceedings of the International Conference on High Performance Computing and Simulation (HPCS) 2011, pages 280-286, Istanbul, Turkey, July 4-8, 2011. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-61284-383-4. Acceptance rate 28.1% (48/171).
- Christian Engelmann and Frank Lauer. Facilitating Co-Design for Extreme-Scale Systems Through Lightweight Simulation. In Proceedings of the 12th IEEE International Conference on Cluster Computing (Cluster) 2010: 1st Workshop on Application/Architecture Co-design for Extreme-scale Computing (AACEC), pages 1-8, Hersonissos, Crete, Greece, September 20-24, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-8395-2.