Dr. Christian Engelmann is the Task Lead of the System Software Team in the Computer Science Research Group of the Computer Science and Mathematics Division at Oak Ridge National Laboratory, which is the U.S. Department of Energy’s largest multiprogram science and technology laboratory with an annual budget of $1.6 billion. He has 12 years of experience in software research and development (R&D) for extreme-scale high-performance computing (HPC) systems with a strong research funding and publication record. In collaboration with other laboratories and universities, his research aims at solving computer science challenges in HPC software, such as scalability, dependability, energy efficiency, and portability, for the largest current and future supercomputers in the world. Dr. Engelmann’s primary expertise is in HPC resilience, i.e., providing efficiency and correctness in the presence of faults, errors, and failures through avoidance, masking, and recovery. As chair and member of several scientific committees and panels, including the U.S. DOE Technical Council on Resilience, he is a leading expert in the HPC resilience community. The term HPC resilience was coined in a co-authored whitepaper in 2009. Dr. Engelmann’s secondary expertise is in HPC hardware/software co-design through lightweight simulation of future-generation extreme-scale systems with up to 134,217,728 (2^27) processor cores, studying the impact of hardware, system software, and parallel application properties on the key HPC system design factors: performance, resilience, and power consumption. His skills further include leading R&D teams, co-advising students, programming in C/C++, MPI, Fortran, and Java, and system administration.
e-Mail: firstname.lastname@example.org / email@example.com
Mail: P.O. Box 2008, Oak Ridge, TN 37831-6016, USA
Phone: +1 (865) 574-3132
Fax: +1 (865) 576-5491
| 10 Research grants ($23.5M, 2 as lead-PI) | 8 Peer-reviewed journal articles | 38 Invited talks and seminars |
| 8 Co-advised Master's theses | 36 Peer-reviewed conference papers | 94 Committees at 37 conference series |
| 3 Mentored summer faculty | 30 Peer-reviewed workshop papers | 32 Article and book proposal reviews |
| 10 Direct reports over the past 7 years | 8 Peer-reviewed conference posters | 11 Conference booth exhibitions |
| Erdős number of 5 | 910+ Total publication citations | H-index of 16 / G-index of 27 |
Ongoing Research Activities
2013-…: Hobbes – Operating system and runtime (OS/R) environment for extreme-scale scientific computing based on the Kitten OS and Palacios virtual machine monitor, including high-value, high-risk research exploring virtualization, analytics, networking, energy/power, scheduling/parallelism, architecture, resilience, programming models, and tools.
2013-…: Resilient Monte Carlo solvers with natural fault tolerance to hard and soft failures for exascale high-performance computing (HPC) systems.
2012-…: HPC resilience co-design toolkit evaluating the resilience/power/performance cost/benefit trade-off of resilience solutions, identifying hardware/software resilience properties, and coordinating interfaces/responsibilities of individual hardware/software components.
2014-02-13: Presentation of the co-authored research paper, Supporting the Development of Resilient Message Passing Applications using Simulation, at the 22nd Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2014, Turin, Italy.
2013-12-16: General co-chair of the 13th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP) 2013, Vietri sul Mare, Italy.
2013-12-04: Presentation by Geoffroy Vallée of the co-authored research paper, A Runtime Environment for Supporting Research in Resilient HPC System Software & Tools, at the 1st International Symposium on Computing and Networking – Across Practical Development and Theoretical Research – (CANDAR) 2013, Matsuyama, Japan.
2013-12-03: Session chair at the 19th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) 2013, Vancouver, BC, Canada.
2013-11-18: With 261,632 NVIDIA K20x accelerator cores and 298,592 AMD Opteron cores, ORNL’s Titan Cray XK7 supercomputer continues to be ranked 2nd in the Top 500 List of supercomputers with a LINPACK performance of 17.95 PFlops and is now 35th in the Green 500 List of energy-efficient supercomputers.
2013-11-18: Technical program chair at the 4th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), held in conjunction with the 26th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC), Denver, CO, USA.
2013-10-02: Presentation of the co-authored research paper, Toward a Performance/Resilience Tool for Hardware/Software Co-Design of High-Performance Computing Systems, at the 4th International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), held in conjunction with the 42nd International Conference on Parallel Processing (ICPP) 2013, Lyon, France.
2013-09-19: Presentation of the co-authored whitepaper, A Hardware/Software Performance/Resilience/Power Co-Design Tool for Extreme-scale Computing, at the Workshop on Modeling & Simulation of Exascale Systems & Applications (ModSim) 2013, Seattle, WA, USA.
2013-09-04: Talk on Tools for Simulation and Benchmark Generation at Exascale at the 7th Parallel Tools Workshop, Dresden, Germany.
2013-09-03: Invited talk about Resilience Challenges and Solutions for Extreme-Scale Supercomputing at the Technical University of Dresden, Germany.
2013-08-26: Presentation by Thomas Naughton of the co-authored research paper, Using Performance Tools to Support Experiments in HPC Resilience, at the 6th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, held in conjunction with the 19th European Conference on Parallel and Distributed Computing (Euro-Par) 2013 Workshops, Aachen, Germany.
2013-08-15: The Hobbes: OS and Runtime Support for Application Composition project has been funded by the DOE’s Office of Advanced Scientific Computing Research under the Exascale Operating and Runtime Systems (ExaOS/R) program.
2013-07-28: The MCREX – Monte Carlo Resilient Exascale Solvers project has been funded by the DOE’s Office of Advanced Scientific Computing Research under the Resilient Extreme-Scale Solvers (RX-Solvers) program.
Important Peer-reviewed Journal Publications
- Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A. Chien, Paul Coteus, Nathan A. DeBardeleben, Pedro Diniz, Christian Engelmann, Mattan Erez, Saverio Fazzari, Al Geist, Rinku Gupta, Fred Johnson, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications (IJHPCA), 2014. SAGE Publications. ISSN 1094-3420.
- Christian Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. Future Generation Computer Systems (FGCS), volume 30, number 0, pages 59-65, 2014. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0167-739X.
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Journal of Parallel and Distributed Computing (JPDC), volume 72, number 2, pages 254-267, 2012. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0743-7315.
- Xubin (Ben) He, Li Ou, Christian Engelmann, Xin Chen, and Stephen L. Scott. Symmetric Active/Active Metadata Service for High Availability Parallel File Systems. Journal of Parallel and Distributed Computing (JPDC), volume 69, number 12, pages 961-973, 2009. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0743-7315.
- Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active High Availability for High-Performance Computing System Services. Journal of Computers (JCP), volume 1, number 8, pages 43-54, 2006. Academy Publisher, Oulu, Finland. ISSN 1796-203X.
Important Peer-reviewed Conference Publications
- David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012, pages 78:1-78:12, Salt Lake City, UT, USA, November 10-16, 2012. ACM Press, New York, NY, USA. ISBN 978-1-4673-0804-5. Acceptance rate 21.2% (100/472).
- James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS) 2012, pages 615-626, Macau, SAR, China, June 18-21, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4685-8. ISSN 1063-6927. Acceptance rate 13% (71/515).
- Chao Wang, Sudharshan S. Vazhkudai, Xiaosong Ma, Fei Meng, Youngjae Kim, and Christian Engelmann. NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines. In Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2012, pages 957-968, Shanghai, China, May 21-25, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4675-9. Acceptance rate 21% (118/569).
- Swen Böhm and Christian Engelmann. xSim: The Extreme-Scale Simulator. In Proceedings of the International Conference on High Performance Computing and Simulation (HPCS) 2011, pages 280-286, Istanbul, Turkey, July 4-8, 2011. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-61284-383-4. Acceptance rate 28.1% (48/171).
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Hybrid Checkpointing for MPI Jobs in HPC Environments. In Proceedings of the 16th IEEE International Conference on Parallel and Distributed Systems (ICPADS) 2010, pages 524-533, Shanghai, China, December 8-10, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4307-9. Acceptance rate 29.6% (77/188).
- Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim, Christian Engelmann, and Galen Shipman. Functional Partitioning to Optimize End-to-End Performance on Many-Core Architectures. In Proceedings of the 23rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2010, pages 1-12, New Orleans, LA, USA, November 13-19, 2010. ACM Press, New York, NY, USA. ISBN 978-1-4244-7559-9. Acceptance rate 19.8% (50/253).