Dr. Christian Engelmann is an R&D Staff Scientist in the Computer Science Research Group at Oak Ridge National Laboratory, which is the US Department of Energy’s (DOE) largest multiprogram science and technology laboratory with an annual budget of $1.4 billion. He has 16 years of experience in software research and development for extreme-scale high-performance computing (HPC) systems with a strong funding and publication record. In collaboration with other laboratories and universities, Dr. Engelmann’s research solves computer science challenges in HPC software, such as scalability, dependability, energy efficiency, and portability.
His primary expertise is in HPC resilience, i.e., providing efficiency and correctness in the presence of faults, errors, and failures through avoidance, masking, and recovery. Dr. Engelmann is a leading expert in HPC resilience and a member of the DOE Technical Council on HPC Resilience. He received the 2015 DOE Early Career Award for research in resilience design patterns for extreme-scale HPC.
His secondary expertise is in lightweight simulation of future-generation extreme-scale supercomputers with millions of processors, studying the impact of hardware and software properties on the key HPC system design factors: performance, resilience, and power consumption.
Dr. Engelmann earned an M.Sc. in Computer Systems Engineering from the University of Applied Sciences Berlin, Germany, in 2001, an M.Sc. in Computer Science from the University of Reading, UK, also in 2001 as part of a double diploma, and a Ph.D. in Computer Science from the University of Reading in 2008. He is a member of the Association for Computing Machinery (ACM), the Institute of Electrical and Electronics Engineers (IEEE), and the Advanced Computing Systems Association (USENIX).
Mail: P.O. Box 2008, Oak Ridge, TN 37831-6016, USA
Phone: +1 (865) 574-3132
Fax: +1 (865) 576-5491
In the News
- ASCR Discovery – Mounting a charge. Early-career awardees attack exascale computing on two fronts: power and resilience.
- HPCwire – Tackling Power and Resilience at Exascale.
| 12 Research grants ($29.4M, 4 as lead-PI) | 10 Peer-reviewed journal articles | 44 Invited talks and seminars |
| 8 Co-advised Master's theses | 45 Peer-reviewed conference papers | 125 Committees at 40 conference series |
| 4 Mentored summer faculty | 34 Peer-reviewed workshop papers | 50 Journal article and book proposal reviews |
| 10 Direct reports over the past 10 years | 8 Peer-reviewed conference posters | 11 Conference booth exhibitions |
| Erdős number of 3 | 2,200+ Total publication citations | H-index of 25 / i10-index of 47 |
| Awards: 2015 US Department of Energy Early Career Research Award |
Ongoing Research Activities
2015-…: The Resilience Design Patterns project will increase the ability of scientific applications to reach accurate solutions in a timely and efficient manner. Using a novel design pattern concept, it identifies and evaluates repeatedly occurring resilience problems and coordinates solutions throughout high-performance computing hardware and software …
2015-…: The Characterizing Faults, Errors, and Failures in Extreme-Scale Systems (Catalog) project identifies, categorizes, and models the fault, error, and failure properties of US Department of Energy high-performance computing (HPC) systems. It develops a fault taxonomy, catalog, and models that capture the observed and inferred conditions in current systems and extrapolate this knowledge to exascale HPC systems …
2016-12-31: Publication of the Resilience Design Pattern specification, version 1.1, as a technical report.
2016-11-14: With 261,632 NVIDIA K20X accelerator cores and 298,592 AMD Opteron cores, ORNL’s Titan Cray XK7 supercomputer continues to be 3rd in the Top 500 List of supercomputers with a LINPACK performance of 17.59 PFlops and a power consumption of 8.2 MW.
2016-11-13: Technical program chair at the 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), held in conjunction with the 29th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC), Salt Lake City, UT, USA.
2016-10-31: Publication of the Resilience Design Pattern specification, version 1.0, as a technical report.
2016-09-28: Presentation by Saurabh Hukerikar of the co-authored research paper, Language Support for Reliable Memory Regions, at the 29th International Workshop on Languages and Compilers for Parallel Computing, Rochester, NY, USA.
2016-09-21: Presentation of the co-authored research paper, Benchmark Generation and Simulation at Extreme Scale, at the 20th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2016, London, UK. Best paper candidate.
2016-09-14: Presentation by Saurabh Hukerikar of the co-authored research paper, Havens: Explicit Reliable Memory Regions for HPC Applications, at the 20th IEEE High Performance Extreme Computing Conference (HPEC) 2016, Waltham, MA, USA.
2016-08-23: Presentation by Thomas Naughton of the co-authored research paper, A Cooperative Approach to Virtual Machine Based Fault Injection, at the 9th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, Grenoble, France.
2016-06-29: Presentation by Kun Tang of the co-authored research paper, Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy, at the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Toulouse, France.
2016-06-01: Chaired the System Software Session at the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) 2016: 6th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS) 2016, Kyoto, Japan.
2016-05-31: Presentation of the co-authored research paper, Adding Fault Tolerance to NPB Benchmarks Using ULFM, at the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) 2016: 6th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2016, Kyoto, Japan.
2016-05-24: Presentation by Leonardo Bautista-Gomez of the co-authored research paper, Reducing Waste in Large Scale Systems Through Introspective Analysis, at the 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016, Chicago, IL, USA.
2016-04-14: Invited talk on The Missing High-Performance Computing Fault Model at the 17th SIAM Conference on Parallel Processing for Scientific Computing (PP) 2016, Paris, France.
2016-02-18: Seminar on Resilience Challenges and Solutions for Extreme-Scale Supercomputing at the United States Naval Academy, Annapolis, MD, USA.
2016-02-16: Presentation by Thomas Naughton of the co-authored research paper, Supporting the Development of Soft-Error Resilient Message Passing Applications using Simulation, at the 13th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2016, Innsbruck, Austria.
Important Peer-reviewed Journal Publications
- Christian Engelmann and Thomas Naughton. A New Deadlock Resolution Protocol and Message Matching Algorithm for the Extreme-scale Simulator. Concurrency and Computation: Practice and Experience, volume 28, number 12, pages 3369-3389, 2016. John Wiley & Sons, Inc. ISSN 1532-0634.
- Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A. Chien, Paul Coteus, Nathan A. Debardeleben, Pedro Diniz, Christian Engelmann, Mattan Erez, Saverio Fazzari, Al Geist, Rinku Gupta, Fred Johnson, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications (IJHPCA), volume 28, number 2, pages 127-171, 2014. SAGE Publications. ISSN 1094-3420.
- Christian Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. Future Generation Computer Systems (FGCS), volume 30, number 0, pages 59-65, 2014. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0167-739X.
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Journal of Parallel and Distributed Computing (JPDC), volume 72, number 2, pages 254-267, 2012. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0743-7315.
- Xubin (Ben) He, Li Ou, Christian Engelmann, Xin Chen, and Stephen L. Scott. Symmetric Active/Active Metadata Service for High Availability Parallel File Systems. Journal of Parallel and Distributed Computing (JPDC), volume 69, number 12, pages 961-973, 2009. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0743-7315.
- Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active High Availability for High-Performance Computing System Services. Journal of Computers (JCP), volume 1, number 8, pages 43-54, 2006. Academy Publisher, Oulu, Finland. ISSN 1796-203X.
Important Peer-reviewed Conference Publications
- Kun Tang, Devesh Tiwari, Saurabh Gupta, Ping Huang, QiQi Lu, Christian Engelmann, and Xubin He. Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy. In Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 311-322, Toulouse, France, June 28 – July 1, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 2158-3927. Acceptance rate 22.4% (58/259).
- David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Mini-Ckpts: Surviving OS Failures in Persistent Memory. In Proceedings of the 30th ACM International Conference on Supercomputing (ICS) 2016, pages 7:1-7:14, Istanbul, Turkey, June 1-3, 2016. ACM Press, New York, NY, USA. ISBN 978-1-4503-4361-9. Acceptance rate 24.2% (43/178).
- Leonardo Bautista-Gomez, Ana Gainaru, Swann Perarnau, Devesh Tiwari, Saurabh Gupta, Franck Cappello, Christian Engelmann, and Marc Snir. Reducing Waste in Large Scale Systems Through Introspective Analysis. In Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016, pages 212-221, Chicago, IL, USA, May 23-27, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1530-2075. Acceptance rate 23.0% (114/496).
- David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012, pages 78:1-78:12, Salt Lake City, UT, USA, November 10-16, 2012. ACM Press, New York, NY, USA. ISBN 978-1-4673-0804-5. Acceptance rate 21.2% (100/472).
- James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS) 2012, pages 615-626, Macau, SAR, China, June 18-21, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4685-8. ISSN 1063-6927. Acceptance rate 13% (71/515).
- Chao Wang, Sudharshan S. Vazhkudai, Xiaosong Ma, Fei Meng, Youngjae Kim, and Christian Engelmann. NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines. In Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2012, pages 957-968, Shanghai, China, May 21-25, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4675-9. Acceptance rate 21% (118/569).
- Swen Böhm and Christian Engelmann. xSim: The Extreme-Scale Simulator. In Proceedings of the International Conference on High Performance Computing and Simulation (HPCS) 2011, pages 280-286, Istanbul, Turkey, July 4-8, 2011. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-61284-383-4. Acceptance rate 28.1% (48/171).
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Hybrid Checkpointing for MPI Jobs in HPC Environments. In Proceedings of the 16th IEEE International Conference on Parallel and Distributed Systems (ICPADS) 2010, pages 524-533, Shanghai, China, December 8-10, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4307-9. Acceptance rate 29.6% (77/260).
- Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim, Christian Engelmann, and Galen Shipman. Functional Partitioning to Optimize End-to-End Performance on Many-Core Architectures. In Proceedings of the 23rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2010, pages 1-12, New Orleans, LA, USA, November 13-19, 2010. ACM Press, New York, NY, USA. ISBN 978-1-4244-7559-9. Acceptance rate 19.8% (50/253).