About Me
Dr. Christian Engelmann is a Senior Scientist and the Intelligent Systems and Facilities Group Leader at Oak Ridge National Laboratory (ORNL), which is the US Department of Energy’s (DOE) largest multiprogram science and technology laboratory with an annual budget of $2.4 billion. He has more than 21 years of experience in software research and development for extreme-scale high-performance computing (HPC) systems, with a strong funding and publication record. In collaboration with other laboratories and universities, Dr. Engelmann’s research solves computer science challenges in HPC software, such as scalability, dependability, energy efficiency, and portability.
His primary expertise is in HPC resilience, i.e., providing efficiency and correctness in the presence of faults, errors, and failures through avoidance, masking, and recovery. Dr. Engelmann is a leading expert in HPC resilience and was a member of the DOE Technical Council on HPC Resilience from 2013 to 2015. He received the 2015 DOE Early Career Research Award for research in resilience design patterns for extreme-scale HPC. His secondary expertise is in lightweight simulation of future-generation extreme-scale supercomputers with millions of processing units, studying the impact of hardware and software properties on the key HPC system design factors: performance, resilience, and power consumption. Dr. Engelmann is also an expert in system software for parallel and distributed systems.
Dr. Engelmann is leading ORNL’s Intelligent Systems and Facilities Group, which addresses system software challenges for scientific instruments and facilities. The group collaborates with computer, computational, instrument and domain science experts across ORNL, other national laboratories, and universities to foster scientific leadership in smart systems, instruments, and facilities. The group’s research and development connects scientific instruments and facilities with edge and leadership computing capability and provides operational intelligence with machine-in-the-loop feedback. It enables science breakthroughs with autonomous experiments, “self-driving” laboratories, smart manufacturing, and AI-driven design, discovery and evaluation.
Dr. Engelmann earned a Dipl.-Ing. (FH), a German engineering degree and M.Sc. equivalent, in Computer Systems Engineering from the University of Applied Sciences Berlin, Germany, in 2001; an M.Sc. in Computer Science from the University of Reading, UK, also in 2001, as a conjoint degree; and a Ph.D. in Computer Science from the University of Reading in 2008. He is a Senior Member of the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE). He is also a Member of the Society for Industrial and Applied Mathematics (SIAM) and the Advanced Computing Systems Association (USENIX).
Download the NSF-style 2-page bio. Download the full list of publications. Resume available upon request.
Contact Information
Email: engelmannc@computer.org, engelmannc@ornl.gov
Address: P.O. Box 2008, Oak Ridge, TN 37831-6164, USA
Tel.: +1 (865) 574-3132
Fax: +1 (865) 576-5491
Scopus ID: 18037364000
Professional Accomplishments
- 14 Research grants ($31.57M, 6 as lead investigator)
- 6 Current direct reports
- 8 Co-advised M.Sc. theses
- 4 Mentored summer faculty
- 60 Invited talks and seminars
- 113 Peer-reviewed articles/papers: 13 Journal articles, 55 Conference papers, 45 Workshop papers
- 12 Peer-reviewed conference posters
- 4,560 Publication citations (H-index: 33, i10-index: 72)
- Erdős number: 3
- 178 Committees at 47 conference series
- 60 Journal article and book proposal reviews
- Awards: 2015 US Department of Energy Early Career Research Award
Ongoing Research Activities
2021-…: The Open Federated Architecture for the Laboratory of the Future project connects scientific instruments, robot-controlled laboratories and edge/center computing/data resources to enable autonomous experiments, “self-driving” laboratories, smart manufacturing, and AI-driven design, discovery and evaluation.
2015-…: The Resilience Design Patterns project will increase the ability of scientific applications to reach accurate solutions in a timely and efficient manner. Using a novel design pattern concept, it identifies and evaluates repeatedly occurring resilience problems and coordinates solutions throughout high-performance computing hardware and software.
In the News
2021-03-30: DOE Advanced Scientific Computing Research. New Approach to Fault Tolerance Means More Efficient High-Performance Computers.
2021-01-04: HPCwire. What’s New in HPC Research: GPU Lifetimes, the Square Kilometre Array, Support Tickets & More.
2018-11-19: HPCwire. What’s New in HPC Research: Thrill for Big Data, Scaling Resilience and More.
2018-08-05: insideHPC. Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems.
2015-07-15: ASCR Discovery. Mounting a charge. Early-career awardees attack exascale computing on two fronts: Power and resilience.
2015-07-15: HPCwire. Tackling Power and Resilience at Exascale.
2012-11-21: ComputerWorld. Supercomputers face growing resilience problems.
5 Most Cited Peer-Reviewed Publications
- A. B. Nagarajan, F. Mueller, C. Engelmann, and S. L. Scott. Proactive Fault Tolerance for HPC with Xen Virtualization. In Proceedings of the 21st ACM International Conference on Supercomputing (ICS) 2007, June, 2007. DOI 10.1145/1274971.1274978. Accept. rate 23.6% (29/123). 497 citations.
- M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji, J. Belak, P. Bose, F. Cappello, B. Carlson, A. A. Chien, P. Coteus, N. A. DeBardeleben, P. Diniz, C. Engelmann, M. Erez, S. Fazzari, A. Geist, R. Gupta, F. Johnson, S. Krishnamoorthy, S. Leyffer, D. Liberty, S. Mitra, T. Munson, R. Schreiber, J. Stearley, and E. V. Hensbergen. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications (IJHPCA), volume 28, number 2, May, 2014. DOI 10.1177/1094342014522573. 425 citations.
- D. Fiala, F. Mueller, C. Engelmann, K. Ferreira, R. Brightwell, and R. Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012, November, 2012. DOI 10.1109/SC.2012.49. Accept. rate 21.2% (100/472). 342 citations.
- C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of the 21st IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008, November, 2008. DOI 10.1145/1413370.1413414. Accept. rate 21.3% (59/277). 228 citations.
- J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira, and C. Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS) 2012, June, 2012. DOI 10.1109/ICDCS.2012.56. Accept. rate 13.8% (71/515). 187 citations.
5 Most Recent Peer-Reviewed Publications
- E. Agullo, M. Altenbernd, H. Anzt, L. Bautista-Gomez, T. Benacchio, L. Bonaventura, H. Bungartz, S. Chatterjee, F. M. Ciorba, N. DeBardeleben, D. Drzisga, S. Eibl, C. Engelmann, W. N. Gansterer, L. Giraud, D. Göddeke, M. Heisig, F. Jézéquel, N. Kohl, X. S. Li, R. Lion, M. Mehl, P. Mycek, M. Obersteiner, E. S. Quintana-Ortí, F. Rizzi, U. Rüde, M. Schulz, F. Fung, R. Speck, L. Stals, K. Teranishi, S. Thibault, D. Thönnes, A. Wagner, and B. Wohlmuth. Resiliency in Numerical Algorithm Design for Extreme Scale Simulations. International Journal of High Performance Computing Applications (IJHPCA), volume 36, number 2, March, 2022. DOI 10.1177/10943420211055188.
- M. Kumar and C. Engelmann. RDPM: An Extensible Tool for Resilience Design Patterns Modeling. In Lecture Notes in Computer Science: Proceedings of the 27th European Conference on Parallel and Distributed Computing (Euro-Par) 2021 Workshops: 14th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, August, 2021. DOI 10.1007/978-3-031-06156-1_23. Accept. rate 66.7% (4/6).
- M. Kumar, S. Gupta, T. Patel, M. Wilder, W. Shi, S. Fu, C. Engelmann, and D. Tiwari. Study of Interconnect Errors, Network Congestion, and Applications Characteristics for Throttle Prediction on a Large Scale HPC System. Journal of Parallel and Distributed Computing (JPDC), volume 153, July, 2021. DOI 10.1016/j.jpdc.2021.03.001.
- S. Hukerikar and C. Engelmann. PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems. In Proceedings of the 25th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) 2020, December, 2020. DOI 10.1109/PRDC50213.2020.00014. Accept. rate 40.9% (18/44).
- G. Ostrouchov, D. Maxwell, R. Ashraf, C. Engelmann, M. Shankar, and J. Rogers. GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability. In Proceedings of the 33rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2020, November, 2020. DOI 10.1109/SC41405.2020.00045. Accept. rate 25.1% (95/378).