About Me

October 21st, 2018

Dr. Christian Engelmann is a Senior R&D Staff Scientist in the Computer Science Research Group at Oak Ridge National Laboratory, which is the US Department of Energy’s (DOE) largest multiprogram science and technology laboratory with an annual budget of $1.4 billion. He has 17 years of experience in software research and development for extreme-scale high-performance computing (HPC) systems, with a strong funding and publication record. In collaboration with other laboratories and universities, Dr. Engelmann’s research solves computer science challenges in HPC software, such as scalability, dependability, energy efficiency, and portability.

His primary expertise is in HPC resilience, i.e., providing efficiency and correctness in the presence of faults, errors, and failures through avoidance, masking, and recovery. Dr. Engelmann is a leading expert in HPC resilience and was a member of the DOE Technical Council on HPC Resilience. He received the 2015 DOE Early Career Award for research in resilience design patterns for extreme scale HPC. His secondary expertise is in lightweight simulation of future-generation extreme-scale supercomputers with millions of processors, studying the impact of hardware and software properties on the key HPC system design factors: performance, resilience, and power consumption.

Dr. Engelmann earned an M.Sc. in Computer Systems Engineering from the University of Applied Sciences Berlin, Germany, in 2001, an M.Sc. in Computer Science from the University of Reading, UK, also in 2001 as part of a double diploma, and a Ph.D. in Computer Science from the University of Reading in 2008. He is a Senior Member of the Association for Computing Machinery (ACM) and a Member of the Institute of Electrical and Electronics Engineers (IEEE), the Society for Industrial and Applied Mathematics (SIAM), and the Advanced Computing Systems Association (USENIX).

Download NSF-style 2-page bio. Download full list of publications. Resume available upon request.

Contact Information

E-mail: engelmannc@computer.org | engelmannc@ornl.gov
Mail: P.O. Box 2008, Oak Ridge, TN 37831-6016, USA
Phone: +1 (865) 574-3132
Fax: +1 (865) 576-5491

View Christian Engelmann's profile on LinkedIn
View Christian Engelmann's profile on Google Scholar
View Christian Engelmann's profile on Facebook
ORCID iD: orcid.org/0000-0003-4365-6416
Scopus ID: 18037364000

In the News

2018-08-05: insideHPC – Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems.
2015-07-15: ASCR Discovery – Mounting a charge. Early-career awardees attack exascale computing on two fronts: power and resilience.
2015-07-15: HPC Wire – Tackling Power and Resilience at Exascale.

Professional Accomplishments

13 Research grants ($29.8M, 4 as lead-PI)
8 Co-advised Master theses
4 Mentored summer faculty
10 Direct reports over the past 10 years
11 Peer-reviewed journal articles
52 Peer-reviewed conference papers
41 Peer-reviewed workshop papers
11 Peer-reviewed conference posters
52 Invited talks and seminars
144 Committees at 44 conference series
50 Journal article and book proposal reviews
11 Conference booth exhibitions
Erdős number of 3
2,900+ Total publication citations
H-index of 28 / i10-index of 55
Awards: 2015 US Department of Energy Early Career Research Award

Ongoing Research Activities

2018-…: The rOpenMP project performs research to enable fine-grain resilience for supercomputers with accelerators that is more efficient than traditional application-level checkpoint/restart. The approach centers on a novel concept for quality of service and corresponding extensions for the OpenMP parallel programming model … more.
2015-…: The Resilience Design Patterns project will increase the ability of scientific applications to reach accurate solutions in a timely and efficient manner. Using a novel design pattern concept, it identifies and evaluates repeatedly occurring resilience problems and coordinates solutions throughout high-performance computing hardware and software … more.
2015-…: The Characterizing Faults, Errors, and Failures in Extreme-Scale Systems (Catalog) project identifies, categorizes and models the fault, error and failure properties of US Department of Energy high-performance computing (HPC) systems. It develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current systems and extrapolate this knowledge to exascale HPC systems … more

Recent Events

2018-10-21: Presentation of the co-authored research poster, A Comprehensive Informative Metric for Summarizing HPC System Status, 8th IEEE Symposium on Large Data Analysis and Visualization in conjunction with the 8th IEEE Vis 2018, Berlin, Germany.
2018-08-28: Presentation of the co-authored research paper, Performance Efficient Multiresilience using Checkpoint Recovery in Iterative Algorithms, 11th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, held in conjunction with the 24th European Conference on Parallel and Distributed Computing (Euro-Par) 2018 Workshops, Turin, Italy.
2018-07-03: Invited talk on Characterizing Faults, Errors, and Failures in Extreme-Scale Systems at the Platform for Advanced Scientific Computing (PASC) Conference 2018, Basel, Switzerland.
2018-06-25: With 27,648 NVIDIA V100 Volta GPUs and 9,216 22-core 4-way SMT IBM Power 9 processors in 4,608 compute nodes, ORNL’s Summit IBM supercomputer is 1st in the Top 500 List of supercomputers with a LINPACK performance of 122.3 PFlops and a power consumption of 8.8 MW.
2018-06-25: With 18,688 NVIDIA K20X Kepler GPUs and 18,688 16-core AMD Opteron processors in 18,688 compute nodes, ORNL’s Titan Cray XK7 supercomputer is 7th in the Top 500 List of supercomputers with a LINPACK performance of 17.95 PFlops and a power consumption of 8.2 MW.

Important Peer-reviewed Journal Publications

Symbols: Abstract, Publication, BibTeX Citation

  1. Saurabh Hukerikar and Christian Engelmann. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale. Journal of Supercomputing Frontiers and Innovations (JSFI), volume 4, number 3, pages 4-42, 2017. South Ural State University Chelyabinsk, Russia. ISSN 2409-6008. DOI 10.14529/jsfi170301. Abstract Publication BibTeX Citation
  2. Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A. Chien, Paul Coteus, Nathan A. Debardeleben, Pedro Diniz, Christian Engelmann, Mattan Erez, Saverio Fazzari, Al Geist, Rinku Gupta, Fred Johnson, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications (IJHPCA), volume 28, number 2, pages 127-171, 2014. SAGE Publications. ISSN 1094-3420. DOI 10.1177/1094342014522573. Abstract Publication BibTeX Citation
  3. Christian Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. Future Generation Computer Systems (FGCS), volume 30, number 0, pages 59-65, 2014. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0167-739X. DOI 10.1016/j.future.2013.04.014. Abstract Publication BibTeX Citation
  4. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Journal of Parallel and Distributed Computing (JPDC), volume 72, number 2, pages 254-267, 2012. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0743-7315. DOI 10.1016/j.jpdc.2011.10.009. Abstract Publication BibTeX Citation
  5. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active High Availability for High-Performance Computing System Services. Journal of Computers (JCP), volume 1, number 8, pages 43-54, 2006. Academy Publisher, Oulu, Finland. ISSN 1796-203X. DOI jcp/vol01/no08/jcp01084354.html. Abstract Publication BibTeX Citation

Important Peer-reviewed Conference Publications

Symbols: Abstract, Publication, Presentation, BibTeX Citation

  1. Mohit Kumar, Saurabh Gupta, Tirthak Patel, Michael Wilder, Weisong Shi, Song Fu, Christian Engelmann, and Devesh Tiwari. Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System. In Proceedings of the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018, pages 107-114, Luxembourg City, Luxembourg, June 25-28, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-5596-2. ISSN 2158-3927. DOI 10.1109/DSN.2018.00023. Acceptance rate 27.2% (62/228). Abstract Publication BibTeX Citation
  2. Rizwan Ashraf, Saurabh Hukerikar, and Christian Engelmann. Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing. In Proceedings of the 9th ACM/SPEC International Conference on Performance Engineering (ICPE) 2018, pages 80-87, Berlin, Germany, April 9-13, 2018. ACM Press, New York, NY, USA. ISBN 978-1-4503-5095-2. DOI 10.1145/3184407.3184421. Acceptance rate 23.7% (14/59). Abstract Publication Presentation BibTeX Citation
  3. Saurabh Gupta, Tirthak Patel, Christian Engelmann, and Devesh Tiwari. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications. In Proceedings of the 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017, pages 44:1-44:12, Denver, CO, USA, November 12-17, 2017. ACM Press, New York, NY, USA. ISBN 978-1-4503-5114-0. DOI 10.1145/3126908.3126937. Acceptance rate 18.7% (61/327). Abstract Publication Presentation BibTeX Citation
  4. David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012, pages 78:1-78:12, Salt Lake City, UT, USA, November 10-16, 2012. ACM Press, New York, NY, USA. ISBN 978-1-4673-0804-5. Acceptance rate 21.2% (100/472). Abstract Publication Presentation BibTeX Citation
  5. Chao Wang, Sudharshan S. Vazhkudai, Xiaosong Ma, Fei Meng, Youngjae Kim, and Christian Engelmann. NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines. In Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2012, pages 957-968, Shanghai, China, May 21-25, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4675-9. DOI 10.1109/IPDPS.2012.90. Acceptance rate 20.7% (118/569). Abstract Publication Presentation BibTeX Citation