About Me

January 10th, 2020

Dr. Christian Engelmann is a Senior R&D Staff Scientist in the Computer Science Research Group at Oak Ridge National Laboratory, the US Department of Energy’s (DOE) largest multiprogram science and technology laboratory, with an annual budget of $1.6 billion. He has more than 19 years of experience in software research and development for extreme-scale high-performance computing (HPC) systems and a strong funding and publication record. In collaboration with other laboratories and universities, Dr. Engelmann’s research addresses computer science challenges in HPC software, such as scalability, dependability, energy efficiency, and portability.

His primary expertise is in HPC resilience, i.e., providing efficiency and correctness in the presence of faults, errors, and failures through avoidance, masking, and recovery. Dr. Engelmann is a leading expert in this field and was a member of the DOE Technical Council on HPC Resilience from 2013 to 2015. He received the 2015 DOE Early Career Award for research in resilience design patterns for extreme-scale HPC. His secondary expertise is in lightweight simulation of future-generation extreme-scale supercomputers with millions of processing units, studying the impact of hardware and software properties on the key HPC system design factors: performance, resilience, and power consumption.

Dr. Engelmann earned an M.Sc. in Computer Systems Engineering from the University of Applied Sciences Berlin, Germany, in 2001, an M.Sc. in Computer Science from the University of Reading, UK, also in 2001 as a conjoint degree, and a Ph.D. in Computer Science from the University of Reading in 2008. He is a Senior Member of the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE). He is also a Member of the Society for Industrial and Applied Mathematics (SIAM) and the Advanced Computing Systems Association (USENIX).

Download the NSF-style 2-page bio. Download the full list of publications. A resume is available upon request.

Contact Information

E-mail: engelmannc@computer.org | engelmannc@ornl.gov
Mail: P.O. Box 2008, Oak Ridge, TN 37831-6164, USA
Phone: +1 (865) 574-3132
Fax: +1 (865) 576-5491
DBLP: Christian Engelmann
ORCID iD: orcid.org/0000-0003-4365-6416
Scopus ID: 18037364000

In the News

2018-11-19: HPCwire: What’s New in HPC Research: Thrill for Big Data, Scaling Resilience and More
2018-08-05: insideHPC: Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems
2015-07-15: ASCR Discovery – Mounting a charge. Early-career awardees attack exascale computing on two fronts: power and resilience.
2015-07-15: HPCwire – Tackling Power and Resilience at Exascale.
2015-07-15: ComputerWorld Australia – Supercomputers face growing resilience problems.

Professional Accomplishments

13 Research grants ($29.45M, 5 as lead investigator) | 11 Peer-reviewed journal articles | 56 Invited talks and seminars
8 Co-advised Master theses | 53 Peer-reviewed conference papers | 159 Committees at 44 conference series
4 Mentored summer faculty | 43 Peer-reviewed workshop papers | 54 Journal article and book proposal reviews
12 Direct reports over the past 10 years | 12 Peer-reviewed conference posters | 11 Conference booth exhibitions
Erdős number of 3 | 3,500+ Total publication citations | H-index of 29 / i10-index of 62
Awards: 2015 US Department of Energy Early Career Research Award

Ongoing Research Activities

2015-…: The Resilience Design Patterns project will increase the ability of scientific applications to reach accurate solutions in a timely and efficient manner. Using a novel design pattern concept, it identifies and evaluates repeatedly occurring resilience problems and coordinates solutions throughout high-performance computing hardware and software …

Most Recent Publications

  1. Haewon Jeong, Yaoqing Yang, Christian Engelmann, Vipul Gupta, Tze Meng Low, Pulkit Grover, Viveck Cadambe, and Kannan Ramchandran. 3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication. In Lecture Notes in Computer Science: Proceedings of the 26th European Conference on Parallel and Distributed Computing (Euro-Par) 2020, Warsaw, Poland, August 24-28, 2020. Springer Verlag, Berlin, Germany. Acceptance rate 24.5% (39/159).
  2. Petar Radojkovic, Manolis Marazakis, Paul Carpenter, Reiley Jeyapaul, Dimitris Gizopoulos, Martin Schulz, Adria Armejach, Eduard Ayguade, François Bodin, Ramon Canal, Franck Cappello, Fabien Chaix, Guillaume Colin de Verdiere, Said Derradji, Stefano Di Carlo, Christian Engelmann, Ignacio Laguna, Miquel Moreto, Onur Mutlu, Lazaros Papadopoulos, Olly Perks, Manolis Ploumidis, Bezhad Salami, Yanos Sazeides, Dimitrios Soudris, Yiannis Sourdis, Per Stenstrom, Samuel Thibault, Will Toms, and Osman Unsal. Towards Resilient EU HPC Systems: A Blueprint. White paper submitted to the European HPC resilience initiative, April 9, 2020.
  3. Christian Engelmann. The Resilience Problem in Extreme Scale Computing. Invited talk at the 19th SIAM Conference on Parallel Processing for Scientific Computing (PP) 2020, Seattle, WA, USA, February 12-15, 2020.
  4. Piyush Sao, Christian Engelmann, Srinivas Eswar, Oded Green, and Richard Vuduc. Self-stabilizing Connected Components. In Proceedings of the 32nd International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2019: 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2019, pages 50-59, Denver, CO, USA, November 22, 2019. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-6013-9. DOI 10.1109/FTXS49593.2019.00011. Acceptance rate 60.0% (6/10).
  5. Christian Engelmann. Resilience in Parallel Programming Environments. Invited talk at the 8th Accelerated Data Analytics and Computing (ADAC) Institute Workshop, Tokyo, Japan, October 30-31, 2019.
  6. Christian Engelmann, Geoffroy R. Vallée, and Swaroop Pophale. Concepts for OpenMP Target Offload Resilience. In Lecture Notes in Computer Science: Proceedings of the 15th International Workshop on OpenMP (IWOMP) 2019, pages 78-93, Auckland, New Zealand, September 11-13, 2019. Springer Verlag, Berlin, Germany. ISBN 978-3-030-28595-1. DOI 10.1007/978-3-030-28596-8_6.
  7. Yawei Hui, Rizwan Ashraf, Byung Hoon (Hoony) Park, and Christian Engelmann. Real-Time Assessment of Supercomputer Status by a Comprehensive Informative Metric through Streaming Processing. Poster at the 6th IEEE International Conference on Big Data (BigData) 2018, Seattle, WA, USA, December 10-13, 2018.
  8. Yawei Hui, Byung Hoon (Hoony) Park, and Christian Engelmann. A Comprehensive Informative Metric for Analyzing HPC System Status using the LogSCAN Platform. In Proceedings of the 31st International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2018: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2018, pages 29-38, Dallas, TX, USA, November 16, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-0222-1. DOI 10.1109/FTXS.2018.00007. Acceptance rate 45.0% (9/20).
  9. Rizwan Ashraf and Christian Engelmann. Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer. In Proceedings of the 31st International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2018: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2018, pages 39-48, Dallas, TX, USA, November 16, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-0222-1. DOI 10.1109/FTXS.2018.00008. Acceptance rate 45.0% (9/20).
  10. Yawei Hui, Byung Hoon (Hoony) Park, and Christian Engelmann. A Comprehensive Informative Metric for Summarizing HPC System Status. Poster at the 8th IEEE Symposium on Large Data Analysis and Visualization in conjunction with the 8th IEEE Vis 2018, Berlin, Germany, October 21, 2018.

Most Cited Publications

  1. Arun B. Nagarajan, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Fault Tolerance for HPC with Xen Virtualization. In Proceedings of the 21st ACM International Conference on Supercomputing (ICS) 2007, pages 23-32, Seattle, WA, USA, June 16-20, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1. DOI 10.1145/1274971.1274978. Acceptance rate 23.6% (29/123). 445 citations.
  2. Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A. Chien, Paul Coteus, Nathan A. Debardeleben, Pedro Diniz, Christian Engelmann, Mattan Erez, Saverio Fazzari, Al Geist, Rinku Gupta, Fred Johnson, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications (IJHPCA), volume 28, number 2, pages 127-171, May 1, 2014. SAGE Publications. ISSN 1094-3420. DOI 10.1177/1094342014522573. 311 citations.
  3. David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012, pages 78:1-78:12, Salt Lake City, UT, USA, November 10-16, 2012. ACM Press, New York, NY, USA. ISBN 978-1-4673-0804-5. DOI 10.1109/SC.2012.49. Acceptance rate 21.2% (100/472). 273 citations.
  4. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of the 21st IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008, pages 1-12, Austin, TX, USA, November 15-21, 2008. ACM Press, New York, NY, USA. ISBN 978-1-4244-2835-9. DOI 10.1145/1413370.1413414. Acceptance rate 21.3% (59/277). 199 citations.
  5. James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS) 2012, pages 615-626, Macau, SAR, China, June 18-21, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4685-8. ISSN 1063-6927. DOI 10.1109/ICDCS.2012.56. Acceptance rate 13.8% (71/515). 149 citations.
  6. Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, and Stephen L. Scott. Proactive Fault Tolerance Using Preemptive Migration. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, pages 252-257, Weimar, Germany, February 18-20, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3544-9. ISSN 1066-6192. DOI 10.1109/PDP.2009.31. Acceptance rate 42.0% (58/138). 107 citations.
  7. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2007, pages 1-10, Long Beach, CA, USA, March 26-30, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1. DOI 10.1109/IPDPS.2007.370307. Acceptance rate 26% (109/419). 103 citations.
  8. Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim, Christian Engelmann, and Galen Shipman. Functional Partitioning to Optimize End-to-End Performance on Many-Core Architectures. In Proceedings of the 23rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2010, pages 1-12, New Orleans, LA, USA, November 13-19, 2010. ACM Press, New York, NY, USA. ISBN 978-1-4244-7559-9. DOI 10.1109/SC.2010.28. Acceptance rate 19.8% (50/253). 91 citations.
  9. Christian Engelmann, Hong H. Ong, and Stephen L. Scott. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems. In Proceedings of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009, pages 189-194, Innsbruck, Austria, February 16-18, 2009. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-784-0. 90 citations.
  10. Chao Wang, Sudharshan S. Vazhkudai, Xiaosong Ma, Fei Meng, Youngjae Kim, and Christian Engelmann. NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines. In Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2012, pages 957-968, Shanghai, China, May 21-25, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4675-9. DOI 10.1109/IPDPS.2012.90. Acceptance rate 20.7% (118/569). 72 citations.
