About Me

January 10th, 2020

Dr. Christian Engelmann is a Senior R&D Staff Scientist in the Computer Science Research Group at Oak Ridge National Laboratory, which is the US Department of Energy’s (DOE) largest multiprogram science and technology laboratory with an annual budget of $1.6 billion. He has more than 19 years of experience in software research and development for extreme-scale high-performance computing (HPC) systems with a strong funding and publication record. In collaboration with other laboratories and universities, Dr. Engelmann’s research solves computer science challenges in HPC software, such as scalability, dependability, energy efficiency, and portability.

His primary expertise is in HPC resilience, i.e., providing efficiency and correctness in the presence of faults, errors, and failures through avoidance, masking, and recovery. Dr. Engelmann is a leading expert in HPC resilience and was a member of the DOE Technical Council on HPC Resilience from 2013 to 2015. He received the 2015 DOE Early Career Award for research in resilience design patterns for extreme-scale HPC. His secondary expertise is in lightweight simulation of future-generation extreme-scale supercomputers with millions of processing units, studying the impact of hardware and software properties on the key HPC system design factors: performance, resilience, and power consumption.

Dr. Engelmann earned an M.Sc. in Computer Systems Engineering from the University of Applied Sciences Berlin, Germany, in 2001, an M.Sc. in Computer Science from the University of Reading, UK, also in 2001 as a conjoint degree, and a Ph.D. in Computer Science from the University of Reading in 2008. He is a Senior Member of the Association for Computing Machinery (ACM) and a Member of the Institute of Electrical and Electronics Engineers (IEEE), the Society for Industrial and Applied Mathematics (SIAM), and the Advanced Computing Systems Association (USENIX).

Download the NSF-style 2-page bio. Download the full list of publications. A resume is available upon request.

Contact Information

e-Mail: engelmannc@computer.org|engelmannc@ornl.gov
Mail: P.O. Box 2008, Oak Ridge, TN 37831-6164, USA
Phone: +1 (865) 574-3132
Fax: +1 (865) 576-5491
Profiles: LinkedIn, Google Scholar, Facebook
DBLP: Christian Engelmann
ORCID iD: orcid.org/0000-0003-4365-6416
Scopus ID: 18037364000

Job Opportunities

In the News

2018-11-19: HPCwire: What’s New in HPC Research: Thrill for Big Data, Scaling Resilience and More
2018-08-05: insideHPC: Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems
2015-07-15: ASCR Discovery: Mounting a charge. Early-career awardees attack exascale computing on two fronts: power and resilience.
2015-07-15: HPCwire: Tackling Power and Resilience at Exascale.
2015-07-15: ComputerWorld Australia: Supercomputers face growing resilience problems.

Professional Accomplishments

13 Research grants ($29.5M, 5 as lead investigator)
8 Co-advised Master theses
4 Mentored summer faculty
12 Direct reports over the past 10 years
Erdős number of 3
11 Peer-reviewed journal articles
52 Peer-reviewed conference papers
43 Peer-reviewed workshop papers
12 Peer-reviewed conference posters
3,400+ Total publication citations
55 Invited talks and seminars
154 Committees at 44 conference series
54 Journal article and book proposal reviews
11 Conference booth exhibitions
H-index of 29 / i10-index of 61
Awards: 2015 US Department of Energy Early Career Research Award

Ongoing Research Activities

2015-…: The Resilience Design Patterns project will increase the ability of scientific applications to reach accurate solutions in a timely and efficient manner. Using a novel design pattern concept, it identifies and evaluates repeatedly occurring resilience problems and coordinates solutions throughout high-performance computing hardware and software …

Recent Events

2019-11-22: Presentation by Piyush Sao of the co-authored research paper, Self-stabilizing Connected Components, at the 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2019, held in conjunction with the 32nd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC), Denver, CO, USA.
2019-11-18: Technical program chair at the 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), held in conjunction with the 32nd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC), Denver, CO, USA.
2019-11-18: With 27,648 NVIDIA V100 Volta GPUs and 9,216 22-core 4-way SMT IBM Power 9 processors in 4,608 compute nodes, ORNL’s Summit IBM supercomputer is 1st in the Top 500 List of fastest supercomputers and 5th in the Top 500 List of most energy-efficient supercomputers with a LINPACK performance of 148.6 PFlops, a power consumption of 10.1 MW and an energy efficiency of 14.7 GFlops/Watt.
2019-10-30: Invited talk on Resilience in Parallel Programming Environments at the 8th Accelerated Data Analytics and Computing (ADAC) Institute Workshop, Tokyo, Japan.
2019-09-12: Presentation of the co-authored research paper, Concepts for OpenMP Target Offload Resilience, at the 15th International Workshop on OpenMP (IWOMP) 2019, Auckland, New Zealand.
2019-08-27: Technical program chair at the 12th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, held in conjunction with the 25th International European Conference on Parallel and Distributed Computing (Euro-Par) 2019, Göttingen, Germany.
2019-03-28: Invited talk on Resilience by Design (and not as an Afterthought) at the 23rd Workshop on Distributed Supercomputing (SOS) 2019, Asheville, NC, USA.
2019-02-28: Invited talk on Resilience for Extreme Scale Systems: Understanding the Problem at the SIAM Conference on Computational Science and Engineering (CSE) 2019, Spokane, WA, USA.

Most Cited Publications

  1. Arun B. Nagarajan, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Fault Tolerance for HPC with Xen Virtualization. In Proceedings of the 21st ACM International Conference on Supercomputing (ICS) 2007, pages 23-32, Seattle, WA, USA, June 16-20, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1. DOI 10.1145/1274971.1274978. Acceptance rate 23.6% (29/123). 441 citations.
  2. Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A. Chien, Paul Coteus, Nathan A. Debardeleben, Pedro Diniz, Christian Engelmann, Mattan Erez, Saverio Fazzari, Al Geist, Rinku Gupta, Fred Johnson, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications (IJHPCA), volume 28, number 2, pages 127-171, 2014. SAGE Publications. ISSN 1094-3420. DOI 10.1177/1094342014522573. 301 citations.
  3. David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012, pages 78:1-78:12, Salt Lake City, UT, USA, November 10-16, 2012. ACM Press, New York, NY, USA. ISBN 978-1-4673-0804-5. DOI 10.1109/SC.2012.49. Acceptance rate 21.2% (100/472). 270 citations.
  4. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of the 21st IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008, pages 1-12, Austin, TX, USA, November 15-21, 2008. ACM Press, New York, NY, USA. ISBN 978-1-4244-2835-9. DOI 10.1145/1413370.1413414. Acceptance rate 21.3% (59/277). 195 citations.
  5. James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS) 2012, pages 615-626, Macau, SAR, China, June 18-21, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4685-8. ISSN 1063-6927. DOI 10.1109/ICDCS.2012.56. Acceptance rate 13.8% (71/515). 146 citations.
  6. Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, and Stephen L. Scott. Proactive Fault Tolerance Using Preemptive Migration. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, pages 252-257, Weimar, Germany, February 18-20, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3544-9. ISSN 1066-6192. DOI 10.1109/PDP.2009.31. Acceptance rate 42.0% (58/138). 106 citations.
  7. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2007, pages 1-10, Long Beach, CA, USA, March 26-30, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1. DOI 10.1109/IPDPS.2007.370307. Acceptance rate 26% (109/419). 102 citations.
  8. Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim, Christian Engelmann, and Galen Shipman. Functional Partitioning to Optimize End-to-End Performance on Many-Core Architectures. In Proceedings of the 23rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2010, pages 1-12, New Orleans, LA, USA, November 13-19, 2010. ACM Press, New York, NY, USA. ISBN 978-1-4244-7559-9. DOI 10.1109/SC.2010.28. Acceptance rate 19.8% (50/253). 90 citations.
  9. Christian Engelmann, Hong H. Ong, and Stephen L. Scott. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems. In Proceedings of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009, pages 189-194, Innsbruck, Austria, February 16-18, 2009. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-784-0. 90 citations.
  10. Nathan DeBardeleben, James Laros, John T. Daly, Stephen L. Scott, Christian Engelmann, and Bill Harrod. High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path-Forward for Research and Development. White paper submitted to the U.S. National Science Foundation's High-end Computing Program, December, 2009. 70 citations.