Dr. Christian Engelmann is an R&D Staff Scientist in the Computer Science Research Group at Oak Ridge National Laboratory, which is the US Department of Energy’s (DOE) largest multiprogram science and technology laboratory with an annual budget of $1.4 billion. He has 15 years of experience in software research and development for extreme-scale high-performance computing (HPC) systems with a strong funding and publication record. In collaboration with other laboratories and universities, Dr. Engelmann’s research solves computer science challenges in HPC software, such as scalability, dependability, energy efficiency, and portability. His primary expertise is in HPC resilience, i.e., providing efficiency and correctness in the presence of faults, errors, and failures through avoidance, masking, and recovery. Dr. Engelmann is a leading expert in HPC resilience and a member of the DOE Technical Council on HPC Resilience. He received the 2015 DOE Early Career Award for research in resilience design patterns for extreme-scale HPC. His secondary expertise is in lightweight simulation of future-generation extreme-scale supercomputers with millions of processors, studying the impact of hardware and software properties on the key HPC system design factors: performance, resilience, and power consumption. Dr. Engelmann is a member of the Association for Computing Machinery (ACM), the Institute of Electrical and Electronics Engineers (IEEE), and the Advanced Computing Systems Association (USENIX).
e-Mail: email@example.com / firstname.lastname@example.org
Mail: P.O. Box 2008, Oak Ridge, TN 37831-6016, USA
Phone: +1 (865) 574-3132
Fax: +1 (865) 576-5491
In the News
- ASCR Discovery – Mounting a charge. Early-career awardees attack exascale computing on two fronts: power and resilience.
- HPC Wire – Tackling Power and Resilience at Exascale.
| 12 Research grants ($29.4M, 4 as lead-PI) | 9 Peer-reviewed journal articles | 44 Invited talks and seminars |
| 8 Co-advised Master theses | 43 Peer-reviewed conference papers | 118 Committees at 39 conference series |
| 4 Mentored summer faculty | 32 Peer-reviewed workshop papers | 46 Journal article and book proposal reviews |
| 10 Direct reports over the past 10 years | 8 Peer-reviewed conference posters | 11 Conference booth exhibitions |
| Erdős number of 5 | 1900 Total publication citations | H-index of 22 / G-index of 37 |
| Awards: 2015 US Department of Energy Early Career Research Award |
Ongoing Research Activities
2015-…: The Resilience Design Patterns project will increase the ability of scientific applications to reach accurate solutions in a timely and efficient manner. Using a novel design pattern concept, it identifies and evaluates repeatedly occurring resilience problems and coordinates solutions throughout high-performance computing hardware and software … more.
2015-…: The Characterizing Faults, Errors, and Failures in Extreme-Scale Systems (Catalog) project identifies, categorizes and models the fault, error and failure properties of US Department of Energy high-performance computing (HPC) systems. It develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current systems and extrapolate this knowledge to exascale HPC systems … more
2013-…: Hobbes – Operating system and runtime (OS/R) environment for extreme-scale scientific computing based on the Kitten OS and Palacios virtual machine monitor, including high-value, high-risk research exploring virtualization, analytics, networking, energy/power, scheduling/parallelism, architecture, resilience, programming models, and tools. … more
2013-…: Resilient Monte Carlo solvers with natural fault tolerance to hard and soft failures for exascale high-performance computing (HPC) systems … more
2016-06-01: Chaired the System Software Session at the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) 2016: 6th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS) 2016, Kyoto, Japan.
2016-05-31: Presentation of the co-authored research paper, Adding Fault Tolerance to NPB Benchmarks Using ULFM, at the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) 2016: 6th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2016, Kyoto, Japan.
2016-05-24: Presentation by Leonardo Bautista-Gomez of the co-authored research paper, Reducing Waste in Large Scale Systems Through Introspective Analysis, at the 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016, Chicago, IL, USA.
2016-04-14: Invited talk on The Missing High-Performance Computing Fault Model at the 17th SIAM Conference on Parallel Processing for Scientific Computing (PP) 2016, Paris, France.
2016-02-18: Seminar on Resilience Challenges and Solutions for Extreme-Scale Supercomputing at the United States Naval Academy, Annapolis, MD, USA.
2016-02-16: Presentation by Thomas Naughton of the co-authored research paper, Supporting the Development of Soft-Error Resilient Message Passing Applications using Simulation, at the 13th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2016, Innsbruck, Austria.
2015-11-19: With 261,632 NVIDIA K20x accelerator cores and 298,592 AMD Opteron cores, ORNL’s Titan Cray XK7 supercomputer continues to be ranked 2nd in the Top 500 List of supercomputers with a LINPACK performance of 17.95 PFlops and a power consumption of 8.2 MW.
2015-11-18: Technical program chair at the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), held in conjunction with the 28th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC), Austin, TX, USA.
2015-09-23: Presentation by Amogh Katti of the co-authored research paper, Scalable and Fault Tolerant Failure Detection and Consensus, at the 22nd European MPI Users' Group Meeting (EuroMPI) 2015, Bordeaux, France.
2015-08-24: Keynote talk on Toward A Fault Model And Resilience Design Patterns For Extreme Scale Systems at the 8th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, held in conjunction with the 21st European Conference on Parallel and Distributed Computing (Euro-Par) 2015, Vienna, Austria.
2015-06-03: Selected for a financial award by the US Department of Energy’s Office of Science under the fiscal year 2015 Resilience for Extreme Scale Supercomputing Systems Program to pursue computer science research in Catalog: Characterizing Faults, Errors, and Failures in Extreme-Scale Systems.
2015-05-06: Selected for a financial award by the US Department of Energy’s Office of Science under the fiscal year 2015 Early Career Research Program to pursue computer science research in Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale.
Important Peer-reviewed Journal Publications
- Christian Engelmann and Thomas Naughton. A New Deadlock Resolution Protocol and Message Matching Algorithm for the Extreme-scale Simulator. Concurrency and Computation: Practice and Experience, 2016. John Wiley & Sons, Inc. ISSN 1532-0634. To appear.
- Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A. Chien, Paul Coteus, Nathan A. Debardeleben, Pedro Diniz, Christian Engelmann, Mattan Erez, Saverio Fazzari, Al Geist, Rinku Gupta, Fred Johnson, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications (IJHPCA), volume 28, number 2, pages 127-171, 2014. SAGE Publications. ISSN 1094-3420.
- Christian Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. Future Generation Computer Systems (FGCS), volume 30, number 0, pages 59-65, 2014. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0167-739X.
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Journal of Parallel and Distributed Computing (JPDC), volume 72, number 2, pages 254-267, 2012. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0743-7315.
- Xubin (Ben) He, Li Ou, Christian Engelmann, Xin Chen, and Stephen L. Scott. Symmetric Active/Active Metadata Service for High Availability Parallel File Systems. Journal of Parallel and Distributed Computing (JPDC), volume 69, number 12, pages 961-973, 2009. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0743-7315.
- Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active High Availability for High-Performance Computing System Services. Journal of Computers (JCP), volume 1, number 8, pages 43-54, 2006. Academy Publisher, Oulu, Finland. ISSN 1796-203X.
Important Peer-reviewed Conference Publications
- David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Mini-Ckpts: Surviving OS Failures in Persistent Memory. In Proceedings of the 30th ACM International Conference on Supercomputing (ICS) 2016, Istanbul, Turkey, June 1-3, 2016. ACM Press, New York, NY, USA. Acceptance rate 24.2% (43/178). To appear.
- Kun Tang, Devesh Tiwari, Saurabh Gupta, Ping Huang, QiQi Lu, Christian Engelmann, and Xubin He. Power-aware Checkpointing: Toward the Optimal Checkpointing Interval under Power Capping. In Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2016, Toulouse, France, June 28 – July 1, 2016. IEEE Computer Society, Los Alamitos, CA, USA. Acceptance rate 22.4% (58/259). To appear.
- Leonardo Bautista-Gomez, Ana Gainaru, Swann Perarnau, Devesh Tiwari, Saurabh Gupta, Franck Cappello, Christian Engelmann, and Marc Snir. Reducing Waste in Large Scale Systems Through Introspective Analysis. In Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016, Chicago, IL, USA, May 23-27, 2016. IEEE Computer Society, Los Alamitos, CA, USA. Acceptance rate 23.0% (114/496). To appear.
- David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012, pages 78:1-78:12, Salt Lake City, UT, USA, November 10-16, 2012. ACM Press, New York, NY, USA. ISBN 978-1-4673-0804-5. Acceptance rate 21.2% (100/472).
- James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS) 2012, pages 615-626, Macau, SAR, China, June 18-21, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4685-8. ISSN 1063-6927. Acceptance rate 13% (71/515).
- Chao Wang, Sudharshan S. Vazhkudai, Xiaosong Ma, Fei Meng, Youngjae Kim, and Christian Engelmann. NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines. In Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2012, pages 957-968, Shanghai, China, May 21-25, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4675-9. Acceptance rate 21% (118/569).
- Swen Böhm and Christian Engelmann. xSim: The Extreme-Scale Simulator. In Proceedings of the International Conference on High Performance Computing and Simulation (HPCS) 2011, pages 280-286, Istanbul, Turkey, July 4-8, 2011. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-61284-383-4. Acceptance rate 28.1% (48/171).
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Hybrid Checkpointing for MPI Jobs in HPC Environments. In Proceedings of the 16th IEEE International Conference on Parallel and Distributed Systems (ICPADS) 2010, pages 524-533, Shanghai, China, December 8-10, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4307-9. Acceptance rate 29.6% (77/188).
- Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim, Christian Engelmann, and Galen Shipman. Functional Partitioning to Optimize End-to-End Performance on Many-Core Architectures. In Proceedings of the 23rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2010, pages 1-12, New Orleans, LA, USA, November 13-19, 2010. ACM Press, New York, NY, USA. ISBN 978-1-4244-7559-9. Acceptance rate 19.8% (50/253).