Publications

August 13th, 2014

Symbols: Abstract, Publication, Presentation, BibTeX Citation, DOI Link

Peer-reviewed Journal Papers

  1. Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A. Chien, Paul Coteus, Nathan A. Debardeleben, Pedro Diniz, Christian Engelmann, Mattan Erez, Saverio Fazzari, Al Geist, Rinku Gupta, Fred Johnson, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications (IJHPCA), volume 28, number 2, pages 127-171, 2014. SAGE Publications. ISSN 1094-3420. Abstract Publication BibTeX Citation DOI Link
  2. Christian Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. Future Generation Computer Systems (FGCS), volume 30, number 0, pages 59-65, 2014. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0167-739X. Abstract Publication BibTeX Citation DOI Link
  3. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Journal of Parallel and Distributed Computing (JPDC), volume 72, number 2, pages 254-267, 2012. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0743-7315. Abstract Publication BibTeX Citation DOI Link
  4. Stephen L. Scott, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, Christian Engelmann, and Hong H. Ong. System-Level Virtualization Research at Oak Ridge National Laboratory. Future Generation Computer Systems (FGCS), volume 26, number 3, pages 304-307, 2010. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0167-739X. Abstract Publication BibTeX Citation DOI Link
  5. Xubin (Ben) He, Li Ou, Christian Engelmann, Xin Chen, and Stephen L. Scott. Symmetric Active/Active Metadata Service for High Availability Parallel File Systems. Journal of Parallel and Distributed Computing (JPDC), volume 69, number 12, pages 961-973, 2009. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0743-7315. Abstract Publication BibTeX Citation DOI Link
  6. Xubin (Ben) He, Li Ou, Martha J. Kosa, Stephen L. Scott, and Christian Engelmann. A Unified Multiple-Level Cache for High Performance Cluster Storage Systems. International Journal of High Performance Computing and Networking (IJHPCN), volume 5, number 1-2, pages 97-109, 2007. Inderscience Publishers, Geneve, Switzerland. ISSN 1740-0562. Abstract Publication BibTeX Citation DOI Link
  7. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active High Availability for High-Performance Computing System Services. Journal of Computers (JCP), volume 1, number 8, pages 43-54, 2006. Academy Publisher, Oulu, Finland. ISSN 1796-203X. Abstract Publication BibTeX Citation DOI Link
  8. Christian Engelmann, Stephen L. Scott, David E. Bernholdt, Narasimha R. Gottumukkala, Chokchai (Box) Leangsuksun, Jyothish Varma, Chao Wang, Frank Mueller, Aniruddha G. Shet, and Ponnuswamy (Saday) Sadayappan. MOLAR: Adaptive Runtime Support for High-End Computing Operating and Runtime Systems. ACM SIGOPS Operating Systems Review (OSR), volume 40, number 2, pages 63-72, 2006. ACM Press, New York, NY, USA. ISSN 0163-5980. Abstract Publication BibTeX Citation DOI Link

Peer-reviewed Conference Papers

  1. Christian Engelmann and Thomas Naughton. Improving the Performance of the Extreme-scale Simulator. In Proceedings of the 18th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2014, Toulouse, France, October 1-3, 2014. IEEE Computer Society, Los Alamitos, CA, USA. To appear. Abstract BibTeX Citation
  2. Thomas Naughton, Christian Engelmann, Geoffroy Vallée, and Swen Böhm. Supporting the Development of Resilient Message Passing Applications using Simulation. In Proceedings of the 22nd Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2014, pages 271-278, Turin, Italy, February 12-14, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1066-6192. Acceptance rate 32.6% (73/224). Abstract Publication Presentation BibTeX Citation DOI Link
  3. Geoffroy Vallée, Thomas Naughton, Swen Böhm, and Christian Engelmann. A Runtime Environment for Supporting Research in Resilient HPC System Software & Tools. In Proceedings of the 1st International Symposium on Computing and Networking – Across Practical Development and Theoretical Research – (CANDAR) 2013, pages 213-219, Matsuyama, Japan, December 4-6, 2013. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4799-2795-1. Acceptance rate 35.8% (28/78). Abstract Publication Presentation BibTeX Citation DOI Link
  4. Christian Engelmann. Investigating Operating System Noise in Extreme-Scale High-Performance Computing Systems using Simulation. In Proceedings of the 11th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2013, Innsbruck, Austria, February 11-13, 2013. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-943-1. Abstract Publication Presentation BibTeX Citation
  5. David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012, pages 78:1-78:12, Salt Lake City, UT, USA, November 10-16, 2012. ACM Press, New York, NY, USA. ISBN 978-1-4673-0804-5. Acceptance rate 21.2% (100/472). Abstract Publication Presentation BibTeX Citation
  6. James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS) 2012, pages 615-626, Macau, SAR, China, June 18-21, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4685-8. ISSN 1063-6927. Acceptance rate 13.8% (71/515). Abstract Publication Presentation BibTeX Citation DOI Link
  7. Chao Wang, Sudharshan S. Vazhkudai, Xiaosong Ma, Fei Meng, Youngjae Kim, and Christian Engelmann. NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines. In Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2012, pages 957-968, Shanghai, China, May 21-25, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4675-9. Acceptance rate 20.7% (118/569). Abstract Publication Presentation BibTeX Citation DOI Link
  8. Swen Böhm and Christian Engelmann. File I/O for MPI Applications in Redundant Execution Scenarios. In Proceedings of the 20th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2012, pages 112-119, Garching, Germany, February 15-17, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4633-9. ISSN 1066-6192. Abstract Publication Presentation BibTeX Citation DOI Link
  9. Swen Böhm and Christian Engelmann. xSim: The Extreme-Scale Simulator. In Proceedings of the International Conference on High Performance Computing and Simulation (HPCS) 2011, pages 280-286, Istanbul, Turkey, July 4-8, 2011. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-61284-383-4. Acceptance rate 28.1% (48/171). Abstract Publication Presentation BibTeX Citation DOI Link
  10. Christian Engelmann and Swen Böhm. Redundant Execution of HPC Applications with MR-MPI. In Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2011, pages 31-38, Innsbruck, Austria, February 15-17, 2011. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-864-9. Abstract Publication Presentation BibTeX Citation DOI Link
  11. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Hybrid Checkpointing for MPI Jobs in HPC Environments. In Proceedings of the 16th IEEE International Conference on Parallel and Distributed Systems (ICPADS) 2010, pages 524-533, Shanghai, China, December 8-10, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4307-9. Acceptance rate 29.6% (77/188). Abstract Publication Presentation BibTeX Citation DOI Link
  12. Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim, Christian Engelmann, and Galen Shipman. Functional Partitioning to Optimize End-to-End Performance on Many-Core Architectures. In Proceedings of the 23rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2010, pages 1-12, New Orleans, LA, USA, November 13-19, 2010. ACM Press, New York, NY, USA. ISBN 978-1-4244-7559-9. Acceptance rate 19.8% (50/253). Abstract Publication Presentation BibTeX Citation DOI Link
  13. Swen Böhm, Christian Engelmann, and Stephen L. Scott. Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments. In Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications (HPCC) 2010, pages 72-78, Melbourne, Australia, September 1-3, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4214-0. Acceptance rate 19.1% (58/304). Abstract Publication Presentation BibTeX Citation DOI Link
  14. Antonina Litvinova, Christian Engelmann, and Stephen L. Scott. A Proactive Fault Tolerance Framework for High-Performance Computing. In Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2010, Innsbruck, Austria, February 16-18, 2010. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-783-3. Abstract Publication Presentation BibTeX Citation DOI Link
  15. Narate Taerat, Nichamon Naksinehaboon, Clayton Chandler, James Elliott, Chokchai (Box) Leangsuksun, George Ostrouchov, Stephen L. Scott, and Christian Engelmann. Blue Gene/L Log Analysis and Time to Interrupt Estimation. In Proceedings of the 4th International Conference on Availability, Reliability and Security (ARES) 2009, pages 173-180, Fukuoka, Japan, March 16-19, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-3572-2. Acceptance rate 25.0% (40/160). Abstract Publication BibTeX Citation DOI Link
  16. Christian Engelmann, Hong H. Ong, and Stephen L. Scott. Evaluating the Shared Root File System Approach for Diskless High-Performance Computing Systems. In Proceedings of the 10th LCI International Conference on High-Performance Clustered Computing (LCI) 2009, Boulder, CO, USA, March 9-12, 2009. Abstract Publication Presentation BibTeX Citation
  17. Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, and Stephen L. Scott. Proactive Fault Tolerance Using Preemptive Migration. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, pages 252-257, Weimar, Germany, February 18-20, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3544-9. ISSN 1066-6192. Acceptance rate 42.0% (58/138). Abstract Publication Presentation BibTeX Citation DOI Link
  18. Alessandro Valentini, Christian Di Biagio, Fabrizio Batino, Guido Pennella, Fabrizio Palma, and Christian Engelmann. High Performance Computing with Harness over InfiniBand. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, pages 151-154, Weimar, Germany, February 18-20, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3544-9. ISSN 1066-6192. Acceptance rate 42.0% (58/138). Abstract Publication BibTeX Citation DOI Link
  19. Christian Engelmann, Hong H. Ong, and Stephen L. Scott. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems. In Proceedings of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009, pages 189-194, Innsbruck, Austria, February 16-18, 2009. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-784-0. Abstract Publication Presentation BibTeX Citation DOI Link
  20. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of the 21st IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008, pages 1-12, Austin, TX, USA, November 15-21, 2008. ACM Press, New York, NY, USA. ISBN 978-1-4244-2835-9. Acceptance rate 21.3% (59/277). Abstract Publication Presentation BibTeX Citation DOI Link
  21. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active Replication for Dependent Services. In Proceedings of the 3rd International Conference on Availability, Reliability and Security (ARES) 2008, pages 260-267, Barcelona, Spain, March 4-7, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3102-1. Acceptance rate 21.1% (40/190). Abstract Publication Presentation BibTeX Citation DOI Link
  22. Geoffroy R. Vallée, Kulathep Charoenpornwattana, Christian Engelmann, Anand Tikotekar, Chokchai (Box) Leangsuksun, Thomas Naughton, and Stephen L. Scott. A Framework For Proactive Fault Tolerance. In Proceedings of the 3rd International Conference on Availability, Reliability and Security (ARES) 2008, pages 659-664, Barcelona, Spain, March 4-7, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3102-1. Acceptance rate 21.1% (40/190). Abstract Publication Presentation BibTeX Citation DOI Link
  23. Björn Könning, Christian Engelmann, Stephen L. Scott, and George A. (Al) Geist. Virtualized Environments for the Harness High Performance Computing Workbench. In Proceedings of the 16th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2008, pages 133-140, Toulouse, France, February 13-15, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3089-5. Acceptance rate 40% (83/207). Abstract Publication Presentation BibTeX Citation DOI Link
  24. Geoffroy R. Vallée, Thomas Naughton, Christian Engelmann, Hong H. Ong, and Stephen L. Scott. System-level Virtualization for High Performance Computing. In Proceedings of the 16th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2008, pages 636-643, Toulouse, France, February 13-15, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3089-5. Acceptance rate 40% (83/207). Abstract Publication Presentation BibTeX Citation DOI Link
  25. Li Ou, Christian Engelmann, Xubin (Ben) He, Xin Chen, and Stephen L. Scott. Symmetric Active/Active Metadata Service for Highly Available Cluster Storage Systems. In Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS) 2007, Cambridge, MA, USA, November 19-21, 2007. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-703-1. Acceptance rate 49%. Abstract Publication Presentation BibTeX Citation DOI Link
  26. Emanuele Di Saverio, Marco Cesati, Christian Di Biagio, Guido Pennella, and Christian Engelmann. Distributed Real-Time Computing with Harness. In Lecture Notes in Computer Science: Proceedings of the 14th European PVM/MPI Users' Group Meeting (EuroPVM/MPI) 2007, pages 281-288, Paris, France, September 30 – October 3, 2007. Springer Verlag, Berlin, Germany. ISBN 978-3-540-75415-2. ISSN 0302-9743. Abstract Publication Presentation BibTeX Citation DOI Link
  27. Li Ou, Xubin (Ben) He, Christian Engelmann, and Stephen L. Scott. A Fast Delivery Protocol for Total Order Broadcasting. In Proceedings of the 16th IEEE International Conference on Computer Communications and Networks (ICCCN) 2007, pages 730-734, Honolulu, HI, USA, August 13-16, 2007. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-42441-251-8. ISSN 1095-2055. Acceptance rate 29.1% (160/550). Abstract Publication Presentation BibTeX Citation DOI Link
  28. Arun B. Nagarajan, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Fault Tolerance for HPC with Xen Virtualization. In Proceedings of the 21st ACM International Conference on Supercomputing (ICS) 2007, pages 23-32, Seattle, WA, USA, June 16-20, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1. Acceptance rate 23.6% (29/123). Most cited paper with 178 citations. Abstract Publication Presentation BibTeX Citation DOI Link
  29. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. On Programming Models for Service-Level High Availability. In Proceedings of the 2nd International Conference on Availability, Reliability and Security (ARES) 2007, pages 999-1006, Vienna, Austria, April 10-13, 2007. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2775-2. Acceptance rate 28.3% (60/212). Abstract Publication Presentation BibTeX Citation DOI Link
  30. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2007, pages 1-10, Long Beach, CA, USA, March 26-30, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1. Acceptance rate 26% (109/419). Abstract Publication Presentation BibTeX Citation DOI Link
  31. Kai Uhlemann, Christian Engelmann, and Stephen L. Scott. JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management. In Proceedings of the 8th IEEE International Conference on Cluster Computing (Cluster) 2006, pages 1-10, Barcelona, Spain, September 25-28, 2006. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 1-4244-0328-6. ISSN 1552-5244. Acceptance rate 33.1% (42/127). Abstract Publication Presentation BibTeX Citation DOI Link
  32. Ronald Baumann, Christian Engelmann, and George A. (Al) Geist. A Parallel Plug-in Programming Paradigm. In Lecture Notes in Computer Science: Proceedings of the 7th International Conference on High Performance Computing and Communications (HPCC) 2006, pages 823-832, Munich, Germany, September 13-15, 2006. Springer Verlag, Berlin, Germany. ISBN 978-3-540-39368-9. ISSN 0302-9743. Abstract Publication Presentation BibTeX Citation DOI Link
  33. Jyothish Varma, Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems. In Proceedings of the 20th ACM International Conference on Supercomputing (ICS) 2006, pages 219-228, Cairns, Australia, June 28-30, 2006. ACM Press, New York, NY, USA. ISBN 1-59593-282-8. Acceptance rate 26.2% (37/141). Abstract Publication Presentation BibTeX Citation DOI Link
  34. Daniel I. Okunbor, Christian Engelmann, and Stephen L. Scott. Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems. In Proceedings of the 2nd International Conference on Computer Science and Information Systems 2006, Athens, Greece, June 19-21, 2006. Abstract Publication BibTeX Citation
  35. Kshitij Limaye, Chokchai (Box) Leangsuksun, Zeno Greenwood, Stephen L. Scott, Christian Engelmann, Richard M. Libby, and Kasidit Chanchio. Job-Site Level Fault Tolerance for Cluster and Grid Environments. In Proceedings of the 7th IEEE International Conference on Cluster Computing (Cluster) 2005, pages 1-9, Boston, MA, USA, September 26-30, 2005. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7803-9486-0. ISSN 1552-5244. Acceptance rate 39.6% (45/138). Abstract Publication BibTeX Citation DOI Link
  36. Hertong Song, Chokchai (Box) Leangsuksun, Raja Nassar, Yudan Liu, Christian Engelmann, and Stephen L. Scott. UML-based Beowulf Cluster Availability Modeling. In International Conference on Software Engineering Research and Practice (SERP) 2005, pages 161-167, Las Vegas, NV, USA, June 27-30, 2005. CSREA Press. ISBN 1-932415-49-1. BibTeX Citation
  37. Christian Engelmann and George A. (Al) Geist. Super-Scalable Algorithms for Computing on 100,000 Processors. In Lecture Notes in Computer Science: Proceedings of the 5th International Conference on Computational Science (ICCS) 2005, Part I, pages 313-320, Atlanta, GA, USA, May 22-25, 2005. Springer Verlag, Berlin, Germany. ISBN 978-3-540-26032-5. ISSN 0302-9743. Acceptance rate 35%. Abstract Publication Presentation BibTeX Citation DOI Link

Peer-reviewed Workshop Papers

  1. Thomas Naughton, Garry Smith, Christian Engelmann, Geoffroy Vallée, Ferrol Aderholdt, and Stephen L. Scott. What is the right balance for performance and isolation with virtualization in HPC? In Lecture Notes in Computer Science: Proceedings of the 20th European Conference on Parallel and Distributed Computing (Euro-Par) 2014 Workshops: 7th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, Porto, Portugal, August 25, 2014. Springer Verlag, Berlin, Germany. To appear. BibTeX Citation
  2. Christian Engelmann and Thomas Naughton. Toward a Performance/Resilience Tool for Hardware/Software Co-Design of High-Performance Computing Systems. In Proceedings of the 42nd International Conference on Parallel Processing (ICPP) 2013: 4th International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), pages 962-971, Lyon, France, October 2, 2013. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-5117-3. ISSN 0190-3918. Abstract Publication Presentation BibTeX Citation DOI Link
  3. Mahesh Lagadapati, Frank Mueller, and Christian Engelmann. Tools for Simulation and Benchmark Generation at Exascale. In Lecture Notes in Computer Science: Proceedings of the 7th Parallel Tools Workshop, Dresden, Germany, September 3-4, 2013. Springer Verlag, Berlin, Germany. To appear. Abstract Publication Presentation BibTeX Citation
  4. Thomas Naughton, Swen Böhm, Christian Engelmann, and Geoffroy Vallée. Using Performance Tools to Support Experiments in HPC Resilience. In Lecture Notes in Computer Science: Proceedings of the 19th European Conference on Parallel and Distributed Computing (Euro-Par) 2013 Workshops: 6th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 727-736, Aachen, Germany, August 26, 2013. Springer Verlag, Berlin, Germany. ISBN 978-3-642-54419-4. ISSN 0302-9743. Abstract Publication Presentation BibTeX Citation DOI Link
  5. Ian S. Jones and Christian Engelmann. Simulation of Large-Scale HPC Architectures. In Proceedings of the 40th International Conference on Parallel Processing (ICPP) 2011: 2nd International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), pages 447-456, Taipei, Taiwan, September 13-19, 2011. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4511-0. ISSN 1530-2016. Abstract Publication Presentation BibTeX Citation DOI Link
  6. David Fiala, Kurt Ferreira, Frank Mueller, and Christian Engelmann. A Tunable, Software-based DRAM Error Detection and Correction Library for HPC. In Lecture Notes in Computer Science: Proceedings of the 17th European Conference on Parallel and Distributed Computing (Euro-Par) 2011 Workshops, Part II: 4th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 251-261, Bordeaux, France, August 29 – September 2, 2011. Springer Verlag, Berlin, Germany. ISBN 978-3-642-29740-3. Acceptance rate 60.0% (12/20). Abstract Publication BibTeX Citation DOI Link
  7. Thomas Naughton, Geoffroy R. Vallée, Christian Engelmann, and Stephen L. Scott. A Case for Virtual Machine based Fault Injection in a High-Performance Computing Environment. In Lecture Notes in Computer Science: Proceedings of the 17th European Conference on Parallel and Distributed Computing (Euro-Par) 2011: 5th Workshop on System-level Virtualization for High Performance Computing (HPCVirt), pages 234-243, Bordeaux, France, August 29 – September 2, 2011. Springer Verlag, Berlin, Germany. ISBN 978-3-642-29737. Abstract Publication Presentation BibTeX Citation DOI Link
  8. Christian Engelmann and Frank Lauer. Facilitating Co-Design for Extreme-Scale Systems Through Lightweight Simulation. In Proceedings of the 12th IEEE International Conference on Cluster Computing (Cluster) 2010: 1st Workshop on Application/Architecture Co-design for Extreme-scale Computing (AACEC), pages 1-8, Hersonissos, Crete, Greece, September 20-24, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-8395-2. Abstract Publication Presentation BibTeX Citation DOI Link
  9. George Ostrouchov, Thomas Naughton, Christian Engelmann, Geoffroy R. Vallée, and Stephen L. Scott. Nonparametric Multivariate Anomaly Analysis in Support of HPC Resilience. In Proceedings of the 5th IEEE International Conference on e-Science (e-Science) 2009: Workshop on Computational Science, pages 80-85, Oxford, UK, December 9-11, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-5946-9. Abstract Publication Presentation BibTeX Citation DOI Link
  10. Thomas Naughton, Wesley Bland, Geoffroy R. Vallée, Christian Engelmann, and Stephen L. Scott. Fault Injection Framework for System Resilience Evaluation – Fake Faults for Finding Future Failures. In Proceedings of the 18th International Symposium on High Performance Distributed Computing (HPDC) 2009: 2nd Workshop on Resiliency in High Performance Computing (Resilience) 2009, pages 23-28, Munich, Germany, June 9, 2009. ACM Press, New York, NY, USA. ISBN 978-1-60558-587-1. Abstract Publication Presentation BibTeX Citation DOI Link
  11. Anand Tikotekar, Hong H. Ong, Sadaf Alam, Geoffroy R. Vallée, Thomas Naughton, Christian Engelmann, and Stephen L. Scott. Performance Comparison of Two Virtual Machine Scenarios Using an HPC Application – A Case study Using Molecular Dynamics Simulations. In Proceedings of the 3rd Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2009, in conjunction with the 4th ACM SIGOPS European Conference on Computer Systems (EuroSys) 2009, pages 33-40, Nuremberg, Germany, March 30, 2009. ACM Press, New York, NY, USA. ISBN 978-1-60558-465-2. Abstract Publication Presentation BibTeX Citation DOI Link
  12. Geoffroy R. Vallée, Thomas Naughton, Hong H. Ong, Anand Tikotekar, Christian Engelmann, Wesley Bland, Ferrol Aderholdt, and Stephen L. Scott. Virtual System Environments. In Communications in Computer and Information Science: Proceedings of the 2nd DMTF Academic Alliance Workshop on Systems and Virtualization Management: Standards and New Technologies (SVM) 2008, pages 72-83, Munich, Germany, October 21-22, 2008. Springer Verlag, Berlin, Germany. ISBN 978-3-540-88707-2. ISSN 1865-0929. Abstract Publication BibTeX Citation DOI Link
  13. Anand Tikotekar, Geoffroy Vallée, Thomas Naughton, Hong H. Ong, Christian Engelmann, and Stephen L. Scott. An Analysis of HPC Benchmark Applications in Virtual Machine Environments. In Lecture Notes in Computer Science: Proceedings of the 14th European Conference on Parallel and Distributed Computing (Euro-Par) 2008: 3rd Workshop on Virtualization in High-Performance Cluster and Grid Computing (VHPC) 2008, pages 63-71, Las Palmas de Gran Canaria, Spain, August 26-29, 2008. Springer Verlag, Berlin, Germany. ISBN 978-3-642-00954-9. Abstract Publication Presentation BibTeX Citation DOI Link
  14. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active High Availability for High-Performance Computing System Services: Accomplishments and Limitations. In Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid) 2008: Workshop on Resiliency in High Performance Computing (Resilience) 2008, pages 813-818, Lyon, France, May 19-22, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3156-4. Abstract Publication Presentation BibTeX Citation DOI Link
  15. Xin Chen, Benjamin Eckart, Xubin (Ben) He, Christian Engelmann, and Stephen L. Scott. An Online Controller Towards Self-Adaptive File System Availability and Performance. In Proceedings of the 5th High Availability and Performance Workshop (HAPCW) 2008, in conjunction with the 1st High-Performance Computer Science Week (HPCSW) 2008, Denver, CO, USA, April 3-4, 2008. Abstract Publication Presentation BibTeX Citation
  16. Anand Tikotekar, Geoffroy Vallée, Thomas Naughton, Hong H. Ong, Christian Engelmann, Stephen L. Scott, and Anthony M. Filippi. Effects of Virtualization on a Scientific Application – Running a Hyperspectral Radiative Transfer Code on Virtual Machines. In Proceedings of the 2nd Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2008, in conjunction with the 3rd ACM SIGOPS European Conference on Computer Systems (EuroSys) 2008, pages 16-23, Glasgow, UK, March 31, 2008. ACM Press, New York, NY, USA. ISBN 978-1-60558-120-0. Abstract Publication Presentation BibTeX Citation DOI Link
  17. Christian Engelmann, Hong H. Ong, and Stephen L. Scott. Middleware in Modern High Performance Computing System Architectures. In Lecture Notes in Computer Science: Proceedings of the 7th International Conference on Computational Science (ICCS) 2007, Part II: 4th Special Session on Collaborative and Cooperative Environments (CCE) 2007, pages 784-791, Beijing, China, May 27-30, 2007. Springer Verlag, Berlin, Germany. ISBN 3-5407-2585-5. ISSN 0302-9743. Abstract Publication Presentation BibTeX Citation DOI Link
  18. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Transparent Symmetric Active/Active Replication for Service-Level High Availability. In Proceedings of the 7th IEEE International Symposium on Cluster Computing and the Grid (CCGrid) 2007: 7th International Workshop on Global and Peer-to-Peer Computing (GP2PC) 2007, pages 755-760, Rio de Janeiro, Brazil, May 14-17, 2007. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2833-3. Abstract Publication Presentation BibTeX Citation DOI Link
  19. Christian Engelmann, Stephen L. Scott, Hong H. Ong, Geoffroy R. Vallée, and Thomas Naughton. Configurable Virtualized System Environments for High Performance Computing. In Proceedings of the 1st Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2007, in conjunction with the 2nd ACM SIGOPS European Conference on Computer Systems (EuroSys) 2007, Lisbon, Portugal, March 20, 2007. Abstract Publication Presentation BibTeX Citation
  20. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Towards High Availability for High-Performance Computing System Services: Accomplishments and Limitations. In Proceedings of the 4th High Availability and Performance Workshop (HAPCW) 2006, in conjunction with the 7th Los Alamos Computer Science Institute (LACSI) Symposium 2006, Santa Fe, NM, USA, October 17, 2006. Abstract Publication Presentation BibTeX Citation
  21. Li Ou, Xin Chen, Xubin (Ben) He, Christian Engelmann, and Stephen L. Scott. Achieving Computational I/O Efficiency in a High Performance Cluster Using Multicore Processors. In Proceedings of the 4th High Availability and Performance Workshop (HAPCW) 2006, in conjunction with the 7th Los Alamos Computer Science Institute (LACSI) Symposium 2006, Santa Fe, NM, USA, October 17, 2006. Abstract Publication Presentation BibTeX Citation
  22. Christian Engelmann and George A. (Al) Geist. RMIX: A Dynamic, Heterogeneous, Reconfigurable Communication Framework. In Lecture Notes in Computer Science: Proceedings of the 6th International Conference on Computational Science (ICCS) 2006, Part II: 3rd Special Session on Collaborative and Cooperative Environments (CCE) 2006, pages 573-580, Reading, UK, May 28-31, 2006. Springer Verlag, Berlin, Germany. ISBN 3-540-34381-4. ISSN 0302-9743. Abstract Publication Presentation BibTeX Citation DOI Link
  23. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Active/Active Replication for Highly Available HPC System Services. In Proceedings of the 1st International Conference on Availability, Reliability and Security (ARES) 2006: 1st International Workshop on Frontiers in Availability, Reliability and Security (FARES) 2006, pages 639-645, Vienna, Austria, April 20-22, 2006. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2567-9. Abstract Publication Presentation BibTeX Citation DOI Link
  24. Christian Engelmann and Stephen L. Scott. Concepts for High Availability in Scientific High-End Computing. In Proceedings of the 3rd High Availability and Performance Workshop (HAPCW) 2005, in conjunction with the 6th Los Alamos Computer Science Institute (LACSI) Symposium 2005, Santa Fe, NM, USA, October 11, 2005. Abstract Publication Presentation BibTeX Citation
  25. Christian Engelmann and Stephen L. Scott. High Availability for Ultra-Scale High-End Scientific Computing. In Proceedings of the 2nd International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2) 2005, in conjunction with the 19th ACM International Conference on Supercomputing (ICS) 2005, Cambridge, MA, USA, June 19, 2005. Abstract Publication Presentation BibTeX Citation
  26. Chokchai (Box) Leangsuksun, Venkata K. Munganuru, Tong Liu, Stephen L. Scott, and Christian Engelmann. Asymmetric Active-Active High Availability for High-end Computing. In Proceedings of the 2nd International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2) 2005, in conjunction with the 19th ACM International Conference on Supercomputing (ICS) 2005, Cambridge, MA, USA, June 19, 2005. Abstract Publication Presentation BibTeX Citation
  27. Christian Engelmann and George A. (Al) Geist. A Lightweight Kernel for the Harness Metacomputing Framework. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2005: 14th Heterogeneous Computing Workshop (HCW) 2005, Denver, CO, USA, April 4, 2005. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2312-9. ISSN 1530-2075. Abstract Publication Presentation BibTeX Citation DOI Link
  28. Christian Engelmann, Stephen L. Scott, and George A. (Al) Geist. High Availability through Distributed Control. In Proceedings of the 2nd High Availability and Performance Workshop (HAPCW) 2004, in conjunction with the 5th Los Alamos Computer Science Institute (LACSI) Symposium 2004, Santa Fe, NM, USA, October 12, 2004. Abstract Publication Presentation BibTeX Citation
  29. Xubin (Ben) He, Li Ou, Stephen L. Scott, and Christian Engelmann. A Highly Available Cluster Storage System using Scavenging. In Proceedings of the 2nd High Availability and Performance Workshop (HAPCW) 2004, in conjunction with the 5th Los Alamos Computer Science Institute (LACSI) Symposium 2004, Santa Fe, NM, USA, October 12, 2004. Abstract Publication Presentation BibTeX Citation
  30. Christian Engelmann and George A. (Al) Geist. A Diskless Checkpointing Algorithm for Super-scale Architectures Applied to the Fast Fourier Transform. In Proceedings of the Challenges of Large Applications in Distributed Environments Workshop (CLADE) 2003, in conjunction with the 12th IEEE International Symposium on High Performance Distributed Computing (HPDC) 2003, page 47, Seattle, WA, USA, June 21, 2003. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-1984-9. Abstract Publication Presentation BibTeX Citation DOI Link
  31. Christian Engelmann, Stephen L. Scott, and George A. (Al) Geist. Distributed Peer-to-Peer Control in Harness. In Lecture Notes in Computer Science: Proceedings of the 2nd International Conference on Computational Science (ICCS) 2002, Part II: Workshop on Global and Collaborative Computing, pages 720-727, Amsterdam, The Netherlands, April 21-24, 2002. Springer Verlag, Berlin, Germany. ISBN 3-540-43593-X. ISSN 0302-9743. Abstract Publication Presentation BibTeX Citation DOI Link

Peer-reviewed Conference Posters

  1. David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, and Kurt Ferreira. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. Poster at the 24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011, Seattle, WA, USA, November 12-18, 2011. Abstract BibTeX Citation
  2. David Fiala, Kurt Ferreira, Frank Mueller, and Christian Engelmann. A Tunable, Software-based DRAM Error Detection and Correction Library for HPC. Poster at the 24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011, Seattle, WA, USA, November 12-18, 2011. Abstract BibTeX Citation
  3. Stephen L. Scott, Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, and Jyothish Varma. A Tunable Holistic Resiliency Approach for High-Performance Computing Systems. Poster at the National HPC Workshop on Resilience 2009, Arlington, VA, USA, August 12-14, 2009. Abstract Publication BibTeX Citation
  4. Stephen L. Scott, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, Christian Engelmann, and Hong H. Ong. System-level Virtualization for High-Performance Computing. Poster at the National HPC Workshop on Resilience 2009, Arlington, VA, USA, August 12-14, 2009. Abstract Publication BibTeX Citation
  5. Stephen L. Scott, Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, and Jyothish Varma. A Tunable Holistic Resiliency Approach for High-Performance Computing Systems. Poster at the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) 2009, Raleigh, NC, USA, February 14-18, 2009. Abstract Publication BibTeX Citation
  6. George A. (Al) Geist, Christian Engelmann, Jack J. Dongarra, George Bosilca, Magdalena M. Sławińska, and Jarosław K. Sławiński. The Harness Workbench: Unified and Adaptive Access to Diverse High-Performance Computing Platforms. Poster at the 1st High-Performance Computer Science Week (HPCSW) 2008, Denver, CO, USA, March 30 – April 5, 2008. Abstract Publication BibTeX Citation
  7. Stephen L. Scott, Christian Engelmann, Hong H. Ong, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, Jyothish Varma, Xubin (Ben) He, Li Ou, and Xin Chen. Resiliency for High-Performance Computing Systems. Poster at the 1st High-Performance Computer Science Week (HPCSW) 2008, Denver, CO, USA, March 30 – April 5, 2008. Abstract Publication BibTeX Citation
  8. Stephen L. Scott, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, Christian Engelmann, and Hong H. Ong. System-level Virtualization for High-Performance Computing. Poster at the 1st High-Performance Computer Science Week (HPCSW) 2008, Denver, CO, USA, March 30 – April 5, 2008. Abstract Publication BibTeX Citation

Whitepapers

  1. Christian Engelmann and Thomas Naughton. A Hardware/Software Performance/Resilience/Power Co-Design Tool for Extreme-scale Computing. Whitepaper submitted to the U.S. Department of Energy's Workshop on Modeling & Simulation of Exascale Systems & Applications (ModSim) 2013, September 18-19, 2013. Abstract Publication Presentation BibTeX Citation
  2. Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi, Pavan Balaji, Bill Carlson, Andrew A. Chien, Pedro Diniz, Christian Engelmann, Rinku Gupta, Fred Johnson, Jim Belak, Pradip Bose, Franck Cappello, Paul Coteus, Nathan A. Debardeleben, Mattan Erez, Saverio Fazzari, Al Geist, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. Addressing Failures in Exascale Computing. Workshop report, August 4-11, 2013. Publication BibTeX Citation
  3. Al Geist, Bob Lucas, Marc Snir, Shekhar Borkar, Eric Roman, Mootaz Elnozahy, Bert Still, Andrew Chien, Robert Clay, John Wu, Christian Engelmann, Nathan DeBardeleben, Rob Ross, Larry Kaplan, Martin Schulz, Mike Heroux, Sriram Krishnamoorthy, Lucy Nowell, Abhinav Vishnu, and Lee-Ann Talley. U.S. Department of Energy Fault Management Workshop. Workshop report submitted to the U.S. Department of Energy, June 6, 2012. Abstract Publication BibTeX Citation
  4. Christian Engelmann and Thomas Naughton. A Performance/Resilience/Power Co-design Tool for Extreme-scale High-Performance Computing. Whitepaper submitted to the U.S. Department of Energy's Workshop on Modeling & Simulation of Exascale Systems & Applications (ModSim) 2012, August 9-10, 2012. Abstract Publication BibTeX Citation
  5. Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, and Frank Mueller. Dynamic Self-Aware Runtime Software for Exascale Systems. Whitepaper submitted to the U.S. Department of Energy's Exascale Operating Systems and Runtime Technical Council, July, 2012. Abstract Publication Presentation BibTeX Citation
  6. Nathan DeBardeleben, James Laros, John T. Daly, Stephen L. Scott, Christian Engelmann, and Bill Harrod. High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path-Forward for Research and Development. Whitepaper submitted to the U.S. National Science Foundation's High-end Computing Program, December, 2009. Publication BibTeX Citation

Technical Reports

  1. David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. Technical Report, ORNL/TM-2012/227, Oak Ridge National Laboratory, Oak Ridge, TN, USA, June, 2012. Abstract Publication BibTeX Citation
  2. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Hybrid Full/Incremental Checkpoint/Restart for MPI Jobs in HPC Environments. Technical Report, ORNL/TM-2010/162, Oak Ridge National Laboratory, Oak Ridge, TN, USA, August, 2010. Abstract Publication BibTeX Citation
  3. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Technical Report, ORNL/TM-2010/161, Oak Ridge National Laboratory, Oak Ridge, TN, USA, August, 2010. Abstract Publication BibTeX Citation

Talks and Lectures

  1. Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the Technical University of Dresden, Dresden, Germany, September 3, 2013. Abstract Presentation BibTeX Citation
  2. Christian Engelmann. Fault Tolerance Session. Invited talk at the ExaChallenge Symposium, Dublin, Ireland, October 16-17, 2012. Presentation BibTeX Citation
  3. Christian Engelmann. High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path Forward for Research and Development. Invited talk at the Argonne National Laboratory (ANL) Institute of Computing in Science (ICiS) Summer Workshop Week on Addressing Failures in Exascale Computing, Park City, UT, USA, August 4-11, 2012. Abstract Presentation BibTeX Citation
  4. Christian Engelmann. Resilience for Permanent, Transient, and Undetected Errors. Invited talk at the 16th Workshop on Distributed Supercomputing (SOS) 2012, Santa Barbara, CA, USA, March 12-15, 2012. Abstract Presentation BibTeX Citation
  5. Christian Engelmann. Scaling To A Million Cores And Beyond: A Basic Understanding Of The Challenges Ahead On The Road To Exascale. Invited talk at the 1st International Workshop on Extreme Scale Parallel Architectures and Systems (ESPAS) 2012, in conjunction with the 7th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) 2012, Paris, France, January 24, 2012. Abstract Presentation BibTeX Citation
  6. Christian Engelmann. Resilient Software for ExaScale Computing. Invited talk at the Birds of a Feather Session on Resilient Software for ExaScale Computing at the 24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011, Seattle, WA, USA, November 17, 2011. Abstract Presentation BibTeX Citation
  7. Christian Engelmann. Resilience and Hardware/Software Co-design for Extreme-Scale Supercomputing. Seminar at the Barcelona Supercomputing Center, Barcelona, Spain, July 27, 2011. Abstract Presentation BibTeX Citation
  8. Christian Engelmann. Scalable HPC System Monitoring. Invited talk at the 3rd HPC Resiliency Summit: Workshop on Resiliency for Petascale HPC 2010, in conjunction with the 3rd Los Alamos Computer Science Symposium (LACSS) 2010, Santa Fe, NM, USA, October 13, 2010. Abstract Presentation BibTeX Citation
  9. Christian Engelmann. Beyond Application-Level Checkpoint/Restart – Advanced Software Approaches for Fault Resilience. Talk at the 39th SPEEDUP Workshop on High Performance Computing, Zurich, Switzerland, September 6, 2010. Presentation BibTeX Citation
  10. Christian Engelmann and Stephen L. Scott. Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond. Talk at the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS) Workshop, in conjunction with the USENIX Federated Conferences Week (USENIX) 2010, Boston, MA, USA, June 22, 2010. Abstract Presentation BibTeX Citation
  11. Christian Engelmann. Resilience Challenges at the Exascale. Talk at the 14th Workshop on Distributed Supercomputing (SOS) 2010, Savannah, GA, USA, March 8-11, 2010. Abstract Presentation BibTeX Citation
  12. Christian Engelmann and Stephen L. Scott. HPC System Software Research at Oak Ridge National Laboratory. Seminar at the Leibniz Rechenzentrum (LRZ), Garching, Germany, February 22, 2010. Abstract Presentation BibTeX Citation
  13. Christian Engelmann. High-Performance Computing Research Internship and Appointment Opportunities at Oak Ridge National Laboratory. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, December 14, 2009. Abstract Presentation BibTeX Citation
  14. Christian Engelmann. JCAS – IAA Simulation Efforts at Oak Ridge National Laboratory. Invited talk at the IAA Workshop on HPC Architectural Simulation (HPCAS), Boulder, CO, USA, September 1-2, 2009. Presentation BibTeX Citation
  15. Christian Engelmann. Modeling Techniques Towards Resilience. Invited talk at the National HPC Workshop on Resilience 2009, Arlington, VA, USA, August 12-14, 2009. Presentation BibTeX Citation
  16. Christian Engelmann. System Resilience Research at ORNL in the Context of HPC. Invited talk at the Institut National de Recherche en Informatique et en Automatique (INRIA), Rennes, France, May 15, 2009. Abstract Presentation BibTeX Citation
  17. Christian Engelmann. High-Performance Computing Research and MSc Internship Opportunities at Oak Ridge National Laboratory. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, May 11, 2009. Abstract Presentation BibTeX Citation
  18. Christian Engelmann. Modular Redundancy for Soft-Error Resilience in Large-Scale HPC Systems. Invited talk at the Dagstuhl Seminar on Fault Tolerance in High-Performance Computing and Grids, Schloss Dagstuhl, Wadern, Germany, May 3-8, 2009. Abstract Presentation BibTeX Citation
  19. Christian Engelmann. Proactive Fault Tolerance Using Preemptive Migration. Invited talk at the 3rd Collaborative and Grid Computing Technologies Workshop (CGCTW) 2009, Cancun, Mexico, April 22-24, 2009. Abstract Presentation BibTeX Citation
  20. Christian Engelmann. Resiliency. Panel at the 13th Workshop on Distributed Supercomputing (SOS) 2009, Hilton Head, SC, USA, March 9-12, 2009. BibTeX Citation
  21. Christian Engelmann. High-Performance Computing Research at Oak Ridge National Laboratory. Invited talk at the Reading Annual Computational Science Workshop, Reading, United Kingdom, December 8, 2008. Abstract Presentation BibTeX Citation
  22. Christian Engelmann. Modular Redundancy in HPC Systems: Why, Where, When and How? Invited talk at the 1st HPC Resiliency Summit: Workshop on Resiliency for Petascale HPC 2008, in conjunction with the 1st Los Alamos Computer Science Symposium (LACSS) 2008, Santa Fe, NM, USA, October 15, 2008. Abstract Presentation BibTeX Citation
  23. Christian Engelmann. Resiliency for High-Performance Computing. Invited talk at the 2nd Collaborative and Grid Computing Technologies Workshop (CGCTW) 2008, Cancun, Mexico, April 10-12, 2008. Abstract Presentation BibTeX Citation
  24. Christian Engelmann. Advanced Fault Tolerance Solutions for High Performance Computing. Seminar at the Laboratoire d’Analyse et d’Architecture des Systèmes, Centre National de la Recherche Scientifique, Toulouse, France, February 11, 2008. Abstract Presentation BibTeX Citation
  25. Christian Engelmann. Service-Level High Availability in Parallel and Distributed Systems. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, October 10, 2007. Abstract Presentation BibTeX Citation
  26. Christian Engelmann. Advanced Fault Tolerance Solutions for High Performance Computing. Invited talk at the Workshop on Trends, Technologies and Collaborative Opportunities in High Performance and Grid Computing (WTTC) 2007, Khon Kaen, Thailand, June 8, 2007. Abstract Presentation BibTeX Citation
  27. Christian Engelmann. Advanced Fault Tolerance Solutions for High Performance Computing. Invited talk at the Workshop on Trends, Technologies and Collaborative Opportunities in High Performance and Grid Computing (WTTC) 2007, Khon Kaen, Thailand, June 4-5, 2007. Abstract Presentation BibTeX Citation
  28. Christian Engelmann. Operating System Research at ORNL: System-level Virtualization. Seminar at the Institute of Graphics and Parallel Processing, Johannes Kepler University, Linz, Austria, April 10, 2007. Abstract Presentation BibTeX Citation
  29. Christian Engelmann. Towards High Availability for High-Performance Computing System Services: Accomplishments and Limitations. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, March 14, 2007. Abstract Presentation BibTeX Citation
  30. Christian Engelmann. High Availability for Ultra-Scale High-End Scientific Computing. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, June 9, 2006. Abstract Presentation BibTeX Citation
  31. Stephen L. Scott and Christian Engelmann. Advancing Reliability, Availability and Serviceability for High-Performance Computing. Seminar at the Institute of Graphics and Parallel Processing, Johannes Kepler University, Linz, Austria, April 19, 2006. Abstract Presentation BibTeX Citation
  32. Christian Engelmann. High Availability for Ultra-Scale High-End Scientific Computing. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, October 18, 2005. Abstract Presentation BibTeX Citation
  33. Christian Engelmann. High Availability for Ultra-Scale High-End Scientific Computing. Seminar at the Department of Mathematics and Computer Science, Fayetteville State University, Fayetteville, NC, USA, September 26, 2005. Abstract Presentation BibTeX Citation
  34. Christian Engelmann. High Availability for Ultra-Scale High-End Scientific Computing. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, May 13, 2005. Abstract Presentation BibTeX Citation
  35. Christian Engelmann. High Availability for Ultra-Scale High-End Scientific Computing. Seminar at the Center for Entrepreneurship and Information Technology, Louisiana Tech University, Ruston, LA, USA, April 15, 2005. Abstract Presentation BibTeX Citation
  36. Christian Engelmann. Diskless Checkpointing on Super-scale Architectures – Applied to the Fast Fourier Transform. Invited talk at the 11th SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP) 2004, San Francisco, CA, USA, February 25, 2004. Abstract Presentation BibTeX Citation
  37. Christian Engelmann. Super-scalable Algorithms – Next Generation Supercomputing on 100,000 and more Processors. Seminar at the Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA, January 29, 2004. Abstract Presentation BibTeX Citation
  38. Christian Engelmann. Distributed Peer-to-Peer Control for Harness. Seminar at the Department of Computer Science, North Carolina State University, Raleigh, NC, USA, February 11, 2004. Abstract Presentation BibTeX Citation

Co-advised Theses

  1. Ian S. Jones. Simulation of Large Scale Architectures on High Performance Computers. Master’s thesis, Department of Computer Science, University of Reading, UK, October 22, 2010. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville). Abstract Publication Presentation BibTeX Citation
  2. Swen Böhm. Development of a RAS Framework for HPC Environments: Realtime Data Reduction of Monitoring Data. Master’s thesis, Department of Computer Science, University of Reading, UK, March 12, 2010. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville). Abstract Publication Presentation BibTeX Citation
  3. Frank Lauer. Simulation of Advanced Large-Scale HPC Architectures. Master’s thesis, Department of Computer Science, University of Reading, UK, March 12, 2010. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville). Abstract Publication Presentation BibTeX Citation
  4. Antonina Litvinova. RAS Framework Engine Prototype. Master’s thesis, Department of Computer Science, University of Reading, UK, September 22, 2009. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville). Abstract Publication Presentation BibTeX Citation
  5. Björn Könning. Virtualized Environments for the Harness Workbench. Master’s thesis, Department of Computer Science, University of Reading, UK, March 14, 2007. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory). Abstract Publication Presentation BibTeX Citation
  6. Matthias Weber. High Availability for the Lustre File System. Master’s thesis, Department of Computer Science, University of Reading, UK, March 14, 2007. Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the Department of Engineering I, Technical College for Engineering and Economics (FHTW) Berlin, Germany. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory). Abstract Publication Presentation BibTeX Citation
  7. Ronald Baumann. Design and Development of Prototype Components for the Harness High-Performance Computing Workbench. Master’s thesis, Department of Computer Science, University of Reading, UK, March 6, 2006. Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the Department of Engineering I, Technical College for Engineering and Economics (FHTW) Berlin, Germany. Advisors: Prof. Vassil N. Alexandrov (University of Reading); George A. (Al) Geist and Christian Engelmann (Oak Ridge National Laboratory). Abstract Publication Presentation BibTeX Citation
  8. Kai Uhlemann. High Availability for High-End Scientific Computing. Master’s thesis, Department of Computer Science, University of Reading, UK, March 6, 2006. Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the Department of Engineering I, Technical College for Engineering and Economics (FHTW) Berlin, Germany. Advisors: Prof. Vassil N. Alexandrov (University of Reading); George A. (Al) Geist and Christian Engelmann (Oak Ridge National Laboratory). Abstract Publication Presentation BibTeX Citation

Theses

  1. Christian Engelmann. Symmetric Active/Active High Availability for High-Performance Computing System Services. PhD thesis, Department of Computer Science, University of Reading, UK, 2008. Thesis research performed at Oak Ridge National Laboratory. Advisor: Prof. Vassil N. Alexandrov (University of Reading). Abstract Publication Presentation BibTeX Citation
  2. Christian Engelmann. Distributed Peer-to-Peer Control for Harness. Master’s thesis, Department of Computer Science, University of Reading, UK, July 7, 2001. Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the Department of Engineering I, Technical College for Engineering and Economics (FHTW) Berlin, Germany. Advisors: Prof. Vassil N. Alexandrov (University of Reading); George A. (Al) Geist (Oak Ridge National Laboratory). Abstract Publication Presentation BibTeX Citation
  3. Christian Engelmann. Distributed Peer-to-Peer Control for Harness. Master’s thesis, Department of Engineering I, Technical College for Engineering and Economics (FHTW) Berlin, Germany, February 23, 2001. Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the Department of Computer Science, University of Reading, UK. Advisors: Prof. Uwe Metzler (Technical College for Engineering and Economics (FHTW) Berlin); George A. (Al) Geist (Oak Ridge National Laboratory). Abstract Publication Presentation BibTeX Citation
