Energy-Performance Modeling of Speculative Checkpointing for Exascale Systems

Muhammad ALFIAN AMRIZAL, Atsuya UNO, Yukinori SATO, Hiroyuki TAKIZAWA, Hiroaki KOBAYASHI

Summary:

Coordinated checkpointing is a widely used checkpoint/restart (CPR) protocol for fault tolerance in large-scale HPC systems. However, this protocol causes massive I/O concentration, resulting in considerably high checkpoint overhead and high energy consumption. This paper focuses on speculative checkpointing, a CPR mechanism that distributes checkpoints over time to avoid I/O concentration. We propose execution time and energy models for speculative checkpointing and investigate its energy-performance characteristics when it is adopted in exascale systems. Using these models, we study the benefit of speculative checkpointing over coordinated checkpointing under various realistic scenarios for exascale HPC systems. We show that, compared to coordinated checkpointing, speculative checkpointing can achieve up to an 11% energy reduction at the cost of a relatively small increase in execution time. In addition, a significant energy-performance trade-off is expected when the system scale exceeds 1.2 million nodes.

Publication
IEICE TRANSACTIONS on Information and Systems, Vol.E100-D, No.12, pp.2749-2760
Publication Date
2017/12/01
Publicized
2017/07/14
Online ISSN
1745-1361
DOI
10.1587/transinf.2017PAP0002
Type of Manuscript
Special Section PAPER (Special Section on Parallel and Distributed Computing and Networking)
Category
High performance computing

Authors

Muhammad ALFIAN AMRIZAL
  Tohoku University
Atsuya UNO
  RIKEN
Yukinori SATO
  Tokyo Institute of Technology
Hiroyuki TAKIZAWA
  Tohoku University
Hiroaki KOBAYASHI
  Tohoku University
