Coordinated checkpointing is a widely-used checkpoint/restart protocol for fault-tolerance in large-scale HPC systems. However, this protocol will involve massive amounts of I/O concentration, resulting in considerably high checkpoint overhead and high energy consumption. This paper focuses on speculative checkpointing, a CPR mechanism that allows for temporal distribution of checkpointings to avoid I/O concentration. We propose execution time and energy models for speculative checkpointing, and investigate energy-performance characteristics when speculative checkpointing is adopted in exascale systems. Using these models, we study the benefit of speculative checkpointing over coordinated checkpointing under various realistic scenarios for exascale HPC systems. We show that, compared to coordinated checkpointing, speculative checkpointing can achieve up to a 11% energy reduction at the cost of a relatively-small increase in the execution time. In addition, a significant energy-performance trade-off is expected when the system scale exceeds 1.2 million nodes.
Muhammad ALFIAN AMRIZAL
Tohoku University
Atsuya UNO
RIKEN
Yukinori SATO
Tokyo Institute of Technology
Hiroyuki TAKIZAWA
Tohoku University
Hiroaki KOBAYASHI
Tohoku University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Copy
Muhammad ALFIAN AMRIZAL, Atsuya UNO, Yukinori SATO, Hiroyuki TAKIZAWA, Hiroaki KOBAYASHI, "Energy-Performance Modeling of Speculative Checkpointing for Exascale Systems" in IEICE TRANSACTIONS on Information,
vol. E100-D, no. 12, pp. 2749-2760, December 2017, doi: 10.1587/transinf.2017PAP0002.
Abstract: Coordinated checkpointing is a widely-used checkpoint/restart protocol for fault-tolerance in large-scale HPC systems. However, this protocol will involve massive amounts of I/O concentration, resulting in considerably high checkpoint overhead and high energy consumption. This paper focuses on speculative checkpointing, a CPR mechanism that allows for temporal distribution of checkpointings to avoid I/O concentration. We propose execution time and energy models for speculative checkpointing, and investigate energy-performance characteristics when speculative checkpointing is adopted in exascale systems. Using these models, we study the benefit of speculative checkpointing over coordinated checkpointing under various realistic scenarios for exascale HPC systems. We show that, compared to coordinated checkpointing, speculative checkpointing can achieve up to a 11% energy reduction at the cost of a relatively-small increase in the execution time. In addition, a significant energy-performance trade-off is expected when the system scale exceeds 1.2 million nodes.
URL: https://globals.ieice.org/en_transactions/information/10.1587/transinf.2017PAP0002/_p
Copy
@ARTICLE{e100-d_12_2749,
author={Muhammad ALFIAN AMRIZAL, Atsuya UNO, Yukinori SATO, Hiroyuki TAKIZAWA, Hiroaki KOBAYASHI, },
journal={IEICE TRANSACTIONS on Information},
title={Energy-Performance Modeling of Speculative Checkpointing for Exascale Systems},
year={2017},
volume={E100-D},
number={12},
pages={2749-2760},
abstract={Coordinated checkpointing is a widely-used checkpoint/restart protocol for fault-tolerance in large-scale HPC systems. However, this protocol will involve massive amounts of I/O concentration, resulting in considerably high checkpoint overhead and high energy consumption. This paper focuses on speculative checkpointing, a CPR mechanism that allows for temporal distribution of checkpointings to avoid I/O concentration. We propose execution time and energy models for speculative checkpointing, and investigate energy-performance characteristics when speculative checkpointing is adopted in exascale systems. Using these models, we study the benefit of speculative checkpointing over coordinated checkpointing under various realistic scenarios for exascale HPC systems. We show that, compared to coordinated checkpointing, speculative checkpointing can achieve up to a 11% energy reduction at the cost of a relatively-small increase in the execution time. In addition, a significant energy-performance trade-off is expected when the system scale exceeds 1.2 million nodes.},
keywords={},
doi={10.1587/transinf.2017PAP0002},
ISSN={1745-1361},
month={December},}
Copy
TY - JOUR
TI - Energy-Performance Modeling of Speculative Checkpointing for Exascale Systems
T2 - IEICE TRANSACTIONS on Information
SP - 2749
EP - 2760
AU - Muhammad ALFIAN AMRIZAL
AU - Atsuya UNO
AU - Yukinori SATO
AU - Hiroyuki TAKIZAWA
AU - Hiroaki KOBAYASHI
PY - 2017
DO - 10.1587/transinf.2017PAP0002
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E100-D
IS - 12
JA - IEICE TRANSACTIONS on Information
Y1 - December 2017
AB - Coordinated checkpointing is a widely-used checkpoint/restart protocol for fault-tolerance in large-scale HPC systems. However, this protocol will involve massive amounts of I/O concentration, resulting in considerably high checkpoint overhead and high energy consumption. This paper focuses on speculative checkpointing, a CPR mechanism that allows for temporal distribution of checkpointings to avoid I/O concentration. We propose execution time and energy models for speculative checkpointing, and investigate energy-performance characteristics when speculative checkpointing is adopted in exascale systems. Using these models, we study the benefit of speculative checkpointing over coordinated checkpointing under various realistic scenarios for exascale HPC systems. We show that, compared to coordinated checkpointing, speculative checkpointing can achieve up to a 11% energy reduction at the cost of a relatively-small increase in the execution time. In addition, a significant energy-performance trade-off is expected when the system scale exceeds 1.2 million nodes.
ER -