Namyoon WOO Hyungsoo JUNG Heon Young YEOM Taesoon PARK Hyungwoo PARK
Fault tolerance is an essential feature of distributed systems, where the probability of failure increases as the system grows. Despite extensive research over two decades, fault-tolerant systems have not succeeded in practical use, owing to the high overhead and unwieldiness of previous systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to bridge the gap between the theory and practice of fault-tolerant systems and to provide a checkpointing-recovery system for grids. To build a fault-tolerant version of MPICH, we have designed task migration, dynamic process management, and atomic message transfer. MPICH-GF requires no modification of application source code and affects the MPICH communication characteristics as little as possible. Its distinguishing features are that it supports the direct message transfer mode and that the entire implementation resides at the lower layer, that is, the abstract device level. We have evaluated MPICH-GF using NPB applications on Globus middleware.
This paper presents a fault-tolerance scheme based on mobile agents for reliable mobile computing systems. The mobility of an agent makes it well suited to tracking mobile hosts, and its intelligence makes it efficient at supporting fault-tolerance services. We present two approaches to implementing a mobile-agent-based fault-tolerant service; their performance is evaluated and compared with that of other fault-tolerant schemes.
Fault-tolerant execution of a mobile agent is an important design issue in building a reliable mobile agent system. Several fault-tolerant schemes have been proposed for single-agent systems; however, there has been little research on multi-agent systems. For cooperating mobile agents, fault-tolerant schemes should consider inter-agent dependency as well as mobility, and should try to localize the effect of a failure. In this paper, we investigate the properties of inter-agent dependency and agent mobility, and then characterize the rollback propagation caused by them. We then suggest some schemes to localize rollback propagation.
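As a rough illustration of rollback propagation (not the schemes proposed in the paper), the sketch below models inter-agent dependency as a directed graph. An edge from agent A to agent B is assumed to mean that B has received a message from A since A's latest checkpoint, so if A rolls back, B must roll back too. The function name `rollback_set` and the dictionary representation are hypothetical choices made for this example.

```python
from collections import deque

def rollback_set(receivers_of, failed):
    """Return every agent that must roll back when `failed` rolls back.

    `receivers_of` maps an agent to the agents that received a message
    from it after its latest checkpoint (an assumed representation).
    """
    affected = {failed}
    queue = deque([failed])
    while queue:
        agent = queue.popleft()
        # Each receiver depends on state that is about to be undone.
        for receiver in receivers_of.get(agent, ()):
            if receiver not in affected:
                affected.add(receiver)
                queue.append(receiver)
    return affected

# Hypothetical dependency graph: a4 -> a1 -> a2 -> a3
deps = {"a1": ["a2"], "a2": ["a3"], "a4": ["a1"]}
print(sorted(rollback_set(deps, "a1")))
```

In this model, localizing rollback propagation amounts to keeping the reachable set small, for example by checkpointing before a dependency edge is created.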
Byoungjoo LEE Taesoon PARK Heon Y. YEOM Yookun CHO
Causal message logging has many benefits, such as nonblocking message logging and no rollback propagation. In this paper, we consider the problem of recovery in causally logged distributed systems and give a condition for consistent recovery. We then show that, based on the impossibility of consensus, consistent causal recovery cannot be solved in asynchronous systems.
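The general idea of causal message logging can be sketched as follows: instead of blocking on a disk write, each process piggybacks the determinants it knows (records of which message was delivered in which order) on its outgoing messages, so that any process that causally depends on a delivery also holds its determinant. This is a minimal sketch of the general technique, not the recovery condition studied in the paper. The class and field names are assumptions, and the sketch piggybacks all known determinants, whereas practical protocols typically stop once a determinant is sufficiently replicated.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Determinant:
    sender: int
    ssn: int       # send sequence number at the sender
    receiver: int
    rsn: int       # receive sequence number at the receiver

@dataclass
class Process:
    pid: int
    ssn: int = 0
    rsn: int = 0
    known: set = field(default_factory=set)  # determinants held in volatile memory

    def send(self, payload):
        self.ssn += 1
        # Piggyback known determinants; no synchronous disk write is needed.
        return {"src": self.pid, "ssn": self.ssn, "payload": payload,
                "dets": frozenset(self.known)}

    def deliver(self, msg):
        self.rsn += 1
        self.known |= msg["dets"]
        self.known.add(Determinant(msg["src"], msg["ssn"], self.pid, self.rsn))

def determinants_for(failed_pid, survivors):
    """Collect, from surviving processes, the delivery order of the failed one."""
    found = set()
    for p in survivors:
        found |= {d for d in p.known if d.receiver == failed_pid}
    return sorted(found, key=lambda d: d.rsn)

p0, p1, p2 = Process(0), Process(1), Process(2)
p1.deliver(p0.send("x"))   # p1 now depends on p0's message
p2.deliver(p1.send("y"))   # p2 learns the determinant of p1's delivery
print(determinants_for(1, [p0, p2]))
```

If `p1` fails here, `p2` can supply the determinant needed to replay `p1`'s delivery in the original order.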
Fault tolerance is an important design issue in building a reliable mobile computing system. This paper considers checkpointing recovery services for a mobile computing system based on an ad hoc network environment. Because insufficient power and limited storage capacity are potential problems in this new environment, the proposed scheme tries to reduce both the frequency of disk accesses for saving recovery information and the amount of information saved for recovery. A brief simulation study has been performed, and the results show that the proposed scheme has an advantage over existing checkpointing recovery schemes.
Inseon LEE Heon Y. YEOM Taesoon PARK
Distributed database systems require a commit process to preserve the ACID properties of transactions executed across multiple sites. With the advent of main memory database systems, database processing time has been reduced by an order of magnitude, since database access no longer incurs any disk access. However, in distributed main memory database systems, the distributed commit process is still very slow, because disk logging at several sites must precede the transaction commit. In this paper, we re-evaluate various distributed commit protocols and propose a causal commit protocol suited to distributed main memory database systems. To evaluate the performance of the proposed protocol, an extensive simulation study has been performed. The simulation results confirm that the new protocol greatly reduces the time needed to commit distributed transactions without causing any consistency problems.
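To make the cost of disk logging concrete, the sketch below counts the forced (synchronous) log writes in a textbook two-phase commit. It illustrates the baseline that the abstract describes as slow, not the proposed causal commit protocol. The `Site` class and `force_log` method are hypothetical stand-ins, and failures and timeouts are ignored.

```python
class Site:
    def __init__(self, name, vote=True):
        self.name = name
        self.vote = vote
        self.log = []
        self.forced_writes = 0

    def force_log(self, record):
        # Stand-in for a synchronous disk write on the commit path.
        self.log.append(record)
        self.forced_writes += 1

def two_phase_commit(coordinator, participants):
    # Phase 1: each participant forces a record before answering.
    prepared = []
    for p in participants:
        if p.vote:
            p.force_log("prepared")
            prepared.append(p)
        else:
            p.force_log("abort")  # a "no" voter aborts unilaterally
    decision = "commit" if len(prepared) == len(participants) else "abort"
    # Phase 2: the coordinator forces the decision, then prepared sites do.
    coordinator.force_log(decision)
    for p in prepared:
        p.force_log(decision)
    return decision

coord = Site("coordinator")
sites = [Site("s1"), Site("s2")]
outcome = two_phase_commit(coord, sites)
print(outcome, coord.forced_writes, [s.forced_writes for s in sites])
```

Even when the data itself lives entirely in main memory, every committed transaction in this baseline waits for forced writes at each participating site, which is the overhead the abstract targets.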