IEICE TRANSACTIONS on Information and Systems

  • Impact Factor

    0.59

  • Eigenfactor

    0.002

  • Article Influence

    0.1

  • CiteScore

    1.4

Volume E87-D No.7  (Publication Date:2004/07/01)

    Special Section on Hardware/Software Support for High Performance Scientific and Engineering Computing
  • FOREWORD

    Minyi GUO  Laurence T. YANG  

     
    FOREWORD

      Page(s):
    1615-1616
  • Utilization of the On-Chip L2 Cache Area in CC-NUMA Multiprocessors for Applications with a Small Working Set

    Sung Woo CHUNG  Hyong-Shik KIM  Chu Shik JHON  

     
    PAPER-Networking and System Architectures

      Page(s):
    1617-1624

    In CC-NUMA multiprocessor systems, it is important to reduce the remote memory access time. Based upon the observation that increasing the size of an LRU second-level (L2) cache beyond a certain value does not reduce the cache miss rate significantly, in this paper we propose two split L2 caches to utilize the surplus of the L2 cache. The split L2 caches are composed of a traditional LRU cache and another cache that reduces the remote memory access time. Both work together to reduce the total L2 cache miss time by keeping remote (or long-distance) blocks as well as recently used blocks. For the second cache, we propose two alternatives: an L2-RVC (Level 2 - Remote Victim Cache) and an L2-DAVC (Level 2 - Distance-Aware Victim Cache). The proposed split L2 caches reduce total execution time by up to 27%. It is also found that they outperform a traditional single LRU cache of double the size.
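
    A minimal sketch of the victim-cache idea behind this abstract (not the authors' actual design; the direct-mapped organization, latencies, and names below are illustrative assumptions): on an L2 miss to a remote block, an on-chip structure holding recently evicted remote blocks is probed before paying the full remote-memory latency.

```c
#include <stdbool.h>
#include <stdio.h>

#define RVC_ENTRIES 256   /* assumed size; direct-mapped for brevity */

typedef struct { unsigned long tag; bool valid; } RvcEntry;
static RvcEntry rvc[RVC_ENTRIES];

/* Called when the LRU L2 partition evicts a block that maps to remote memory. */
static void rvc_on_l2_eviction(unsigned long block_addr) {
    RvcEntry *e = &rvc[block_addr % RVC_ENTRIES];
    e->tag = block_addr;
    e->valid = true;
}

/* Called on an L2 miss to a remote block; returns an assumed latency in cycles. */
static int l2_miss_latency(unsigned long block_addr) {
    RvcEntry *e = &rvc[block_addr % RVC_ENTRIES];
    if (e->valid && e->tag == block_addr)
        return 20;    /* served by the on-chip remote victim cache */
    return 300;       /* full remote memory access                 */
}

int main(void) {
    rvc_on_l2_eviction(0x1234);                       /* remote block evicted from L2 */
    printf("%d cycles\n", l2_miss_latency(0x1234));   /* later miss hits the RVC: 20  */
    printf("%d cycles\n", l2_miss_latency(0x5678));   /* unseen block: 300            */
    return 0;
}
```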

  • Optimal Multicast Tree Routing for Cluster Computing in Hypercube Interconnection Networks

    Weijia JIA  Bo HAN  Pui On AU  Yong HE  Wanlei ZHOU  

     
    PAPER-Networking and System Architectures

      Page(s):
    1625-1632

    Cluster computing has been used in applications that demand performance, reliability, and availability, such as cluster server groups, large-scale scientific computations, distributed databases, distributed media-on-demand servers, and search engines. In these applications, multicast can play a vital role in information dissemination among groups of servers and users. This paper proposes a set of novel, efficient, fault-tolerant multicast routing algorithms for hypercube interconnections of cluster computers using a shared multicast tree approach. We present new algorithms for selecting an optimal core (root) and constructing the shared tree so as to minimize the average delay of multicast messages. Simulation results indicate that our algorithms are efficient in terms of short average end-to-end delay, load balance, and low resource utilization over hypercube cluster interconnection networks.
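
    As a small illustration of the core-selection idea (one plausible cost model assumed for illustration, not the paper's algorithm): in an n-dimensional hypercube the hop count between two nodes equals the Hamming distance of their labels, so a brute-force core choice picks the node with minimum total distance to the multicast members.

```c
#include <stdio.h>

/* Hop distance in a hypercube = Hamming distance of the node labels. */
static int hamming(unsigned a, unsigned b) {
    unsigned x = a ^ b;
    int d = 0;
    while (x) { d += x & 1u; x >>= 1; }
    return d;
}

/* Brute-force core selection: the candidate minimizing total distance
   to all multicast members (illustrative; exponential in the dimension). */
static unsigned select_core(int dim, const unsigned *members, int m) {
    unsigned best = 0;
    long best_cost = -1;
    for (unsigned c = 0; c < (1u << dim); c++) {
        long cost = 0;
        for (int i = 0; i < m; i++) cost += hamming(c, members[i]);
        if (best_cost < 0 || cost < best_cost) { best_cost = cost; best = c; }
    }
    return best;
}

int main(void) {
    unsigned members[] = { 0x1, 0x3, 0x7 };   /* example group in a 3-cube */
    printf("core = %u\n", select_core(3, members, 3));   /* prints core = 3 */
    return 0;
}
```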

  • Robust Transmission of Wavelet Video Sequence over Wireless Communication Channels

    Joo-Kyong LEE  Ki-Dong CHUNG  

     
    PAPER-Networking and System Architectures

      Page(s):
    1633-1640

    Bit errors in a subband of a wavelet-based video frame during network transmission affect not only lower-level subbands within the same frame but also subsequent frames. This is because each video frame is a multi-level wavelet-transformed image that is referenced by later frames. In this paper, we propose a new motion estimation scheme for wavelet-based video called Intra-frame Motion Estimation (IME), in which each subband except the LL subband refers to the subband one level lower in the same orientation within the same frame. This scheme protects video quality by confining the effects of bit errors in all subbands, except the LL subband, within a frame. We evaluated the performance of the proposed scheme in a simulated wireless network environment. The tests show that the proposed IME algorithm performs better than MRME, a motion-compensated video coding scheme for wavelet video, on heavy-motion video sequences, and that IME outperforms MRME at high bit rates on small-motion video sequences.

  • Enhancing ICP with P2P Technology: Cost, Availability, and Reconfiguration

    Ping-Jer YEH  Yu-Chen CHUANG  Shyan-Ming YUAN  

     
    PAPER-Networking and System Architectures

      Page(s):
    1641-1648

    Traditional Web cache servers based on the HTTP and ICP infrastructure tend to have high hardware and management costs, have difficulty with availability and with automatic and dynamic reconfiguration, and may have slow links to some users. We find that peer-to-peer technology can help solve these problems. The peer cache service (PCS) we propose here leverages each peer's local cache, similar access patterns, fully distributed coordination, and fast communication channels to improve response time, the scale of cacheable objects, and availability. Moreover, incorporating goals and strategies such as making the protocol lightweight and mutually compatible with the existing cache infrastructure, supporting mobile devices, undertaking dynamic three-level caching, and exchanging cache meta-information further improves effectiveness and differentiates our work from other P2P Web cache systems that appear similar at first glance.

  • Hierarchical Interconnection Networks Based on (3, 3)-Graphs for Massively Parallel Processors

    Gene Eu JAN  Yuan-Shin HWANG  

     
    PAPER-Networking and System Architectures

      Page(s):
    1649-1656

    This paper proposes several novel hierarchical interconnection networks based on (3, 3)-graphs, namely folded (3, 3)-networks, root-folded (3, 3)-networks, recursively expanded (3, 3)-networks, and flooded (3, 3)-networks. Like hypercubes, CCC, Petersen-based networks, and Heawood-based networks, these hierarchical networks have the following nice properties: regular topology, high scalability, and small diameters. Owing to these important properties, these hierarchical networks have the potential to serve as alternatives for future interconnection structures of multicomputer systems, especially massively parallel processors (MPPs). Furthermore, this paper presents routing and broadcasting algorithms for the proposed networks to demonstrate that these algorithms are as elegant as those for hypercubes, CCC, and Petersen- or Heawood-based networks.

  • Configurable Communication Middleware for Clusters with Multiple Interconnections

    Nader MOHAMED  Jameela AL-JAROODI  Hong JIANG  

     
    PAPER-Networking and System Architectures

      Page(s):
    1657-1665

    High-performance scientific and engineering applications running on clusters have different communication requirements. Current cluster configurations typically provide multiple network interfaces per node and multiple interconnections among nodes. However, transport protocols such as TCP do not utilize the existing multiple network interfaces to enhance communication performance. This paper introduces a new configurable communication model that utilizes multiple interconnections. The model adds mechanisms to manage and enhance the overall communication performance of clusters. These configurations include the use of parallel message transfers, the separation of transfer channels for small and large messages, and load balancing among the channels. The main advantages of the model are: (1) providing a flexible, enhanced network infrastructure, (2) hiding the technical details of the heterogeneous network resources from the applications, and (3) providing an easy and flexible way to extend the network capacity for specific nodes. To illustrate the advantages and performance enhancements of the model, a prototype was implemented to experimentally evaluate cluster network performance, and it showed considerable gains.
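
    A minimal sketch of the channel-separation and striping configuration described above (the threshold, channel count, and function names are illustrative assumptions, not the middleware's API): small messages take a single low-latency channel, while large messages are striped across all channels in parallel.

```c
#include <stdio.h>
#include <stddef.h>

#define NUM_CHANNELS     2
#define SMALL_MSG_LIMIT  4096   /* bytes; illustrative threshold */

/* Decide how a message of `len` bytes would be transferred. */
static void plan_transfer(size_t len) {
    if (len <= SMALL_MSG_LIMIT) {
        printf("%zu B: low-latency channel 0 only\n", len);
    } else {
        size_t chunk = (len + NUM_CHANNELS - 1) / NUM_CHANNELS;
        for (int c = 0; c < NUM_CHANNELS; c++) {
            size_t off = (size_t)c * chunk;
            size_t n = off < len ? (len - off < chunk ? len - off : chunk) : 0;
            printf("%zu B: channel %d carries bytes [%zu, %zu)\n",
                   len, c, off, off + n);
        }
    }
}

int main(void) {
    plan_transfer(512);        /* small message: one channel      */
    plan_transfer(1 << 20);    /* large message: striped transfer */
    return 0;
}
```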

  • Speculative Selection Routing in 2D Torus Network

    Tran CONG SO  Shigeru OYANAGI  Katsuhiro YAMAZAKI  

     
    PAPER-Networking and System Architectures

      Page(s):
    1666-1673

    We have proposed a speculative selection routing (SSR) function for adaptive routing, which uses idle cycles of the network's physical links to exchange network information between nodes, thus helping to make the best selection. A previous study on the mesh network showed that SSR gives message selection flexibility that improves network performance by balancing the network traffic in both the global and local scopes. This paper evaluates the speculative selection function on a 2D torus network by simulation. The simulation compares network throughput and latency under various traffic patterns. Visualization graphs show how speculative selection eliminates hotspots and disperses traffic in the global scope. The simulation results demonstrate that using speculative selection increases network performance by around 7%. Compared to the mesh network, the torus version shows a smaller gain due to the inherently high performance of the torus network.

  • Address Computation in Configurable Parallel Memory Architecture

    Eero AHO  Jarno VANNE  Kimmo KUUSILINNA  Timo D. HAMALAINEN  

     
    PAPER-Networking and System Architectures

      Page(s):
    1674-1681

    Parallel memories increase memory bandwidth with several memory modules working in parallel and can be used to feed a processor with only the necessary data. The Configurable Parallel Memory Architecture (CPMA) enables a multitude of access formats and module assignment functions to be used within a single hardware implementation, which has not been possible in prior embedded parallel memory systems. This paper focuses on address computation in CPMA, which is implemented using several configurable computation units in parallel. One unit is dedicated to each type of access format and module assignment function that the implementation supports. Timing and area estimates are given for a 0.25-micron CMOS process. The utilized resources are shown to be linearly proportional to the number of memory modules.

  • TLB Update-Hint: A Scalable TLB Consistency Algorithm for Cache-Coherent Non-uniform Memory Access Multiprocessors

    Byeonghag SEONG  Donggook KIM  Yangwoo ROH  Kyuho PARK  Daeyeon PARK  

     
    PAPER-Networking and System Architectures

      Page(s):
    1682-1692

    Shared memory multiprocessors in which each processor has its own TLB must manage consistency between the TLBs and the page table. As large-scale CC-NUMA (cache-coherent non-uniform memory access) shared memory multiprocessors become popular, it is important for TLB consistency management algorithms to be highly scalable. In this paper, we propose a TLB update-hint algorithm as a scalable TLB consistency management solution for CC-NUMA multiprocessors. By using a lazy TLB invalidation approach, we reduce the number of unnecessary processor interruptions and the idle-waiting time, and achieve a high level of scalability. Using a shared memory simulator, we evaluated the TLB update-hint algorithm. For performance comparison, we also simulated the TLB shootdown algorithm, one of the most popular TLB consistency algorithms. The simulations demonstrate that the TLB update-hint algorithm scales well in systems with a large number of processors. In 64-node systems, the TLB update-hint algorithm shows 4787% better performance than the TLB shootdown algorithm.

  • Programming Support for MPMD Parallel Computing in ClusterGOP

    Fan CHAN  Jiannong CAO  Alvin T.S. CHAN  Minyi GUO  

     
    PAPER-Software Support and Optimization Techniques

      Page(s):
    1693-1702

    Many parallel applications involve different independent tasks with their own data. Using the MPMD model, programmers can have a modular view and a simplified structure of their parallel programs. Although MPI supports both the SPMD and MPMD programming models, MPI libraries do not provide an efficient way for tasks to communicate under the MPMD model. We have developed a programming environment, called ClusterGOP, for building and developing parallel applications. Based on the graph-oriented programming (GOP) model, ClusterGOP provides higher-level abstractions for message-passing parallel programming, with software tools for developing and running parallel applications. In this paper, we describe how ClusterGOP supports programming of MPMD parallel applications on top of MPI. We discuss the issues of implementing the MPMD model in ClusterGOP using MPI and evaluate the performance using example applications.

  • Traditional File Systems versus DualFS: A Performance Comparison Approach

    Juan PIERNAS  Toni CORTES  Jose M. GARCIA  

     
    PAPER-Software Support and Optimization Techniques

      Page(s):
    1703-1711

    DualFS is a next-generation journaling file system which has the same consistency guarantees as traditional journaling file systems but better performance. This paper introduces three new enhancements which significantly improve DualFS performance during normal operation, and presents experimental results which compare DualFS with other traditional file systems, namely Ext2, Ext3, XFS, JFS, and ReiserFS. The experiments carried out prove, for the first time, that a new file system design based on the separation of data and metadata can significantly improve file system performance without requiring several storage devices.

  • VLaTTe: A Java Just-in-Time Compiler for VLIW with Fast Scheduling and Register Allocation

    Suhyun KIM  Soo-Mook MOON  Kemal EBCIOĞLU  Erik ALTMAN  

     
    PAPER-Software Support and Optimization Techniques

      Page(s):
    1712-1720

    For network computing on desktop machines, fast execution of Java bytecode programs is essential because these machines are expected to run substantial application programs written in Java. We believe higher Java performance can be achieved by exploiting instruction-level parallelism (ILP) in the context of Java JIT compilation. This paper introduces VLaTTe, a Java JIT compiler for VLIW machines that performs efficient scheduling while doing fast register allocation. It is an extended version of our previous JIT compiler for RISC machines, called LaTTe, whose translation overhead is low (consistently one or two seconds for the SPECjvm98 benchmarks) due to its fast register allocation. VLaTTe adds scheduling capability onto the same register allocation framework, with a constraint for precise in-order exception handling that guarantees the same Java exception behavior as the original bytecode program. Our experimental results on the SPECjvm98 benchmarks show that VLaTTe achieves a geometric mean useful IPC of 1.7 (2-ALU), 2.1 (4-ALU), and 2.3 (8-ALU), while the scheduling/allocation overhead is 3.6 times that of LaTTe on average, which appears to be reasonable.

  • Packing/Unpacking Using MPI User-Defined Datatypes for Efficient Data Redistribution

    Sheng-Wen BAI  Chu-Sing YANG  Tsung-Chuan HUANG  

     
    PAPER-Software Support and Optimization Techniques

      Page(s):
    1721-1728

    In many parallel programs, run-time data redistribution is usually required to enhance data locality and reduce remote memory accesses on distributed memory multicomputers. Research on data redistribution algorithms has matured recently, and the time required to generate data sets and processor sets is much less than before. Packing/unpacking has therefore become a relatively high cost in redistribution. In this paper, we present methods to perform BLOCK-CYCLIC(s) to BLOCK-CYCLIC(t) redistribution using MPI user-defined datatypes. These methods reduce the required memory buffers and avoid unnecessary movement of data. Theoretical models are presented to determine the best method for redistribution. The methods were implemented on an IBM SP2 parallel machine to evaluate their performance. The experimental results indicate that this approach clearly improves redistribution in most cases.
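
    A hedged sketch of the general idea of using an MPI user-defined datatype to describe a BLOCK-CYCLIC(s) layout, so that a redistribution message can be sent without an explicit user-level packing buffer (the array size, block size, and the two-rank exchange below are illustrative assumptions, not the paper's algorithm):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int N = 64, s = 4;                /* global length, block size    */
    int nblocks = N / (s * nprocs);         /* blocks owned by each process */

    /* One block of s doubles every s*nprocs elements of the global array
       describes rank 0's BLOCK-CYCLIC(s) slice, which starts at offset 0. */
    MPI_Datatype cyclic;
    MPI_Type_vector(nblocks, s, s * nprocs, MPI_DOUBLE, &cyclic);
    MPI_Type_commit(&cyclic);

    double global[64];
    for (int i = 0; i < N; i++) global[i] = (double)i;

    if (nprocs > 1 && rank == 0) {
        /* The strided elements are gathered by the datatype itself. */
        MPI_Send(global, 1, cyclic, 1, 0, MPI_COMM_WORLD);
    } else if (nprocs > 1 && rank == 1) {
        double recvbuf[64];                 /* receives nblocks*s contiguous doubles */
        MPI_Recv(recvbuf, nblocks * s, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d elements, first = %.0f\n",
               nblocks * s, recvbuf[0]);
    }

    MPI_Type_free(&cyclic);
    MPI_Finalize();
    return 0;
}
```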

  • Proposal of a Tree Load Balancing Algorithm to Grid Computing Environments

    Rodrigo Fernandes de MELLO  Erico C. T. de MATTOS  Luis Carlos TREVELIN  Maria Stela Veludo de PAIVA  Laurence T. YANG  

     
    PAPER-Software Support and Optimization Techniques

      Page(s):
    1729-1736

    The availability of low-cost hardware has spurred the development of distributed systems by making them more and more accessible. In order to optimize resource allocation in distributed systems, several load balancing algorithms have been proposed. These algorithms distribute the application load over the computers in the environment, even out the occupation of the whole environment, and increase application performance. This even distribution prevents some computers from becoming overloaded while others remain idle. This article proposes and analyzes TLBAGrid, a load balancing algorithm for Grid computing environments.

  • Dynamic Code Repositioning for Java

    Shinji TANAKA  Tetsuyasu YAMADA  Satoshi SHIRAISHI  

     
    PAPER-Software Support and Optimization Techniques

      Page(s):
    1737-1742

    The sizes of recent Java-based server-side applications, such as J2EE containers, have been increasing continuously. Past techniques for improving the performance of Java applications have targeted relatively small applications, whose methods, when invoked, are not usually distributed over the entire memory space. As a result, these techniques cannot be applied efficiently to improve the performance of today's large applications. We propose a dynamic code repositioning approach to improve the hit rates of instruction caches and translation look-aside buffers. Profiles of method invocations are collected when the application runs under its heaviest processor load, and the code is repositioned based on these profiles. We also discuss a method-splitting technique to significantly reduce the sizes of methods. Our evaluation of a prototype implementing these techniques indicated a 5% improvement in application throughput.

  • Algorithmic Concept Recognition to Support High Performance Code Reengineering

    Beniamino DI MARTINO  

     
    PAPER-Software Support and Optimization Techniques

      Page(s):
    1743-1750

    Techniques for automatic program recognition at the algorithmic level could be of high interest for the area of software maintenance, in particular for knowledge-based reengineering, because the selection of suitable restructuring strategies is mainly driven by the algorithmic features of the code. This paper presents an automated hierarchical concept parsing recognition technique and a formalism for the specification of algorithmic concepts. Based on this technique, it presents the design and development of ALCOR, a production-rule-based system for the automatic recognition of algorithmic concepts within programs, aimed at supporting knowledge-based reengineering for high performance.

  • A Two-Dimensional Quantum Transport Simulation of Nanoscale Double-Gate MOSFETs Using Parallel Adaptive Technique

    Yiming LI  Shao-Ming YU  

     
    PAPER-Scientific and Engineering Computing with Applications

      Page(s):
    1751-1758

    In this paper we apply a parallel adaptive solution algorithm to simulate nanoscale double-gate metal-oxide-semiconductor field effect transistors (MOSFETs) on a personal computer (PC)-based Linux cluster with the message passing interface (MPI) libraries. Based on a posteriori error estimation, triangular mesh generation, an adaptive finite volume method, the monotone iterative method, and a parallel domain decomposition algorithm, a set of two-dimensional quantum correction hydrodynamic (HD) equations is solved numerically on our cluster system. This parallel adaptive simulation methodology with 1-irregular mesh was successfully developed and applied to deep-submicron semiconductor device simulation in our recent work. A 10 nm n-type double-gate MOSFET is simulated with the developed parallel adaptive simulator. In terms of physical quantities and the refined adaptive mesh, the simulation results demonstrate very good accuracy and computational efficiency. Benchmark results, such as load balancing, speedup, and parallel efficiency, exhibit excellent parallel performance. On a 16-node PC-based Linux cluster, the maximum load difference among CPUs is less than 6%, and a 12.8-times speedup and 80% parallel efficiency are attained across the different simulation cases.
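
    As a quick consistency check of the quoted figures, using the textbook definitions of speedup and parallel efficiency (the definitions are standard ones, not taken from the paper beyond the quoted numbers):

    \[ S_p = \frac{T_1}{T_p} = 12.8, \qquad E_p = \frac{S_p}{p} = \frac{12.8}{16} = 0.80 = 80\% \]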

  • A Parallel Implementation of Multi-Domain High-Order Navier-Stokes Equations Using MPI

    Hui WANG  Minyi GUO  Daming WEI  

     
    PAPER-Scientific and Engineering Computing with Applications

      Page(s):
    1759-1765

    In this paper, Message Passing Interface (MPI) techniques are used to implement the high-order full 3-D Navier-Stokes equations in multi-domain applications. A two-domain interface with five-point overlapping used previously is expanded to a multi-domain computation. There are normally two approaches to this expansion. One is to break the domain into two parts through domain decomposition (say, with one overlap) and then use MPI to further break each domain into n parts. The other is to break the domain into 2n parts with (2n-1) overlaps. In the present effort, the latter approach is used, and finite-size overlap regions are employed to exchange data between adjacent multi-dimensional sub-domains. This is an alternative way to parallelize the high-order full 3-D Navier-Stokes equations for multi-domain applications without adding much complexity. Results with high-order boundary treatments show consistency between the multi-domain calculations and single-domain results.
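
    A hedged sketch of the overlap exchange between adjacent sub-domains, reduced to one dimension for brevity (the overlap width, array sizes, and use of MPI_Sendrecv are illustrative assumptions; the paper works with multi-dimensional sub-domains and a five-point overlap):

```c
#include <mpi.h>
#include <stdio.h>

#define NLOCAL  100     /* interior points per sub-domain         */
#define HALO    5       /* overlap width exchanged with neighbors */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Layout: [left halo | interior (NLOCAL points) | right halo] */
    double u[HALO + NLOCAL + HALO];
    for (int i = 0; i < HALO + NLOCAL + HALO; i++) u[i] = (double)rank;

    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send our last HALO interior points right, receive the left neighbor's
       into our left halo; then the symmetric exchange in the other direction. */
    MPI_Sendrecv(&u[NLOCAL],        HALO, MPI_DOUBLE, right, 0,
                 &u[0],             HALO, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[HALO],          HALO, MPI_DOUBLE, left,  1,
                 &u[HALO + NLOCAL], HALO, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: overlap regions filled\n", rank);
    MPI_Finalize();
    return 0;
}
```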

  • An Acceleration Processor for Data Intensive Scientific Computing

    Cheong Ghil KIM  Hong-Sik KIM  Sungho KANG  Shin Dug KIM  Gunhee HAN  

     
    PAPER-Scientific and Engineering Computing with Applications

      Page(s):
    1766-1773

    Scientific computations for diffusion equations and ANNs (Artificial Neural Networks) are data-intensive tasks accompanied by heavy memory access; on the other hand, their computational complexity is relatively low. Thus, this type of task naturally maps onto SIMD (Single Instruction Multiple Data stream) parallel processing with distributed memory. This paper proposes a high-performance acceleration processor whose architecture is optimized for scientific computing with diffusion equations and ANNs. The proposed architecture includes a customized instruction set and specific hardware resources, consisting of a control unit (CU), 16 processing units (PUs), and a non-linear function unit (NFU) on chip. They are effectively connected with a dedicated ring and global bus structure. Each PU is equipped with an address modifier (AM) and a 16-bit 1.5 k-word local memory (LM). The proposed processor can easily be expanded in multi-chip mode to accommodate large-scale parallel computation. A prototype chip was implemented with an FPGA. The total gate count is about 1 million with 530,432-bit embedded memory cells, and it operates at 15 MHz. The functionality and performance of the proposed processor are verified with a simulation of an oil reservoir problem using diffusion equations and a character recognition application using ANNs. The execution times of the two applications are compared with software realizations on a 1.7 GHz Pentium IV personal computer. Though the proposed processor architecture and instruction set are optimized for diffusion equations and ANNs, they are flexible enough to program many other scientific computation algorithms.

  • A Super-Programming Technique for Large Sparse Matrix Multiplication on PC Clusters

    Dejiang JIN  Sotirios G. ZIAVRAS  

     
    PAPER-Scientific and Engineering Computing with Applications

      Page(s):
    1774-1781

    The multiplication of large sparse matrices is a basic operation in many scientific and engineering applications. There exist high-performance library routines for this operation, often optimized for the target architecture. For a parallel environment, it is essential to partition the entire operation into well-balanced tasks and assign them to individual processing elements. Most existing techniques partition the given matrices based on some kind of workload estimation. For irregular sparse matrices on PC clusters, however, the workloads may not be well estimated in advance, and any approach other than run-time dynamic partitioning may degrade performance. In this paper, we apply our super-programming approach to parallel large-matrix multiplication on PC clusters. In our approach, tasks are partitioned into super-instructions that are dynamically assigned to member computer nodes. Thus, the load balancing logic is separated from the computing logic; the former is taken over by the runtime environment. Our super-programming approach facilitates program development and targets high efficiency in dynamic load balancing. Workloads can be balanced effectively and the optimization overhead is small. The results prove the viability of our approach.

  • Fast Parallel Solution for Set-Packing and Clique Problems by DNA-Based Computing

    Michael (Shan-Hui) HO  Weng-Long CHANG  Minyi GUO  Laurence T. YANG  

     
    PAPER-Scientific and Engineering Computing with Applications

      Page(s):
    1782-1788

    This paper shows how to use stickers to construct the DNA solution space for the library sequences in the set-packing problem and the clique problem. Then, with biological operations, we propose DNA-based algorithms to remove illegal solutions and find legal solutions for the set-packing and clique problems from the sticker-based solution space. Any NP-complete problem in Cook's Theorem can be reduced to, and solved by, the proposed DNA-based computing approach if its size is equal to or less than that of the set-packing problem. Otherwise, Cook's Theorem does not hold for DNA-based computing, and a new DNA algorithm should be developed from the characteristics of that NP-complete problem. Finally, the result of a DNA simulation is given.

  • I/O-Efficient Multilevel Graph Partitioning Algorithm for Massive Graph Data

    Jun-Ho HER  R.S. RAMAKRISHNA  

     
    PAPER-Scientific and Engineering Computing with Applications

      Page(s):
    1789-1794

    Graph data in large scientific/engineering applications are often too massive to fit inside the computer's main memory, and the resulting input/output (I/O) costs can be a major performance bottleneck. This paper proposes an extension to existing multilevel graph partitioning algorithms with improved I/O-efficiency. The input graph is viewed as the union of disjoint blocks (subgraphs) of almost the same size. Each block is coarsened in turn, with recursive matching and contraction as the operations in this phase. All the coarsened blocks are then merged iteratively to ensure that the resulting graph fits in main memory. This graph is then treated with an in-core multilevel graph partitioning algorithm in the usual way. Our experimental results show that the larger the graph, the more the performance depends on I/O-efficiency, and that our modification can easily partition very large graphs. It also exhibits considerable improvement in I/O-complexity.

  • Database Allocation Modeling for Optimal Design of Distributed Systems

    Jae-Woo LEE  Doo-Kwon BAIK  

     
    PAPER-Distributed, Grid and P2P Computing

      Page(s):
    1795-1804

    Distributed database systems offer many advantages, such as lower database management cost, efficiency, and high system integrity, by allocating fragments to many distributed sites through horizontal/vertical fragmentation of the global database schema. To minimize costs, distribution algorithms must be applied so that database fragments are allocated to optimal sites. It is useful to replicate fragments, for example by allocating many copies to many sites, which also supports load balancing. But there are too many possible combinations of sites and fragments to find a solution in real time; i.e., it is an NP-complete problem. This paper proposes near-optimal heuristic algorithms for minimizing cost by defining a cost model based on the read and update queries requested at many sites. Various factors are applied to the proposed algorithms for sizing the network resources that carry database transactions, such as remote query or update requests issued for consistency in replicated database systems. For network load balancing, an incoming network traffic table is defined at each site; using this table, a request transaction from an unallocated site can be served properly at any of the replicated sites. Finally, experimental results verify the proposed algorithms by comparison with actual cases of database allocation.

  • A Distributed 3D Rendering Application for Massive Data Sets

    Huabing ZHU  Tony K.Y. CHAN  Lizhe WANG  Reginald C. JEGATHESE  

     
    PAPER-Distributed, Grid and P2P Computing

      Page(s):
    1805-1812

    This paper presents a prototype of a distributed 3D rendering system in a hierarchical Grid environment. 3D rendering with massive data sets is a computationally intensive task. In order to make full use of computational resources on Grids, a hierarchical system architecture is designed to run over multiple clusters. This architecture involves both sort-first and sort-last parallel rendering algorithms to achieve excellent scalability, rendering performance and load balance.

  • High System Availability Using Neighbor Replication on Grid

    Mustafa MAT DERIS  Noraziah AHMAD  Md. Yazid Mohd SAMAN  Noraida ALI  Youwei YUAN  

     
    PAPER-Distributed, Grid and P2P Computing

      Page(s):
    1813-1819

    Data replication can be used to improve the availability of distributed systems. In such systems, a mechanism is required to maintain the consistency of the replicated data. The quorum-based grid structure technique is one solution that does this while providing high system availability. However, it has been shown that this technique still requires a large number of copies to be available to construct a quorum, so it is not suitable for large systems. In this paper, we propose the neighbor replication on grid (NRG) technique, which places replicated data only on neighboring sites. In comparison to the grid structure technique, NRG requires a lower communication cost per operation while providing higher system availability, which is preferable for large systems.

  • MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes

    Namyoon WOO  Hyungsoo JUNG  Heon Young YEOM  Taesoon PARK  Hyungwoo PARK  

     
    PAPER-Distributed, Grid and P2P Computing

      Page(s):
    1820-1828

    Fault tolerance is an essential feature of distributed systems, where the possibility of a failure increases with the size of the system. In spite of extensive research over two decades, fault-tolerance systems have not succeeded in practical use, owing to the high overhead and unwieldiness of previous systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to fill the gap between the theory and practice of fault-tolerance systems and to provide a checkpointing-recovery system for grids. To build a fault-tolerant MPICH version, we have designed task migration, dynamic process management, and atomic message transfer. MPICH-GF requires no modification of application source code and affects the MPICH communication characteristics as little as possible. The features of MPICH-GF are that it supports the direct message transfer mode and that all of the implementation has been done at the lower layer, that is, the abstract device level. We have evaluated MPICH-GF using NPB applications on Globus middleware.

  • Evaluation of the Feedback Guided Dynamic Loop Scheduling (FGDLS) Algorithms

    Sabin TABIRCA  Tatiana TABIRCA  Laurence T. YANG  Len FREEMAN  

     
    PAPER-Distributed, Grid and P2P Computing

      Page(s):
    1829-1833

    In this paper we consider the Feedback-Guided Dynamic Loop Scheduling (FGDLS) method proposed by Bull. The method uses a feedback-guided mechanism to schedule a parallel loop within a sequential outer loop. The execution times and the scheduling bounds at one outer iteration are used to find the scheduling bounds for the next outer iteration; in this way FGDLS achieves an optimal load balance. Two algorithms have been proposed so far by Tabirca et al. In this article we review these two algorithms and compare their performance.

  • Ω Line Problem in Optimistic Log-Based Rollback Recovery Protocol

    MaengSoon BAIK  SungJin CHOI  ChongSun HWANG  JoonMin GIL  ChanYeol PARK  HeonChang YOO  

     
    PAPER-Distributed, Grid and P2P Computing

      Page(s):
    1834-1842

    Optimistic log-based rollback recovery protocols have been regarded as an attractive fault-tolerance solution for distributed systems based on the message-passing paradigm, due to their low overhead during failure-free operation. These protocols are based on the Piecewise Deterministic (PWD) assumption model. They have assumed, however, that all logged non-deterministic events in a consistent global recovery line can be deterministically replayed at recovery time. In this paper, we identify the impossibility of deterministically replaying logged non-deterministic events in a consistent global recovery line, which we call the Ω Line Problem, arising from the asynchronous properties of distributed systems: no bound on the relative speeds of processes, no bound on message transmission delays, and no global time source. In addition, we propose a new optimistic log-based rollback recovery protocol which guarantees the deterministic replay of all logged non-deterministic events belonging to a consistent global recovery line and thereby solves the Ω Line Problem at recovery time.

  • A Grid Portal to Support High-Performance Scientific Computing on Distributed Resources

    Jacobo TARRIO  Juan TOURIÑO  María J. MARTIN  Patricia GONZALEZ  Ramon DOALLO  

     
    PAPER-Distributed, Grid and P2P Computing

      Page(s):
    1843-1849

    Grid computing can help to promote high-performance computing at a low overall cost by encouraging research centers to share their resources. However, research staff usually find it quite hard to use Grids effectively, due to the need to install and manage new Grid software. Grid portals are therefore created to make it easier to take advantage of the full capability of the Grid, thereby encouraging its use. The goal of this paper is to describe the design and implementation of a Grid portal that aims both to support distributed high-performance resources and to make their use by researchers as transparent as possible. The portal uses standard Grid and Web technologies. We have designed the portal so that it can be adapted to different existing Grid infrastructures based on the Globus Toolkit, and new functionalities can be easily added. The first prototype of the portal has been tested on an experimental Grid platform, and we present the encouraging experiences gained there.

  • A Rank-Based Selection Method of Materialized Queries for Efficient Query Evaluation in a Mediator

    Kil Hong JOO  Won Suk LEE  

     
    PAPER-Distributed, Grid and P2P Computing

      Page(s):
    1850-1858

    This paper proposes an efficient query evaluation scheme for a mediator system intended to integrate computing environments that are heterogeneous in terms of operating systems, database management systems, and other software. Most mediator systems transform a global query into a set of sub-queries based on their target remote servers, and each sub-query is evaluated by the query modification method to evaluate the global query. However, the evaluation cost of a global query can be reduced when the results of frequently requested sub-queries are materialized in the mediator. In a mediator, the integrating schema can be modified incrementally, and the evaluation frequency of a global query can also vary continuously. In order to select the optimal set of materialized sub-queries with respect to their current evaluation frequencies, the proposed method applies a decay factor to model the recent access behavior of each sub-query. In other words, the latest access to a sub-query gets the highest attention in the selection process of materialized sub-queries. As a result, it is possible to adjust the optimal set of materialized sub-queries adaptively according to recent changes in the evaluation frequencies of sub-queries. Since finding the optimum solution to this problem is NP-hard, it takes too long to be used in practice when the number of sub-queries is large. Consequently, given the size of the mediator storage, the rank-based selection algorithm proposed in this paper finds the set of materialized sub-queries that minimizes the total evaluation cost of global queries with linear search complexity.
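
    A hedged sketch of one plausible reading of the decay-factor idea (geometrically discounting older accesses so that the latest access weighs most); the formula, constants, and function name are assumptions for illustration, not the paper's exact model:

```c
#include <stdio.h>
#include <math.h>

/* weight(t) = sum over past accesses of d^(t - t_access), with 0 < d < 1,
   maintained incrementally: decay the stored weight by the elapsed steps,
   then add 1 if an access happens now. */
static double decayed_weight(double old_weight, double decay,
                             long steps_since_update, int accessed_now) {
    return old_weight * pow(decay, (double)steps_since_update)
           + (accessed_now ? 1.0 : 0.0);
}

int main(void) {
    double w = 0.0, d = 0.9;
    w = decayed_weight(w, d, 0, 1);   /* access at t=0        -> 1.00 */
    w = decayed_weight(w, d, 5, 1);   /* access at t=5        -> 1.59 */
    w = decayed_weight(w, d, 3, 0);   /* no access, now t=8   -> 1.16 */
    printf("current weight = %.2f\n", w);
    return 0;
}
```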

  • Allocation of Tasks in a DCS Using a Different Approach with A* Considering Load

    Biplab KUMER SARKER  Anil KUMAR TRIPATHI  Deo PRAKASH VIDYARTHI  Laurence T. YANG  Kuniaki UEHARA  

     
    PAPER-Distributed, Grid and P2P Computing

      Page(s):
    1859-1866

    In a Distributed Computing System (DCS), submitted tasks are usually partitioned into modules, and these modules may be allocated to different processing nodes so as to achieve the minimum turnaround time for the tasks while making maximum use of the resources of the existing system, such as CPU speed and memory capacity. The problem lies in how to obtain an optimal allocation of these multiple tasks while ensuring that no processing node becomes overloaded by the allocation. This paper proposes an algorithm, A*RS, based on the well-known A* search, which aims to reduce the search space and time for task allocation. It aims to minimize the turnaround time of tasks in such a way that processing nodes do not become overloaded by the allocation. Our experimental results justify these claims by comparing A*RS with an earlier algorithm for multiple task allocation.

  • Regular Section
  • Efficient Architectures for the Biorthogonal Wavelet Transform by Filter Bank and Lifting Scheme

    Yeu-Horng SHIAU  Jer Min JOU  Chin-Chi LIU  

     
    PAPER-VLSI Systems

      Page(s):
    1867-1877

    In this paper, two efficient VLSI architectures for the biorthogonal wavelet transform are proposed. One is constructed from a filter bank implementation and the other from the lifting scheme. In the filter bank implementation, owing to the symmetric property of the biorthogonal wavelet transform, the proposed architecture uses fewer multipliers than for the orthogonal wavelet transform. In addition, polyphase decomposition is adopted to speed up the processing by a factor of 2. In the lifting scheme implementation, a pipeline-scheduling technique is employed to optimize the architecture. Both architectures have the advantages of lower implementation complexity and higher throughput, and they can also be applied to realize the inverse DWT efficiently. Based on these properties, the two architectures can be applied to time-critical image compression, such as JPEG 2000. Finally, the architecture constructed with the lifting scheme is implemented as a single chip in 0.35 µm 1P4M CMOS technology; its area is 5.005 × 5.005 mm2 and it operates at 50 MHz.
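
    As a software illustration of what a lifting-scheme factorization computes, here is the reversible 5/3 biorthogonal lifting used in JPEG 2000, chosen as a representative example (an assumption; the paper's hardware may target a different biorthogonal filter and pipelines such steps rather than computing them in software):

```c
#include <stdio.h>

#define N 8   /* even-length signal for brevity */

/* Forward 5/3 lifting: predict (detail) step followed by update (smooth) step.
   Boundary handling is simplified to mirroring a single sample. */
static void lift_53_forward(const int x[N], int s[N / 2], int d[N / 2]) {
    /* Predict: detail = odd sample minus the average of its even neighbors. */
    for (int i = 0; i < N / 2; i++) {
        int left  = x[2 * i];
        int right = (2 * i + 2 < N) ? x[2 * i + 2] : x[2 * i];   /* mirror */
        d[i] = x[2 * i + 1] - ((left + right) >> 1);
    }
    /* Update: smooth = even sample plus a correction from neighboring details. */
    for (int i = 0; i < N / 2; i++) {
        int dl = (i > 0) ? d[i - 1] : d[i];                      /* mirror */
        s[i] = x[2 * i] + ((dl + d[i] + 2) >> 2);
    }
}

int main(void) {
    int x[N] = { 10, 12, 14, 13, 11, 9, 8, 8 };
    int s[N / 2], d[N / 2];
    lift_53_forward(x, s, d);
    for (int i = 0; i < N / 2; i++)
        printf("s[%d]=%d d[%d]=%d\n", i, s[i], i, d[i]);
    return 0;
}
```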

  • A Statistical Time Synchronization Method for Frequency-Synchronized Network Clocks in Distributed Systems

    Takao YAMASHITA  Satoshi ONO  

     
    PAPER-Computer Systems

      Page(s):
    1878-1886

    In this paper, we propose a statistical method of time synchronization for computer clocks that have precisely frequency-synchronized oscillators. This method not only improves the accuracy of time synchronization but also prevents degradation in the frequency stability of the precise oscillators when the errors in the measured time offsets between computer clocks caused by network traffic possess a Gaussian distribution. Improved accuracy of time synchronization is achieved by estimating the confidence interval of the measured time offsets between the clocks. Degradation in frequency stability is prevented by eliminating unnecessary time correction for the computer clock, because time correction generally causes changes in the frequency accuracy and stability of the precise oscillators. To eliminate unnecessary time correction, our method uses an extended hypothesis test of the difference between the current mean and the mean at the last time adjustment to determine whether time correction is needed. Evaluation by simulating changes in the time offset of the existing ISDN clock synchronization system showed that this method achieves accurate time and stable frequency synchronization.

  • Dynamic Communication Performance of a Hierarchical Torus Network under Non-uniform Traffic Patterns

    M. M. Hafizur RAHMAN  Susumu HORIGUCHI  

     
    PAPER-Computer Systems

      Page(s):
    1887-1896

    Interconnection networks play a crucial role in the performance of massively parallel computers. Hierarchical interconnection networks provide high performance at low cost by exploiting the locality that exists in the communication patterns of massively parallel computers. A Hierarchical Torus Network (HTN) is a 2D-torus network of multiple basic modules, where the basic modules are 3D-torus networks that are hierarchically interconnected to form higher-level networks. The static network performance of the HTN has already been studied and shown to be good. Its dynamic communication performance has been evaluated under a uniform traffic pattern but not under non-uniform traffic patterns. In this paper, we present a deadlock-free routing algorithm for the HTN using 3 virtual channels and evaluate the network's dynamic communication performance under three non-uniform traffic patterns using the proposed routing algorithm. We evaluate the dynamic communication performance of the HTN, H3D-mesh, H3D-torus, TESH, and mesh networks by computer simulation and find that the dynamic communication performance of the HTN is better than that of the H3D-mesh, H3D-torus, TESH, and mesh networks.

  • Design and Implementation of Markup Language for Integrating of Motion Capture Data Formats

    Hyun-Sook CHUNG  Yillbyung LEE  

     
    PAPER-System Programs

      Page(s):
    1897-1909

    Motion capture technology is widely used these days to create realistic motion. Different motion capture devices, however, use different motion capture data formats. Because of this lack of compatibility, animators cannot reuse already captured motion sequences. In addition, it is difficult to integrate, store, and retrieve motion capture data of different formats in a repository. In this paper, we propose a standard format for integrating different motion capture data formats, together with a framework for a system that manages motion capture data using this standard format. Our standard format, called MCML (Motion Capture Markup Language), is a markup language for motion capture data based on XML (eXtensible Markup Language). The system designed to manage motion capture data consists of several components: the Mocap Syntax Analyzer, MCML Converter, MCML Editor, Motion Viewer, and MCML Storage Wrapper.

  • Multiple DNA Sequences Alignment Using Heuristic-Based Genetic Algorithm

    Chih-Chin LAI  Shih-Wei CHUNG  

     
    PAPER-Artificial Intelligence and Cognitive Science

      Page(s):
    1910-1916

    The alignment of biological sequences is a crucial tool in molecular biology and genome analysis. A wide variety of approaches has been proposed for the multiple sequence alignment problem; however, some of them need prerequisites to help find the best alignment, and others suffer from drawbacks in complexity and memory requirements, so they can only be applied to cases with a limited number of sequences. In this paper, we view the multiple sequence alignment problem as an optimization problem and propose a heuristic-based genetic algorithm (GA) approach to solve it. The heuristic/GA hybrid yields better results than other well-known packages do. Experimental results are presented to illustrate the feasibility of the proposed approach.

  • A Statistical Method of Evaluating Pronunciation Proficiency for English Words Spoken by Japanese

    Seiichi NAKAGAWA  Naoki NAKAMURA  Kazumasa MORI  

     
    PAPER-Speech and Hearing

      Page(s):
    1917-1922

    In this paper, we propose a statistical method of evaluating the pronunciation proficiency of English words spoken by Japanese speakers. We statistically analyzed the utterances to find combinations of acoustic features that correlate highly with an English teacher's score. We observed that the phoneme recognition rates (correct rate and accuracy) were the best measure of pronunciation proficiency, and that the likelihood ratio of native English phoneme acoustic models to phoneme acoustic models adapted to Japanese speakers was the second-best measure. The measure most highly correlated with the English teacher's score was the combination of the likelihood for American native models, the likelihood for English models adapted to Japanese speakers, the best likelihood for arbitrary sequences of acoustic models, the phoneme recognition rate, and the rate of speech. We obtained correlation coefficients of 0.81 with open data for vocabulary and 0.69 with open data for speakers at the five-word set level; these coefficients are higher than the correlation between humans' scores, 0.65. At the 15-word set level, which corresponds to one or two sentences, we obtained a correlation coefficient of 0.86 with open data for speakers.

  • Auto Focusing Algorithm for Iris Recognition Camera Using Corneal Specular Reflection

    Kang Ryoung PARK  

    This paper was deleted on March 10, 2006 because it was found to be a duplicate submission (see details in the pdf file).
     
    PAPER-Image Processing and Video Processing

      Page(s):
    1923-1934

    Iris recognition identifies a user based on the iris texture information that lies between the white sclera and the black pupil. For fast iris recognition, it is very important to capture the user's focused eye image quickly; otherwise, the total recognition time increases and the user experiences considerable inconvenience. Previous research and systems use focusing methods developed for general landscape scenes without considering the characteristics of iris images, so they sometimes take a long time to focus, especially for users with glasses. To overcome such problems, we propose a new iris image acquisition method that captures the user's focused eye image at very high speed based on the corneal specular reflection. Experimental results show that the average focusing time for users both with and without glasses is 480 ms, and we conclude that our method can be used for a real-time iris recognition camera.

  • Multi-Stage Unsupervised Learning for Multi-Body Motion Segmentation

    Yasuyuki SUGAYA  Kenichi KANATANI  

     
    PAPER-Image Recognition, Computer Vision

      Page(s):
    1935-1942

    Many techniques have been proposed for segmenting feature point trajectories tracked through a video sequence into independent motions, but objects in the scene are usually assumed to undergo general 3-D motions. As a result, the segmentation accuracy considerably deteriorates in realistic video sequences in which object motions are nearly degenerate. In this paper, we propose a multi-stage unsupervised learning scheme first assuming degenerate motions and then assuming general 3-D motions and show by simulated and real video experiments that the segmentation accuracy significantly improves without compromising the accuracy for general 3-D motions.

  • Multi-Modal Neural Networks for Symbolic Sequence Pattern Classification

    Hanxi ZHU  Ikuo YOSHIHARA  Kunihito YAMAMORI  Moritoshi YASUNAGA  

     
    PAPER-Biocybernetics, Neurocomputing

      Page(s):
    1943-1952

    We have developed Multi-modal Neural Networks (MNN) to improve the accuracy of symbolic sequence pattern classification. The basic structure of the MNN is composed of several sub-classifiers using neural networks and a decision unit. Two types of the MNN are proposed: a primary MNN and a twofold MNN. In the primary MNN, the sub-classifier is composed of a conventional three-layer neural network. The decision unit uses the majority decision to produce the final decisions from the outputs of the sub-classifiers. In the twofold MNN, the sub-classifier is composed of the primary MNN for partial classification. The decision unit uses a three-layer neural network to produce the final decisions. In the latter type of the MNN, since the structure of the primary MNN is folded into the sub-classifier, the basic structure of the MNN is used twice, which is the reason why we call the method twofold MNN. The MNN is validated with two benchmark tests: EPR (English Pronunciation Reasoning) and prediction of protein secondary structure. The reasoning accuracy of EPR is improved from 85.4% by using a three-layer neural network to 87.7% by using the primary MNN. In the prediction of protein secondary structure, the average accuracy is improved from 69.1% of a three-layer neural network to 74.6% by the primary MNN and 75.6% by the twofold MNN. The prediction test is based on a database of 126 non-homologous protein sequences.

  • A Chaotic Maximum Neural Network for Maximum Clique Problem

    Jiahai WANG  Zheng TANG  Ronglong WANG  

     
    PAPER-Biocybernetics, Neurocomputing

      Page(s):
    1953-1961

    In this paper, based on the maximum neural network, we propose a new parallel algorithm that can escape from local minima and has a powerful ability to search for the globally optimal or near-optimum solution to the maximum clique problem (MCP). In graph theory, a clique is a completely connected subgraph, and the MCP is to find a clique of maximum size in a graph. The MCP is a classic optimization problem in computer science and graph theory with many real-world applications, and it is also known to be NP-complete. Lee and Takefuji presented a very powerful neural approach, called the maximum neural network, for this NP-complete problem. The maximum neural model always guarantees a valid solution and greatly reduces the search space without the burden of parameter tuning. However, the model tends to converge to a local minimum easily because it is based on the steepest descent method. By adding negative self-feedback to the maximum neural network, we propose a parallel algorithm that introduces richer and more flexible chaotic dynamics and can prevent the network from getting stuck at local minima. After the chaotic dynamics vanish, the proposed algorithm is fundamentally governed by gradient descent dynamics and usually converges to a stable equilibrium point. The proposed algorithm thus has the advantages of both the maximum neural network and chaotic neurodynamics. A large number of instances have been simulated to verify the proposed algorithm.

  • A Low-Power Tournament Branch Predictor

    Sung Woo CHUNG  Gi Ho PARK  Sung Bae PARK  

     
    LETTER-Computer Systems

      Page(s):
    1962-1964

    This letter proposes a low-power tournament branch predictor in which the number of accesses to the component branch predictors (the local predictor and the global predictor) is reduced. Analysis results with the Samsung Memory Compiler show that the proposed branch predictor reduces power consumption by 24-45% compared to the conventional tournament branch predictor, while requiring no additional storage arrays, incurring no additional delay, and never harming accuracy.

  • A Method to Preserve Layered Architectural Style in Development Phases

    Chanjin PARK  Euyseok HONG  Chisu WU  

     
    LETTER-Software Engineering

      Page(s):
    1965-1970

    This paper proposes a new type of relationship between layers in a layered architecture and shows how to concretize the relationship between layers into design constraints. The meaning of the layer relationship is explained with examples from design patterns and Microsoft COM. In addition, a prototype conformance-checking tool is implemented, and the architecture document of an open-source software project is checked against the actual architecture extracted from source code developed by many international developers. As a result of the check, parts that do not conform to the architecture document are investigated, and it is pointed out that modifications to them should be controlled with caution.

  • A Local Learning Framework Based on Multiple Local Classifiers

    BaekSop KIM  HyeJeong SONG  JongDae KIM  

     
    LETTER-Pattern Recognition

      Page(s):
    1971-1973

    This paper presents a local learning framework in which the local classifiers can be pre-learned and the support size of each classifier can be selected to minimize the error bound. The proposed algorithm is compared with the conventional support vector machine (SVM). Experimental results show that our scheme is more accurate than the conventional SVM and less sensitive to the user-defined parameters C and σ.

  • Harmonic Model Based Excitation Enhancement for Low-Bit-Rate Speech Coding

    Hong Kook KIM  Mi Suk LEE  Chul Hong KWON  

     
    LETTER-Speech and Hearing

      Page(s):
    1974-1977

    A new excitation enhancement technique based on a harmonic model is proposed in this paper to improve the speech quality of low-bit-rate speech coders. The technique is employed only in the decoding process of a speech coder and improves the high-frequency components of the excitation. We develop procedures for harmonic model parameter estimation and harmonic generation, and apply the technique to a current state-of-the-art low-bit-rate speech coder. Experiments on spectrum reading and spectrum distortion measurement show that the proposed excitation enhancement technique improves speech quality.

  • Distorted Speech Rejection for Automatic Speech Recognition in Wireless Communication

    Joon-Hyuk CHANG  Nam Soo KIM  

     
    LETTER-Speech and Hearing

      Page(s):
    1978-1981

    This letter introduces a pre-rejection technique for wireless-channel-distorted speech with application to automatic speech recognition (ASR). Based on an analysis of speech signals distorted over a wireless communication channel, we propose a method to reject channel-distorted speech with a small computational load. A number of simulation results show that the pre-rejection algorithm enhances the robustness of speech recognition.

  • Document Genre Classification for User Interface of Web Search Engine

    Kong-Joo LEE  

     
    LETTER-Natural Language Processing

      Page(s):
    1982-1986

    In this letter we suggest sets of features for classifying the genres of web documents. Web documents differ from textual documents in that they contain URLs and HTML tags within the pages. We introduce features specific to web documents, which are extracted from URLs and HTML tags. Experimental results enable us to evaluate their characteristics and performance. On the basis of the experimental results, we implement a user interface for a web search engine that presents documents grouped by genre.

  • Continuous Optimization for Item Selection in Collaborative Filtering

    Kohei INOUE  Kiichi URAHAMA  

     
    LETTER-Biocybernetics, Neurocomputing

      Page(s):
    1987-1988

    A method is presented for selecting the items that new users are asked to rate when they enter a recommendation system based on collaborative filtering. Optimal item selection is formulated as an integer programming problem, which we solve using a Hopfield-network-like scheme for interior point methods.
