Shota FUJII Shohei KAKEI Masanori HIROTOMO Makoto TAKITA Yoshiaki SHIRAISHI Masami MOHRI Hiroki KUZUNO Masakatu MORII
Haoran LUO Tengfei SHAO Tomoji KISHI Shenglei LI
Chee Siang LEOW Tomoki KITAGAWA Hideaki YAJIMA Hiromitsu NISHIZAKI
Dengtian YANG Lan CHEN Xiaoran HAO
Rong HUANG Yue XIE
Toshiki ONISHI Asahi OGUSHI Ryo ISHII Akihiro MIYATA
Meihua XUE Kazuki SUGITA Koichi OTA Wen GU Shinobu HASEGAWA
Jinyong SUN Zhiwei DONG Zhigang SUN Guoyong CAI Xiang ZHAO
Yusuke HIROTA Yuta NAKASHIMA Noa GARCIA
Yusuke HIROTA Yuta NAKASHIMA Noa GARCIA
Kosetsu TSUKUDA Tomoyasu NAKANO Masahiro HAMASAKI Masataka GOTO
ZhengYu LU PengFei XU
Binggang ZHUO Ryota HONDA Masaki MURATA
Qingqing YU Rong JIN
Huawei TAO Ziyi HU Sixian LI Chunhua ZHU Peng LI Yue XIE
Qianhang DU Zhipeng LIU Yaotong SONG Ningning WANG Zeyuan JU Shangce GAO
Ryota TOMODA Hisashi KOGA
Reina SASAKI Atsuko TAKEFUSA Hidemoto NAKADA Masato OGUCHI
So KOIDE Yoshiaki TAKATA Hiroyuki SEKI
Huang Rong Qian Zewen Ma Hao Han Zhezhe Xie Yue
Huu-Long PHAM Ryota MIBAYASHI Takehiro YAMAMOTO Makoto P. KATO Yusuke YAMAMOTO Yoshiyuki SHOJI Hiroaki OHSHIMA
Taku WAKUI Fumio TERAOKA Takao KONDO
Shaobao Wu Zhihua Wu Meixuan Huang
Koji KAMMA Toshikazu WADA
Dingjie PENG Wataru KAMEYAMA
Zhizhong WANG Wen GU Zhaoxing LI Koichi OTA Shinobu HASEGAWA
Tomoaki YAMAZAKI Seiya ITO Kouzou OHARA
Daihei ISE Satoshi KOBAYASHI
Masanari ICHIKAWA Yugo TAKEUCHI
Shota SUZUKI Satoshi ONO
Reoma MATSUO Toru KOIZUMI Hidetsugu IRIE Shuichi SAKAI Ryota SHIOYA
Hirotaka HACHIYA Fumiya NISHIZAWA
Issa SUGIURA Shingo OKAMURA Naoto YANAI
Mudai KOBAYASHI Mohammad Mikal Bin Amrul Halim Gan Takahisa SEKI Takahiro HIROFUCHI Ryousei TAKANO Mitsuhiro KISHIMOTO
Chi ZHANG Luwei ZHANG Toshihiko YAMASAKI
Jung Min Lim Wonho Lee Jun-Hyeong Choi Jong Wook Kwak
Zhuo ZHANG Donghui LI Kun JIANG Ya LI Junhu WANG Xiankai MENG
Takayoshi SHIKANO Shuichi ICHIKAWA
Shotaro ISHIKURA Ryosuke MINAMI Miki YAMAMOTO
Pengfei ZHANG Jinke WANG Yuanzhi CHENG Shinichi TAMURA
Fengqi GUO Qicheng LIU
Runlong HAO Hui LUO Yang LI
Rongchun XIAO Yuansheng LIU Jun ZHANG Yanliang HUANG Xi HAN
Yong JIN Kazuya IGUCHI Nariyoshi YAMAI Rei NAKAGAWA Toshio MURAKAMI
Toru HASEGAWA Yuki KOIZUMI Junji TAKEMASA Jun KURIHARA Toshiaki TANAKA Timothy WOOD K. K. RAMAKRISHNAN
Rikima MITSUHASHI Yong JIN Katsuyoshi IIDA Yoshiaki TAKAI
Zezhong LI Jianjun MA Fuji REN
Lorenzo Mamelona TingHuai Ma Jia Li Bright Bediako-Kyeremeh Benjamin Kwapong Osibo
Wonho LEE Jong Wook KWAK
Xiaoxiao ZHOU Yukinori SATO
Kento WATANABE Masataka GOTO
Kazuyo ONISHI Hiroki TANAKA Satoshi NAKAMURA
Takashi YOKOTA Kanemitsu OOTSU
Chenbo SHI Wenxin SUN Jie ZHANG Junsheng ZHANG Chun ZHANG Changsheng ZHU
Masateru TSUNODA Ryoto SHIMA Amjed TAHIR Kwabena Ebo BENNIN Akito MONDEN Koji TODA Keitaro NAKASAI
Masateru TSUNODA Takuto KUDO Akito MONDEN Amjed TAHIR Kwabena Ebo BENNIN Koji TODA Keitaro NAKASAI Kenichi MATSUMOTO
Hiroaki AKUTSU Ko ARAI
Lanxi LIU Pengpeng YANG Suwen DU Sani M. ABDULLAHI
Xiaoguang TU Zhi HE Gui FU Jianhua LIU Mian ZHONG Chao ZHOU Xia LEI Juhang YIN Yi HUANG Yu WANG
Yingying LU Cheng LU Yuan ZONG Feng ZHOU Chuangao TANG
Jialong LI Takuto YAMAUCHI Takanori HIRANO Jinyu CAI Kenji TEI
Wei LEI Yue ZHANG Hanfeng XIE Zebin CHEN Zengping CHEN Weixing LI
David CLARINO Naoya ASADA Atsushi MATSUO Shigeru YAMASHITA
Takashi YOKOTA Kanemitsu OOTSU
Xiaokang Jin Benben Huang Hao Sheng Yao Wu
Tomoki MIYAMOTO
Ken WATANABE Katsuhide FUJITA
Masashi UNOKI Kai LI Anuwat CHAIWONGYEN Quoc-Huy NGUYEN Khalid ZAMAN
Takaharu TSUBOYAMA Ryota TAKAHASHI Motoi IWATA Koichi KISE
Chi ZHANG Li TAO Toshihiko YAMASAKI
Ann Jelyn TIEMPO Yong-Jin JEONG
Jiakun LI Jiajian LI Yanjun SHI Hui LIAN Haifan WU
Nikolay FEDOROV Yuta YAMASAKI Masateru TSUNODA Akito MONDEN Amjed TAHIR Kwabena Ebo BENNIN Koji TODA Keitaro NAKASAI
Yukasa MURAKAMI Yuta YAMASAKI Masateru TSUNODA Akito MONDEN Amjed TAHIR Kwabena Ebo BENNIN Koji TODA Keitaro NAKASAI
Akira ITO Yoshiaki TAKAHASHI
Rindo NAKANISHI Yoshiaki TAKATA Hiroyuki SEKI
Chuzo IWAMOTO Ryo TAKAISHI
Koichi FUJII Tomomi MATSUI
Kazuyuki AMANO
Takumi SHIOTA Tonan KAMATA Ryuhei UEHARA
Hitoshi MURAKAMI Yutaro YAMAGUCHI
Kento KIMURA Tomohiro HARAMIISHI Kazuyuki AMANO Shin-ichi NAKANO
Ryotaro MITSUBOSHI Kohei HATANO Eiji TAKIMOTO
Naohito MATSUMOTO Kazuhiro KURITA Masashi KIYOMI
Tomohiro KOBAYASHI Tomomi MATSUI
Shin-ichi NAKANO
Ming PAN
Sung Woo CHUNG Hyong-Shik KIM Chu Shik JHON
In CC-NUMA multiprocessor systems, it is important to reduce the remote memory access time. Based upon the fact that increasing the size of the LRU second-level (L2) cache more than a certain value does not reduce the cache miss rate significantly, in this paper, we propose two split L2 caches to utilize the surplus of the L2 cache. The split L2 caches are composed of a traditional LRU cache and another cache to reduce the remote memory access time. Both work together to reduce total L2 cache miss time by keeping remote (or long-distance) blocks as well as recently used blocks. For another cache, we propose two alternatives: an L2-RVC (Level 2 - Remote Victim Cache) and an L2-DAVC (Level 2 - Distance-Aware Victim Cache). The proposed split L2 caches reduce total execution time by up to 27%. It is also found that the proposed split L2 caches outperform the traditional single LRU cache of double size.
Weijia JIA Bo HAN Pui On AU Yong HE Wanlei ZHOU
Cluster computation has been used in the applications that demand performance, reliability, and availability, such as cluster server groups, large-scale scientific computations, distributed databases, distributed media-on-demand servers and search engines etc. In those applications, multicast can play the vital roles for the information dissemination among groups of servers and users. This paper proposes a set of novel efficient fault-tolerant multicast routing algorithms on hypercube interconnection of cluster computers using multicast shared tree approach. We present some new algorithms for selecting an optimal core (root) and constructing the shared tree so as to minimize the average delay for multicast messages. Simulation results indicate that our algorithms are efficient in the senses of short end-to-end average delay, load balance and less resource utilizations over hypercube cluster interconnection networks.
Bit-errors in a subband of a wavelet-based video frame during network transmission affect not only lower-level subbands within the same frame but also the subsequent frames. This is because the video frame is wavelet-transformed image with multi-levels and referenced from later frames. In this paper, we propose a new motion estimation scheme for wavelet-based video called Intra-frame Motion Estimation (IME), in which each subband except the LL subband refers to the 1-level-lower subband in the same orientation within the same frame. This scheme protects video quality by confining the effects of the bit-errors of all subbands, except the LL subband, within a frame. We evaluated the performance of our proposed scheme in a simulated wireless network environment. As a result of tests, it was shown that the proposed IME algorithm performs better than MRME, a motion-compensated video coding scheme for wavelet video, in a heavy motion video sequence, while IME outperforms MRME at a high bit-rate in small motion video sequence.
Ping-Jer YEH Yu-Chen CHUANG Shyan-Ming YUAN
Traditional Web cache servers based on HTTP and ICP infrastructure tend to have higher hardware and management cost, have difficulty in availability, automatic and dynamic reconfiguration, and may have slow links to some users. We find that peer-to-peer technology can help solve these problems. The peer cache service (PCS) we proposed here leverages each peer's local cache, similar access patterns, fully distributed coordination, and fast communication channels to enhance response time, scale of cacheable objects, and availability. Moreover, incorporating goals and strategies such as making the protocol lightweight and mutually compatible with existing cache infrastructure, supporting mobile devices, undertaking dynamic three-level caching, and exchanging cache meta-information further improve the effectiveness and differentiate our work from other similar-at-first-glance P2P Web cache systems.
This paper proposes several novel hierarchical interconnection networks based on the (3, 3)-graphs, namely folded (3, 3)-networks, root-folded (3, 3)-networks, recursively expanded (3, 3)-networks, and flooded (3, 3)-networks. Just as the hypercubes, CCC, Peterson-based networks, and Heawood-based networks, these hierarchical networks have the following nice properties: regular topology, high scalability, and small diameters. Due to these important properties, these hierarchical networks seem to have the potential as alternatives for the future interconnection structures of multicomputer systems, especially massively parallel processors (MPPs). Furthermore, this paper will present the routing and broadcasting algorithms for these proposed networks to demonstrate that these algorithms are as elegant as the algorithms for hypercubes, CCC, and Petersen- or Heawood-based networks.
Nader MOHAMED Jameela AL-JAROODI Hong JIANG
High performance scientific and engineering applications running on clusters have different communication requirements. Current cluster configurations typically provide multiple network interfaces per node and multiple interconnections among nodes. However, transport protocols such as TCP do not utilize existing multiple network interfaces to enhance communication performance. This paper introduces a new configurable communication model utilizing multiple interconnections. The model adds mechanisms to manage and enhance the overall communication performance of clusters. These configurations include the use of parallel message transfers, the separation of the transfer channels between small messages and large messages, and load balancing among the channels. The main advantages of the model are: (1) providing a flexible, enhanced network infrastructure, (2) hiding the technical details of the heterogeneous network resources from the applications, and (3) providing an easy and flexible way to extend the network capacities for specific nodes. To illustrate the advantages and performance enhancements of the model, a prototype was implemented to experimentally evaluate the cluster network performance, which showed considerable gains.
Tran CONG SO Shigeru OYANAGI Katsuhiro YAMAZAKI
We have proposed a speculative selection function for adaptive routing, which uses idle cycles of the network physical links to exchange network information between nodes, thus helps to decide the best selection. Previous study on the mesh network showed that SSR gives message selection flexibility that improves network performance by balancing the network traffic in both global and local scopes. This paper evaluates the speculative selection function on 2D torus network with simulation. The simulation compares the network throughput and latency with various traffic patterns. The visualization graphs show how the speculative selection eliminates hotspots and disperses traffic in the global scope. The simulation results demonstrate that by using speculative selection, the network performance is increased by around 7%. Compared to the mesh network, the torus's version has smaller gain due to the high performance nature of the torus network.
Eero AHO Jarno VANNE Kimmo KUUSILINNA Timo D. HAMALAINEN
Parallel memories increase memory bandwidth with several memory modules working in parallel and can be used to feed a processor with only necessary data. The Configurable Parallel Memory Architecture (CPMA) enables a multitude of access formats and module assignment functions to be used within a single hardware implementation, which has not been possible in prior embedded parallel memory systems. This paper focuses on address computation in CPMA, which is implemented using several configurable computation units in parallel. One unit is dedicated for each type of access formats and module assignment functions that the implementation supports. Timing and area estimates are given for a 0.25-micron CMOS process. The utilized resources are shown to be linearly proportional to the number of memory modules.
Byeonghag SEONG Donggook KIM Yangwoo ROH Kyuho PARK Daeyeon PARK
Shared memory multiprocessors in which each processor has its own TLB must manage consistency among TLBs and a page table. As the large-scale CC-NUMA (cache-coherent non-uniform memory access) shared memory multiprocessors become popular, it is important for TLB consistency management algorithms to be highly scalable. In this paper, we propose a TLB update-hint algorithm as a scalable TLB consistency management solution for CC-NUMA multiprocessors. By using a lazy TLB invalidation approach, we reduced the number of unnecessary processor interruptions and idle-waiting time, and achieved a high level of scalability. Using a shared memory simulator, we evaluated the TLB update-hint algorithm. For performance comparison, we also simulated the TLB shootdown algorithm, one of the most popular TLB consistency algorithms. The simulations demonstrated that the TLB update-hint algorithm scales well in systems with a large number of processors. At 64 node systems, the TLB update-hint algorithm shows 47
Fan CHAN Jiannong CAO Alvin T.S. CHAN Minyi GUO
Many parallel applications involve different independent tasks with their own data. Using the MPMD model, programmers can have a modular view and simplified structure of the parallel programs. Although MPI supports both SPMD and MPMD models for programming, MPI libraries do not provide an efficient way for task communication for the MPMD model. We have developed a programming environment, called ClusterGOP, for building and developing parallel applications. Based on the graph-oriented programming (GOP) model, ClusterGOP provides higher-level abstractions for message-passing parallel programming with the support of software tools for developing and running parallel applications. In this paper, we describe how ClusterGOP supports programming of MPMD parallel applications on top of MPI. We discuss the issues of implementing the MPMD model in ClusterGOP using MPI and evaluate the performance by using example applications.
Juan PIERNAS Toni CORTES Jose M. GARCIA
DualFS is a next-generation journaling file system which has the same consistency guaranties as traditional journaling file systems but better performance. This paper introduces three new enhancements which significantly improve DualFS performance during normal operation, and presents different experimental results which compare DualFS and other traditional file systems, namely, Ext2, Ext3, XFS, JFS, and ReiserFS. The experiments carried out prove, for the first time, that a new file system design based on separation of data and metadata can significantly improve file systems' performance without requiring several storage devices.
Suhyun KIM Soo-Mook MOON Kemal EBCIOLU Erik ALTMAN
For network computing on desktop machines, fast execution of Java bytecode programs is essential because these machines are expected to run substantial application programs written in Java. We believe higher Java performance can be achieved by exploiting instruction-level parallelism (ILP) in the context of Java JIT compilation. This paper introduces VLaTTe, a Java JIT compiler for VLIW machines that performs efficient scheduling while doing fast register allocation. It is an extended version of our previous JIT compiler for RISC machines called LaTTe whose translation overhead is low (i.e., consistently taking one or two seconds for SPECJVM98 benchmarks) due to its fast register allocation. VLaTTe adds the scheduling capability onto the same framework of register allocation, with a constraint for precise in-order exception handling which guarantees the same Java exception behavior with the original bytecode program. Our experimental results on the SPECJVM98 benchmarks show that VLaTTe achieves a geometric mean of useful IPC 1.7 (2-ALU), 2.1 (4-ALU), and 2.3 (8-ALU), while the scheduling/allocation overhead is 3.6 times longer than LaTTe's on average, which appears to be reasonable.
Sheng-Wen BAI Chu-Sing YANG Tsung-Chuan HUANG
In many parallel programs, run-time data redistribution is usually required to enhance data locality and reduce remote memory access on the distributed memory multicomputers. Research on data redistribution algorithms has recently matured. The time required to generate data sets and processor sets is much lesser than before. Therefore, packing/unpacking has become a relatively high cost in redistribution. In this paper, we present methods to perform BLOCK-CYCLIC(s) to BLOCK-CYCLIC(t) redistribution, using MPI user-defined datatypes. This method reduces the required memory buffers and avoids unnecessary movement of data. Theoretical models are presented to determine the best method for redistribution. The methods were implemented on an IBM SP2 parallel machine to evaluate the performance of the proposed methods. The experimental results indicate that this approach can clearly improve the redistribution in most cases.
Rodrigo Fernandes de MELLO Erico C. T. de MATTOS Luis Carlos TREVELIN Maria Stela Veludo de PAIVA Laurence T. YANG
The availability of a low cost hardware has increased the development of distributed systems, by making then more and more accessible. In order to optimize the resources allocation on the distributed systems, some load balancing algorithms have been proposed. These algorithms distribute the application loads over the environment computers, make homogeneous the occupation of the whole environment and increase the application performance. This equal distribution prevents certain computers to get overloaded, to the detriment of the idleness of the other ones. This article proposes and analyzes the TLBAGrid, a load balancing algorithm for Grid computing environments.
Shinji TANAKA Tetsuyasu YAMADA Satoshi SHIRAISHI
The sizes of recent Java-based server-side applications, like J2EE containers, have been increasing continuously. Past techniques for improving the performance of Java applications have targeted relatively small applications. Moreover, when the methods of these small target applications are invoked, they are not usually distributed over the entire memory space. As a result, these techniques cannot be applied efficiently to improve the performance of current large applications. We propose a dynamic code repositioning approach to improve the hit rates of instruction caches and translation look-aside buffers. Profiles of method invocations are collected when the application performs with its heaviest processor load, and the code is repositioned based on these profiles. We also discuss a method-splitting technique to significantly reduce the sizes of methods. Our evaluation of a prototype implementing these techniques indicated 5% improvement in the throughput of the application.
Techniques for automatic program recognition, at the algorithmic level, could be of high interest for the area of Software Maintenance, in particular for knowledge based reengineering, because the selection of suitable restructuring strategies is mainly driven by algorithmic features of the code. In this paper an automated hierarchical concept parsing recognition technique, and a formalism for the specification of algorithmic concepts, is presented. Based on this technique, the design and development of ALCOR, a production rule based system for automatic recognition of algorithmic concepts within programs, aimed at support of knowledge based reengineering for high performance, is presented.
In this paper we apply a parallel adaptive solution algorithm to simulate nanoscale double-gate metal-oxide-semiconductor field effect transistors (MOSFETs) on a personal computer (PC)-based Linux cluster with the message passing interface (MPI) libraries. Based on a posteriori error estimation, the triangular mesh generation, the adaptive finite volume method, the monotone iterative method, and the parallel domain decomposition algorithm, a set of two-dimensional quantum correction hydrodynamic (HD) equations is solved numerically on our constructed cluster system. This parallel adaptive simulation methodology with 1-irregular mesh was successfully developed and applied to deep-submicron semiconductor device simulation in our recent work. A 10 nm n-type double-gate MOSFET is simulated with the developed parallel adaptive simulator. In terms of physical quantities and refined adaptive mesh, simulation results demonstrate very good accuracy and computational efficiency. Benchmark results, such as load-balancing, speedup, and parallel efficiency are achieved and exhibit excellent parallel performance. On a 16 nodes PC-based Linux cluster, the maximum difference among CPUs is less than 6%. A 12.8 times speedup and 80% parallel efficiency are simultaneously attained with respect to different simulation cases.
In this paper, Message Passing Interface (MPI) techniques are used to implement high-order full 3-D Navier-Stokes equations in multi-domain applications. A two-domain interface with five-point overlapping used previously is expanded to a multi-domain computation. There are normally two approaches for this expansion. One is to break up the domain into two parts through domain decomposition (say, one overlapping), then using MPI directives to further break up each domain into n parts. Another is to break the domain up into 2n parts with (2n-1) overlappings. In our present effort, the latter approach is used and finite-size overlappings are employed to exchange data between adjacent multi-dimensional sub-domains. It is an alternative way to parallelize the high-order full 3-D Navier-Stokes equations into multi-domain applications without adding much complexity. Results with high-order boundary treatments show consistency among multi-domain calculations and single-domain results.
Cheong Ghil KIM Hong-Sik KIM Sungho KANG Shin Dug KIM Gunhee HAN
Scientific computations for diffusion equations and ANNs (Artificial Neural Networks) are data intensive tasks accompanied by heavy memory access; on the other hand, their computational complexities are relatively low. Thus, this type of tasks naturally maps onto SIMD (Single Instruction Multiple Data stream) parallel processing with distributed memory. This paper proposes a high performance acceleration processor of which architecture is optimized for scientific computing using diffusion equations and ANNs. The proposed architecture includes a customized instruction set and specific hardware resources which consist of a control unit (CU), 16 processing units (PUs), and a non-linear function unit (NFU) on chip. They are effectively connected with dedicated ring and global bus structure. Each PU is equipped with an address modifier (AM) and 16-bit 1.5 k-word local memory (LM). The proposed processor can be easily expanded by multi-chip expansion mode to accommodate to a large scale parallel computation. The prototype chip is implemented with FPGA. The total gate count is about 1 million with 530, 432-bit embedded memory cells and it operates at 15 MHz. The functionality and performance of the proposed processor is verified with simulation of oil reservoir problem using diffusion equations and character recognition application using ANNs. The execution times of two applications are compared with software realizations on 1.7 GHz Pentium IV personal computer. Though the proposed processor architecture and the instruction set are optimized for diffusion equations and ANNs, it provides flexibility to program for many other scientific computation algorithms.
Dejiang JIN Sotirios G. ZIAVRAS
The multiplication of large spare matrices is a basic operation in many scientific and engineering applications. There exist some high-performance library routines for this operation. They are often optimized based on the target architecture. For a parallel environment, it is essential to partition the entire operation into well balanced tasks and assign them to individual processing elements. Most of the existing techniques partition the given matrices based on some kind of workload estimation. For irregular sparse matrices on PC clusters, however, the workloads may not be well estimated in advance. Any approach other than run-time dynamic partitioning may degrade performance. In this paper, we apply our super-programming approach to parallel large matrix multiplication on PC clusters. In our approach, tasks are partitioned into super-instructions that are dynamically assigned to member computer nodes. Thus, the load balancing logic is separated from the computing logic; the former is taken over by the runtime environment. Our super-programming approach facilitates ease of program development and targets high efficiency in dynamic load balancing. Workloads can be balanced effectively and the optimization overhead is small. The results prove the viability of our approach.
Michael (Shan-Hui) HO Weng-Long CHANG Minyi GUO Laurence T. YANG
This paper shows how to use sticker to construct solution space of DNA for the library sequences in the set-packing problem and the clique problem. Then, with biological operations, we propose DNA-based algorithms to remove illegal solutions and to find legal solutions for the set-packing and clique problems from the solution space of sticker. Any NP-complete problem in Cook's Theorem can be reduced and solved by the proposed DNA-based computing approach if its size is equal to or less than that of the set-packing problem. Otherwise, Cook's Theorem is incorrect on DNA-based computing and a new DNA algorithm should be developed from the characteristics of the NP-complete problem. Finally, the result to DNA simulation is given.
Graph data in large scientific/engineering applications are often too massive to fit inside the computer's main memory. The resulting input/output (I/O) costs could be a major performance bottleneck. This paper proposes an extension to extant multilevel graph partitioning algorithms with improved I/O-efficiency. The input graph is envisioned as the union of disjoint blocks (subgraphs) of almost the same size. Each block is coarsened in turn. Recursive matching and contraction are the operations in this phase. All the coarsened blocks are then merged in an iterative manner in order to ensure that the resulting graph fits in the main memory. This graph is then treated with an in-core multilevel graph partitioning algorithm in the usual way. Our experimental results show that the larger graph size is, the more dependent on the I/O-efficiency the performance is. And our modification can easily partition very large graphs. It also exhibits considerable improvement in I/O-complexity.
By using distributed database systems, many advantages can be obtained such as database management cost, efficiency, and high integrity of systems through allocating fragments to many distributed sites with horizontal/vertical fragmentation of global database schema. To minimize costs, distributed algorithms must be applied so that database fragments are allocated to optimal sites. It is useful to replicate fragments, such as allocating many copies in many sites including load balancing. But there are too many possible combinations of each site and fragment, making it impossible to find a solution in real time, i.e., it is an NP-complete problem. This paper proposes near optimal heuristic algorithms for minimizing cost by defining a cost model based on read and update queries that are requested in many sites. Various factors are applied to the proposed algorithms for sizing efficient network resources that compute database transactions as remote query or update requests for consistency in replicated database systems. For network load balancing, incoming network traffic table is defined in each site. A request transaction from unallocated sites to allocated sites can be accessed properly at any other replicated sites by using the network traffic table. Finally, some experimental results verified the proposed algorithms by comparing actual cases of database allocation.
Huabing ZHU Tony K.Y. CHAN Lizhe WANG Reginald C. JEGATHESE
This paper presents a prototype of a distributed 3D rendering system in a hierarchical Grid environment. 3D rendering with massive data sets is a computationally intensive task. In order to make full use of computational resources on Grids, a hierarchical system architecture is designed to run over multiple clusters. This architecture involves both sort-first and sort-last parallel rendering algorithms to achieve excellent scalability, rendering performance and load balance.
Mustafa MAT DERIS Noraziah AHMAD Md. Yazid Mohd SAMAN Noraida ALI Youwei YUAN
Data Replication can be used to improve the system availability of distributed systems. In such a system, a mechanism is required to maintain the consistency of the replicated data. The grid structure technique based on quorum is one of the solutions to perform this while providing a high availability of the system. It was shown in the study that, it still requires a bigger number of copies be made available to construct a quorum. So it is not suitable for large systems. In this paper, we propose a technique called the neighbor replication on grid (NRG) technique by considering only neighbors to have the replicated data. In comparison to the grid structure technique, NRG requires a lower communication cost for an operation, while providing a higher system availability, which is preferred for large systems.
Namyoon WOO Hyungsoo JUNG Heon Young YEOM Taesoon PARK Hyungwoo PARK
Fault-tolerance is an essential feature of the distributed systems where the possibility of a failure increases with the growth of the system. In spite of extensive researches over two decades, fault-tolerance systems have not succeeded in practical use. It is due to the high overhead and the unhandiness of the previous fault-tolerance systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to fill the gap between the theory and the practice of fault-tolerance systems, and to provide a checkpointing-recovery system for grids. To build a fault-tolerant MPICH version, we have designed task migration, dynamic process management, and atomic message transfer. MPICH-GF requires no modification of application source codes, and it affects the MPICH communication characteristics as less as possible. The features of MPICH-GF are that it supports the direct message transfer mode and that all of the implementation has been done at the lower layer, that is, the abstract device level. We have evaluated MPICH-GF using NPB applications on Globus middleware.
Sabin TABIRCA Tatiana TABIRCA Laurence T. YANG Len FREEMAN
In this paper we consider the Feedback-Guided Dynamic Loop Scheduling (FGDLS) method that was proposed by Bull. The method uses a feedback-guided mechanism to schedule a parallel loop within a sequential outer loop. The execution times and the scheduling bounds at a outer iteration are used to find the scheduling bound of the next outer iteration. In this way FGDLS achieves an optimal load balance. Two algorithms have been proposed so far by Tabirca et al. In this article we will review these two algorithms and will give a comparison between their performances.
MaengSoon BAIK SungJin CHOI ChongSun HWANG JoonMin GIL ChanYeol PARK HeonChang YOO
Optimistic log-based rollback recovery protocols have been regarded as an attractive fault-tolerant solution in distributed systems based on message-passing paradigm due to low overhead in failure-free time. These protocols are based on a Piecewise Deterministic (PWD) Assumption model. They, however, assumed that all logged non-deterministic events in a consistent global recovery line must be determinately replayed in recovery time. In this paper, we give the impossibility of deterministic replaying of logged non-deterministic event in a consistent global recovery line as a Ω Line Problem, because of asynchronous properties of distributed systems: no bound on the relative speeds of processes, no bound on message transmission delays and no global time source. In addition, we propose a new optimistic log-based rollback recovery protocol, which guarantees the deterministic replaying of all logged non-deterministic events belonged in a consistent global recovery line and solves a Ω Line Problem in recovery time.
Jacobo TARRIO Juan TOURIÑO María J. MARTIN Patricia GONZALEZ Ramon DOALLO
Grid computing can help to promote high-performance computing at a low overall cost by encouraging research centers to share their resources. However, research staff usually finds it quite hard to use Grids effectively, due to the need of installing and managing new Grid software. Thus, Grid portals are created, making it easier to take advantage of the full capability of the Grid, favoring in this way its use. The goal of this paper is to describe the process of design and implementation of a Grid Portal with the aim of both supporting distributed high-performance resources and make its use by researchers as transparent as possible. This portal uses standard Grid and Web technologies. We have designed the portal so that it can be adapted to different existing Grid infrastructures, based on the Globus Toolkit, and new functionalities can be easily added. The first prototype of the portal has been tested on an experimental Grid platform, and we present encouraging experiences carried out there.
This paper proposes an efficient query evaluation scheme for a mediator system intended to integrate heterogeneous computing environment in terms of operating systems, database management systems, and other software. Most of mediator systems transform a global query into a set of sub-queries based on their target remote servers. Each sub-query is evaluated by the query modification method to evaluate a global query. However, it is possible to reduce the evaluation cost of a global query when the results of frequently requested sub-queries are materialized in a mediator. In a mediator, its integrating schema can be incrementally modified and the evaluation frequency of a global query can also be continuously varied. In order to select the optimized set of materialized sub-queries with respect to their current evaluation frequencies, the proposed method applies a decay factor for modeling the recent access behavior of each sub-query. In other words, the latest access of a sub-query gets the highest attention in the selection process of materialized sub-queries. As a result, it is possible to adjust the optimized set of materialized sub-queries adaptively according to the recent changes in the evaluation frequencies of sub-queries. Since finding the optimum solution of this problem is NP-hard, it takes too long to be used in practice when the number of sub-queries is large. Consequently, given the size of mediator storage, the rank-based selection algorithm proposed in this paper finds the set of materialized sub-queries which minimizes the total evaluation cost of global queries in linear search complexity.
Biplab KUMER SARKER Anil KUMAR TRIPATHI Deo PRAKASH VIDYARTHI Laurence T. YANG Kuniaki UEHARA
In a Distributed Computing Systems (DCS) tasks submitted to it, are usually partitioned into different modules and these modules may be allocated to different processing nodes so as to achieve minimum turn around time of the tasks utilizing the maximum resources of the existing system such as CPU speed, memory capacities etc. The problem lies on how to obtain the optimal allocation of these multiple tasks by keeping in mind that no processing node is overloaded due to this allocation. This paper proposes an algorithm A*RS, using well-known A*, which aims to reduce the search space and time for task allocation. It aims at minimization of turn around time of tasks in the way so that processing nodes do not become overloaded due to this allocation. Our experimental results justify the claims with necessary supports by comparing it with the earlier algorithm for multiple tasks allocation.
Yeu-Horng SHIAU Jer Min JOU Chin-Chi LIU
In this paper, two efficient VLSI architectures for biorthogonal wavelet transform are proposed. One is constructed by the filter bank implementation and another is constructed by the lifting scheme. In the filter bank implementation, due to the symmetric property of biorthogonal wavelet transform, the proposed architecture uses fewer multipliers than the orthogonal wavelet transform. Besides, the polyphase decomposition is adopted to speed up the processing by a factor of 2. In the lifting scheme implementation, the pipeline-scheduling technique is employed to optimize the architecture. Both two architectures are with advantages of lower implementation complexity and higher throughput rate. Moreover, they can also be applied to realize the inverse DWT efficiently. Based on the above properties, the two architectures can be applied to time-critical image compressions, such as JPEG2000. Finally, the architecture constructed by the lifting scheme is implemented into a single chip on 0.35 µm 1P4M CMOS technology, and its area and working performance are 5.005
In this paper, we propose a statistical method of time synchronization for computer clocks that have precisely frequency-synchronized oscillators. This method not only improves the accuracy of time synchronization but also prevents degradation in the frequency stability of the precise oscillators when the errors in the measured time offsets between computer clocks caused by network traffic possess a Gaussian distribution. Improved accuracy of time synchronization is achieved by estimating the confidence interval of the measured time offsets between the clocks. Degradation in frequency stability is prevented by eliminating unnecessary time correction for the computer clock, because time correction generally causes changes in the frequency accuracy and stability of the precise oscillators. To eliminate unnecessary time correction, our method uses an extended hypothesis test of the difference between the current mean and the mean at the last time adjustment to determine whether time correction is needed. Evaluation by simulating changes in the time offset of the existing ISDN clock synchronization system showed that this method achieves accurate time and stable frequency synchronization.
M. M. Hafizur RAHMAN Susumu HORIGUCHI
Interconnection networks play a crucial role in the performance of massively parallel computers. Hierarchical interconnection networks provide high performance at low cost by exploring the locality that exists in the communication patterns of massively parallel computers. A Hierarchical Torus Network (HTN) is a 2D-torus network of multiple basic modules, in which the basic modules are 3D-torus networks that are hierarchically interconnected for higher level networks. The static network performance of the HTN has already been studied and has been shown to be good. Dynamic communication performance has been evaluated under uniform traffic pattern but not under non-uniform traffic patterns. In this paper, we present a deadlock-free routing algorithm for the HTN using 3 virtual channels and evaluate the network's dynamic communication performance under three non-uniform traffic patterns, using the proposed routing algorithm. We evaluate the dynamic communication performance of HTN, H3D-mesh, H3D-torus, TESH, and mesh networks by computer simulation. We find that the dynamic communication performance of HTN is better than that of the H3D-mesh, H3D-torus, TESH, and mesh networks.
Motion capture technology is widely used to make a realistic motion in these days. Different motion capture devices use different motion capture data formats. Because of the lack of compatibility of motion capture data animators can't reuse the already captured motion sequence. In addition, it is difficult for integrating, storing and retrieving motion capture data with different formats in the storage. In this paper, we propose a standard format for integrating a different motion capture data formats. In addition, we propose a framework of a system that manages motion capture data using our standard format. Our standard format is called MCML (Motion Capture Markup Language). It is a markup language for motion capture data and is based on XML (extensible Markup Language). Our system designed to manage motion capture data consists of a several components -- Mocap Syntax Analyzer, MCML Converter, MCML Editor, Motion Viewer, MCML Storage Wrapper.
The alignment of biological sequences is a crucial tool in molecular biology and genome analysis. A wide variety of approaches has been proposed for multiple sequence alignment problem; however, some of them need prerequisites to help find the best alignment or some of them may suffer from the drawbacks of complexity and memory requirement so they can be only applied to cases with a limited number of sequences. In this paper, we view the multiple sequence alignment problem as an optimization problem and propose a heuristic-based genetic algorithm (GA) approach to solve it. The heuristic/GA hybrid yields better results than other well-known packages do. Experimental results are presented to illustrate the feasibility of the proposed approach.
Seiichi NAKAGAWA Naoki NAKAMURA Kazumasa MORI
In this paper, we propose a statistical method of evaluating the pronunciation proficiency of English words spoken by Japanese. We analyzed statistically the utterances to note a combination that has a high correlation between an English teacher's score and certain acoustic features. We obserbed that the phoneme recognition rates (correct rate and accuracy) were the best measure of pronunciation proficiency, and the likelihood ratio of English phoneme acoustic models to phoneme acoustic models adapted by Japanese was the second best measure. The effective measure which was highly correlated with the English teacher's score was the combination of the likelihood for American native models, likelihood for English models adapted by Japanese, the best likelihood for arbitrary sequences of acoustic models, phoneme recognition rate and the rate of speech. We obtained a correlation coefficient of 0.81 with an open data for vocabulary and 0.69 with open data for speaker at the five words set level, respectively. The coefficient was higher than the correlation between humans' scores, 0.65. In the 15 words set level which corresponds to one or two sentences, we obtained the correlation coefficient of 0.86 with open data for the speaker.
Iris recognition is used to identify a user based on the iris texture information which exists between the white sclera and the black pupil. For fast iris recognition, it is very important to capture user's focused eye image at fast speed. If not, the total recognition time is increased and it makes the user feel much inconvenience. In previous researches and systems, they use the focusing method which has been used for general landscape scene without considering the characteristics of iris image. So, they take much focusing time sometimes, especially in case of the user with glasses. To overcome such problems, we propose a new iris image acquisition method to capture user's focused eye image at very fast speed based on the corneal specular reflection. Experimental results show that the focusing time for both the users with glasses and without glasses is average 480 ms and we can conclude our method can be used for the real-time iris recognition camera.
Yasuyuki SUGAYA Kenichi KANATANI
Many techniques have been proposed for segmenting feature point trajectories tracked through a video sequence into independent motions, but objects in the scene are usually assumed to undergo general 3-D motions. As a result, the segmentation accuracy considerably deteriorates in realistic video sequences in which object motions are nearly degenerate. In this paper, we propose a multi-stage unsupervised learning scheme first assuming degenerate motions and then assuming general 3-D motions and show by simulated and real video experiments that the segmentation accuracy significantly improves without compromising the accuracy for general 3-D motions.
Hanxi ZHU Ikuo YOSHIHARA Kunihito YAMAMORI Moritoshi YASUNAGA
We have developed Multi-modal Neural Networks (MNN) to improve the accuracy of symbolic sequence pattern classification. The basic structure of the MNN is composed of several sub-classifiers using neural networks and a decision unit. Two types of the MNN are proposed: a primary MNN and a twofold MNN. In the primary MNN, the sub-classifier is composed of a conventional three-layer neural network. The decision unit uses the majority decision to produce the final decisions from the outputs of the sub-classifiers. In the twofold MNN, the sub-classifier is composed of the primary MNN for partial classification. The decision unit uses a three-layer neural network to produce the final decisions. In the latter type of the MNN, since the structure of the primary MNN is folded into the sub-classifier, the basic structure of the MNN is used twice, which is the reason why we call the method twofold MNN. The MNN is validated with two benchmark tests: EPR (English Pronunciation Reasoning) and prediction of protein secondary structure. The reasoning accuracy of EPR is improved from 85.4% by using a three-layer neural network to 87.7% by using the primary MNN. In the prediction of protein secondary structure, the average accuracy is improved from 69.1% of a three-layer neural network to 74.6% by the primary MNN and 75.6% by the twofold MNN. The prediction test is based on a database of 126 non-homologous protein sequences.
Jiahai WANG Zheng TANG Ronglong WANG
In this paper, based on maximum neural network, we propose a new parallel algorithm that can escape from local minima and has powerful ability of searching the globally optimal or near-optimum solution for the maximum clique problem (MCP). In graph theory a clique is a completely connected subgraph and the MCP is to find a clique of maximum size of a graph. The MCP is a classic optimization problem in computer science and in graph theory with many real-world applications, and is also known to be NP-complete. Lee and Takefuji have presented a very powerful neural approach called maximum neural network for this NP-complete problem. The maximum neural model always guarantees a valid solution and greatly reduces the search space without a burden on the parameter-tuning. However, the model has a tendency to converge to the local minimum easily because it is based on the steepest descent method. By adding a negative self-feedback to the maximum neural network, we proposed a parallel algorithm that introduces richer and more flexible chaotic dynamics and can prevent the network from getting stuck at local minima. After the chaotic dynamics vanishes, the proposed algorithm is then fundamentally reined by the gradient descent dynamics and usually converges to a stable equilibrium point. The proposed algorithm has the advantages of both the maximum neural network and the chaotic neurodynamics. A large number of instances have been simulated to verify the proposed algorithm.
Sung Woo CHUNG Gi Ho PARK Sung Bae PARK
This letter proposes a low-power tournament branch predictor, in which the number of accesses to the branch predictors (local predictor or global predictor) is reduced. Analysis results with Samsung Memory Compiler show that the proposed branch predictor reduces the power consumption by 24-45%, compared to the conventional tournament branch predictor, not requiring any additional storage arrays, not incurring any additional delay and never harming accuracy.
Chanjin PARK Euyseok HONG Chisu WU
This paper proposes a new type of relationship between layers in layered architecture and shows how to concretize the relationship between layers into design constraints. The meaning of layer relationship is explained with examples from design patterns and Microsoft COM. In addition, a prototype tool to check conformance is implemented and the architecture document of an open-source software project is checked against the actual architecture extracted from source code developed by many international developers. As a result of checking, parts that do not conform to the architecture document are investigated and it is pointed out that their modifications should be controlled with caution.
BaekSop KIM HyeJeong SONG JongDae KIM
This paper presents a local learning framework in which the local classifiers can be pre-learned and the support size of each classifier can be selected to minimize the error bound. The proposed algorithm is compared with the conventional support vector machine (SVM). Experimental results show that our scheme using the user-defined parameters C and σ is more accurate and less sensitive than the conventional SVM.
Hong Kook KIM Mi Suk LEE Chul Hong KWON
A new excitation enhancement technique based on a harmonic model is proposed in this paper to improve the speech quality of low-bit-rate speech coders. This technique is employed only in the decoding process of speech coders and improves high-frequency components of excitation. We develop the procedure of harmonic model parameters estimation and harmonic generation and apply the technique to a current state-of-art low bit rate speech coder. Experiments on spectrum reading and spectrum distortion measurement show that the proposed excitation enhancement technique improves speech quality.
This letter introduces a pre-rejection technique for wireless channel distorted speech with application to automatic speech recognition (ASR). Based on analysis of distorted speech signals over a wireless communication channel, we propose a method to reject the channel distorted speech with a small computational load. From a number of simulation results, we can discover that the pre-rejection algorithm enhances the robustness of speech recognition operation.
In this letter we suggest sets of features to classify genres of web documents. Web documents are different from textual documents in that they contain URL and HTML tags within the pages. We introduce the features specific to web documents, which are extracted from URL and HTML tags. Experimental results enable us to evaluate their characteristics and performances. On the basis of the experimental results, we implement a user interface of a web search engine that presents documents grouped by genres.
A method is presented for selecting items asked for new users to input their preference rates on those items in recommendation systems based on the collaborative filtering. Optimal item selection is formulated by an integer programming problem and we solve it by using a kind of the Hopfield-network-like scheme for interior point methods.