Keyword Search Result

[Keyword] distributed memory (10 hits)

Hits 1-10 of 10
  • A Novel Sequential Tree Algorithm Based on Scoreboard for MPI Broadcast Communication

    Won-young CHUNG  Jae-won PARK  Seung-Woo LEE  Won Woo RO  Yong-surk LEE  

     
    LETTER-Computer System

    Vol: E94-D No:12  Page(s): 2523-2527

    Message passing interface (MPI) broadcast communication commonly causes a severe performance bottleneck in multicore systems that use distributed memory. In this paper, we therefore propose a novel algorithm and hardware structure for MPI broadcast communication that reduces this bottleneck. The transmission order is set according to the state of each processing node in the multicore system, so the algorithm minimizes the performance degradation caused by conflicts. The proposed scoreboard MPI unit is evaluated by modeling it with SystemC and is implemented in Verilog HDL. It occupies less than 1.03% of the whole chip and improves performance by up to 75.48% with 16 processing nodes. Hence, with respect to low-cost design and scalability, this scoreboard MPI unit is particularly useful for increasing the overall performance of embedded MPSoCs.
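
    As a software analogue of the transmissions being scheduled, the sketch below implements a binomial-tree broadcast from plain MPI point-to-point calls. It only illustrates the kind of ordered tree transmissions involved; the paper's scoreboard-based, state-dependent ordering is realized in hardware and is not modeled here, and all names in the sketch are illustrative.

      /* Minimal sketch: a binomial-tree broadcast built from MPI point-to-point
       * calls.  The scoreboard unit's state-based transmission ordering is not
       * modeled; this only shows the tree-structured sends it would schedule. */
      #include <mpi.h>
      #include <stdio.h>

      static void tree_bcast(void *buf, int count, MPI_Datatype type,
                             int root, MPI_Comm comm)
      {
          int rank, size;
          MPI_Comm_rank(comm, &rank);
          MPI_Comm_size(comm, &size);

          int rel = (rank - root + size) % size;   /* rank relative to the root */

          /* Receive once from the parent (all ranks except the root). */
          int mask = 1;
          while (mask < size) {
              if (rel & mask) {
                  int parent = (rank - mask + size) % size;
                  MPI_Recv(buf, count, type, parent, 0, comm, MPI_STATUS_IGNORE);
                  break;
              }
              mask <<= 1;
          }

          /* Forward to the children, farthest subtree first. */
          mask >>= 1;
          while (mask > 0) {
              if (rel + mask < size) {
                  int child = (rank + mask) % size;
                  MPI_Send(buf, count, type, child, 0, comm);
              }
              mask >>= 1;
          }
      }

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank, value = 0;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          if (rank == 0) value = 42;               /* data to broadcast */
          tree_bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
          printf("rank %d received %d\n", rank, value);
          MPI_Finalize();
          return 0;
      }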

  • A Low-Cost Standard Mode MPI Hardware Unit for Embedded MPSoC

    Won-young CHUNG  Ha-young JEONG  Won Woo RO  Yong-surk LEE  

     
    LETTER-Computer System

    Vol: E94-D No:7  Page(s): 1497-1501

    In this paper, we propose a novel low-cost Message Passing Interface (MPI) unit between processor nodes, which supports message passing in multiprocessor systems with a distributed memory architecture. Our MPI unit operates in the standard mode, using the buffered mode for small data transfers and the synchronous mode for large ones. This increases performance by reducing the control-message transmission time for small messages. We verified the performance with a simulator built on SystemC. Additionally, we designed the MPI unit in Verilog HDL and synthesized it with the Synopsys Design Compiler. The proposed standard mode MPI unit delivers high performance even though it occupies less than 1% of the whole chip. Thus, with respect to low-cost design and scalability, this MPI hardware unit is useful for increasing the overall performance of an embedded Multiprocessor System-on-Chip (MPSoC).
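
    The standard-mode policy described above can be pictured in software as choosing MPI's buffered mode for small messages and its synchronous mode for large ones, as in the hedged sketch below. The 4 KB threshold and the function names are assumptions for illustration only; the paper's unit implements the policy in hardware.

      /* Sketch of the standard-mode policy: buffered sends for small messages,
       * synchronous sends for large ones.  The threshold is an assumed value. */
      #include <mpi.h>
      #include <stdlib.h>

      #define SMALL_MSG_LIMIT 4096   /* assumed cut-over point, in bytes */

      /* A buffer must be attached once before any MPI_Bsend can be issued;
       * this one holds a single pending small message, enough for the sketch. */
      static void attach_bsend_buffer(void)
      {
          int size = SMALL_MSG_LIMIT + MPI_BSEND_OVERHEAD;
          MPI_Buffer_attach(malloc(size), size);
      }

      static void standard_mode_send(const void *buf, int bytes, int dest, MPI_Comm comm)
      {
          if (bytes <= SMALL_MSG_LIMIT)
              MPI_Bsend(buf, bytes, MPI_BYTE, dest, 0, comm);   /* buffered mode    */
          else
              MPI_Ssend(buf, bytes, MPI_BYTE, dest, 0, comm);   /* synchronous mode */
      }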

  • A Performance Comparison of the Parallel Preconditioners for Iterative Methods for Large Sparse Linear Systems Arising from Partial Differential Equations on Structured Grids

    Sangback MA  

     
    PAPER-Numerical Analysis and Optimization

    Vol: E91-A No:9  Page(s): 2578-2587

    In this paper we compare various parallel preconditioners such as Point-SSOR (Symmetric Successive OverRelaxation), ILU(0) (Incomplete LU) in the Wavefront ordering, ILU(0) in the Multi-color ordering, Multi-Color Block SOR (Successive OverRelaxation), SPAI (SParse Approximate Inverse), and pARMS (Parallel Algebraic Recursive Multilevel Solver) for solving large sparse linear systems arising from two-dimensional PDEs (Partial Differential Equations) on structured grids. Point-SSOR is well known, and ILU(0) is one of the most popular preconditioners, but it is inherently serial. ILU(0) in the Wavefront ordering maximizes the parallelism available in the natural ordering, but the lengths of the wavefronts are often nonuniform. ILU(0) in the Multi-color ordering is a simple way of achieving parallelism of order N, where N is the order of the matrix, but its convergence rate often deteriorates compared with that of the natural ordering. We have chosen the Multi-Color Block SOR preconditioner combined with a direct sparse matrix solver, since for the Laplacian matrix the SOR method is known to have a nondeteriorating rate of convergence when used with the Multi-Color ordering. By using the block version we expect to minimize interprocessor communication. SPAI computes the sparse approximate inverse directly by a least squares method. Finally, ARMS is a preconditioner that recursively exploits the concept of independent sets, and pARMS is its parallel version. Experiments were conducted for Finite Difference and Finite Element discretizations of five two-dimensional PDEs with large mesh sizes of up to a million on an IBM p595 machine with distributed memory. Our matrices are real positive, i.e., the real parts of their eigenvalues are positive. We have used GMRES(m) as the outer iterative method, so that the convergence of GMRES(m) for our test matrices is mathematically guaranteed. Interprocessor communication was done using MPI (Message Passing Interface) primitives. The results show that in general ILU(0) in the Multi-Color ordering and ILU(0) in the Wavefront ordering outperform the other methods, but for symmetric and nearly symmetric 5-point matrices Multi-Color Block SOR gives the best performance, except for a few cases with a small number of processors.
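
    To make the parallelism of a multi-color ordering concrete, the sketch below performs one red-black (two-color) SOR sweep for the 5-point Laplacian: every point of one color depends only on points of the other color, so all points of a color can be updated simultaneously. The serial C code, grid layout, and parameter names are illustrative assumptions, not the paper's distributed implementation.

      /* One red-black SOR sweep for -Laplace(u) = f on an n-by-n grid with mesh
       * width h (5-point stencil; boundary values are held fixed in u).  Points
       * of the same color are mutually independent, which is what Multi-Color
       * (Block) SOR exploits on a distributed memory machine. */
      #include <stddef.h>

      void red_black_sor_sweep(double *u, const double *f,
                               size_t n, double h, double omega)
      {
          for (int color = 0; color < 2; ++color)           /* red, then black */
              for (size_t i = 1; i + 1 < n; ++i)
                  for (size_t j = 1; j + 1 < n; ++j) {
                      if ((i + j) % 2 != (size_t)color)
                          continue;
                      /* Gauss-Seidel value from the four (other-color) neighbors */
                      double gs = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j]
                                        + u[i * n + (j - 1)] + u[i * n + (j + 1)]
                                        + h * h * f[i * n + j]);
                      u[i * n + j] += omega * (gs - u[i * n + j]);   /* SOR update */
                  }
      }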

  • Essential Cycle Calculation Method for Irregular Array Redistribution

    Sheng-Wen BAI  Chu-Sing YANG  

     
    PAPER-Computation and Computational Models

    Vol: E89-D No:2  Page(s): 789-797

    In many parallel programs, run-time array redistribution is usually required to enhance data locality and reduce remote memory accesses on distributed memory multicomputers. In general, array distribution can be classified into regular and irregular distribution according to the distribution pattern. Many methods for performing regular array redistribution have been presented in the literature; for heterogeneous computing environments, however, irregular array redistribution can be used to adjust the data assignment at run time. In this paper, an Essential Cycle Calculation (ECC) method for array redistribution with unequal block sizes is presented. In the ECC method, a processor first computes the source/destination processor/data sets of the array elements in the first essential cycle of the local array it owns. From these sets, packing/unpacking pattern tables can be constructed. Since every essential cycle has the same communication pattern, a processor can then pack/unpack array elements efficiently based on the pattern tables. To evaluate the performance of the ECC method, we implemented it on an IBM SP2 parallel machine and compared it with the Sequence method. Cost models for both methods are also presented. The experimental results show that the ECC method greatly outperforms the Sequence method for all test samples.
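
    The reuse idea at the heart of ECC can be sketched as follows: the pattern computed for the first essential cycle is stored in a table and then simply replayed for every later cycle when packing send buffers. The flat table layout, names, and element type below are illustrative assumptions, not the paper's actual data structures.

      /* Replay of a first-cycle packing pattern over all essential cycles.
       * Each pattern entry records which destination processor the element at a
       * given offset within a cycle belongs to; because every essential cycle
       * shares the same communication pattern, the table is computed once and
       * reused for all cycles. */
      typedef struct {
          int dest;      /* destination processor of this element     */
          int offset;    /* element offset inside one essential cycle */
      } pack_entry;

      void pack_by_pattern(const double *local_array, int cycle_len, int num_cycles,
                           const pack_entry *pattern, int pattern_len,
                           double **send_buf, int *send_count /* per destination */)
      {
          for (int c = 0; c < num_cycles; ++c)
              for (int e = 0; e < pattern_len; ++e) {
                  const pack_entry *p = &pattern[e];
                  send_buf[p->dest][send_count[p->dest]++] =
                      local_array[c * cycle_len + p->offset];
              }
      }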

  • Efficient Loop Partitioning for Parallel Codes of Irregular Scientific Computations

    Minyi GUO  

     
    PAPER-Software Systems

    Vol: E86-D No:9  Page(s): 1825-1834

    In most distributed memory computations, node programs are executed on processors according to the owner computes rule. However, the owner computes rule is not well suited to irregular application codes. In such codes, the use of indirection in accessing the left-hand-side array makes it difficult to partition the loop iterations, and because indirection is also used in accessing right-hand-side elements, total communication may be reduced by using heuristics other than the owner computes rule. In this paper, we propose a communication-cost-reducing computes rule for irregular loop partitioning, called the least communication computes rule: each loop iteration is assigned to the processor on which executing that iteration incurs the minimal communication cost. After all iterations have been partitioned among the processors, we give a global-versus-local data transformation rule, indirection array remapping, and communication optimization methods. The experimental results show that, in most cases, our approaches achieve better performance than other loop partitioning rules.
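
    The assignment step can be sketched as below: for an irregular loop that reads and writes array elements through indirection arrays, each iteration is scored against every candidate processor and assigned to the one that already owns the most referenced elements, i.e. the one needing the fewest remote accesses. The owner maps, the flat layout, and the unit cost per remote access are assumptions for illustration, not the paper's exact formulation.

      /* Least-communication assignment of loop iterations.  owner_lhs[i] is the
       * processor owning the element written by iteration i; owner_rhs[i*nrhs+k]
       * are the processors owning the nrhs elements it reads.  Each iteration is
       * assigned to the processor that owns the most of its referenced elements. */
      void least_comm_partition(int niter, int nproc, int nrhs,
                                const int *owner_lhs, const int *owner_rhs,
                                int *assign /* out: chosen processor per iteration */)
      {
          for (int i = 0; i < niter; ++i) {
              int best_proc = 0, best_local = -1;
              for (int p = 0; p < nproc; ++p) {
                  int local = (owner_lhs[i] == p);              /* written element */
                  for (int k = 0; k < nrhs; ++k)
                      local += (owner_rhs[i * nrhs + k] == p);  /* read elements   */
                  if (local > best_local) {
                      best_local = local;
                      best_proc  = p;
                  }
              }
              assign[i] = best_proc;   /* fewest remote accesses for iteration i */
          }
      }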

  • Iterative Methods for Dense Linear Systems on Distributed Memory Parallel Computers

    Muneharu YOKOYAMA  Takaomi SHIGEHARA  Hiroshi MIZOGUCHI  Taketoshi MISHIMA  

     
    PAPER

    Vol: E82-A No:3  Page(s): 483-486

    The Conjugate Residual method, one of the iterative methods for solving linear systems, is applied to problems with a dense coefficient matrix on distributed memory parallel computers. Based on an assumption about the computation and communication times of the proposed algorithm on parallel computers, it is shown that the optimal number of processing elements is proportional to the problem size N. The validity of this prediction is confirmed through numerical experiments on the Hitachi SR2201.
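
    For reference, a serial sketch of the Conjugate Residual iteration for a dense symmetric system Ax = b is given below. The paper's contribution concerns the distributed-memory parallelization of this iteration and the resulting optimal processor count; the single-processor code, tolerance, and iteration limit here are only illustrative.

      /* Serial Conjugate Residual iteration for a dense symmetric system Ax = b.
       * The matrix is stored row-major; tol and max_iter are caller-chosen. */
      #include <math.h>
      #include <stdlib.h>

      static void matvec(int n, const double *A, const double *v, double *out)
      {
          for (int i = 0; i < n; ++i) {
              double s = 0.0;
              for (int j = 0; j < n; ++j) s += A[i * n + j] * v[j];
              out[i] = s;
          }
      }

      static double dot(int n, const double *x, const double *y)
      {
          double s = 0.0;
          for (int i = 0; i < n; ++i) s += x[i] * y[i];
          return s;
      }

      void conjugate_residual(int n, const double *A, const double *b,
                              double *x, double tol, int max_iter)
      {
          double *r  = malloc(n * sizeof *r),  *p  = malloc(n * sizeof *p);
          double *Ar = malloc(n * sizeof *Ar), *Ap = malloc(n * sizeof *Ap);

          matvec(n, A, x, Ar);                               /* temporary: A*x        */
          for (int i = 0; i < n; ++i) { r[i] = b[i] - Ar[i]; p[i] = r[i]; }
          matvec(n, A, r, Ar);                               /* Ar = A*r, Ap = A*p    */
          for (int i = 0; i < n; ++i) Ap[i] = Ar[i];
          double rAr = dot(n, r, Ar);

          for (int it = 0; it < max_iter && sqrt(dot(n, r, r)) > tol; ++it) {
              double alpha = rAr / dot(n, Ap, Ap);
              for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
              matvec(n, A, r, Ar);                           /* Ar = A*r (updated r)  */
              double rAr_new = dot(n, r, Ar);
              double beta = rAr_new / rAr;
              for (int i = 0; i < n; ++i) {
                  p[i]  = r[i]  + beta * p[i];
                  Ap[i] = Ar[i] + beta * Ap[i];
              }
              rAr = rAr_new;
          }
          free(r); free(p); free(Ar); free(Ap);
      }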

  • Efficient Implementation of Multi-Dimensional Array Redistribution

    Minyi GUO  Yoshiyuki YAMASHITA  Ikuo NAKATA  

     
    PAPER-Software Systems

    Vol: E81-D No:11  Page(s): 1195-1204

    Array redistribution is required very often in programs on distributed memory parallel computers. It is essential to use efficient algorithms for redistribution; otherwise the performance of such programs may degrade considerably. In this paper, we focus on the automatic generation of communication routines for multi-dimensional redistribution. The principal advantage of this work is the ability to handle redistribution between arbitrary source and destination processor sets and between arbitrary source and destination distribution schemes. We have implemented these algorithms using the Parallelware communication library. Experimental results show the efficiency and flexibility of our techniques compared with other redistribution approaches.
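
    At the core of any generated redistribution routine is an index-set intersection. The sketch below computes it for the simplest regular case, a 1-D BLOCK-to-BLOCK redistribution between different processor counts, just to show what such routines must determine; the paper's general multi-dimensional, arbitrary-scheme machinery is not reproduced, and the function name is an assumption.

      /* For a 1-D array of length n redistributed from BLOCK over p_src
       * processors to BLOCK over p_dst processors, compute the half-open global
       * index range [*lo, *hi) that source processor s must send to destination
       * processor d.  Returns 0 when the two local segments do not overlap. */
      int block_block_intersection(long n, int p_src, int s, int p_dst, int d,
                                   long *lo, long *hi)
      {
          long bs_src = (n + p_src - 1) / p_src;        /* source block size      */
          long bs_dst = (n + p_dst - 1) / p_dst;        /* destination block size */

          long src_lo = s * bs_src, src_hi = src_lo + bs_src;
          long dst_lo = d * bs_dst, dst_hi = dst_lo + bs_dst;
          if (src_hi > n) src_hi = n;
          if (dst_hi > n) dst_hi = n;

          *lo = src_lo > dst_lo ? src_lo : dst_lo;      /* max of lower bounds */
          *hi = src_hi < dst_hi ? src_hi : dst_hi;      /* min of upper bounds */
          return *lo < *hi;
      }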

  • Accelerated Composition for Parallel Volume Rendering

    Tetu HIRAI  Tsuyoshi YAMAMOTO  

     
    PAPER-Image Processing, Computer Graphics and Pattern Recognition

    Vol: E81-D No:1  Page(s): 81-87

    We describe an algorithm for efficiently compositing the partial images generated during parallel volume rendering on a distributed memory parallel computer. In this object-space partitioning algorithm, each PE is assigned several subvolumes, each of which has a corresponding local frame buffer. After volume rendering is performed independently for each subvolume, the partial images stored in the local frame buffers are combined to generate the complete image. During this compositing process, the communication of partial image data between the PEs is kept minimal by assigning PEs to subvolumes in an interleaved manner. This assignment reduces communication along the axis direction that would otherwise require the most communication. Experimental results indicate that a 9% to 35% reduction in the total rendering time can be attained with no additional data structures and no memory overhead.
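
    The per-pixel operation applied when partial images are combined is the standard front-to-back "over" compositing step, sketched below for alpha-premultiplied RGBA samples. This shows only the operation itself; the paper's contribution, the interleaved PE-to-subvolume assignment that decides which PEs must exchange these pixels, is not modeled.

      /* Front-to-back "over" compositing of one pixel: accumulate the sample of
       * the farther subvolume (back) into the running pixel of the nearer one
       * (front).  Components are assumed to be alpha-premultiplied RGBA floats. */
      void composite_over(float front[4], const float back[4])
      {
          float transparency = 1.0f - front[3];    /* how much the back still shows */
          for (int c = 0; c < 4; ++c)
              front[c] += transparency * back[c];
      }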

  • Software Cache Techniques for Memory Nodes in Distributed Memory Parallel Production Systems

    Jun MIYAZAKI   Haruo YOKOTA  

     
    PAPER-Architectures

    Vol: E79-D No:8  Page(s): 1046-1054

    Because the match phase in OPS5-type production systems accounts for most of the system's execution time and memory accesses, we previously proposed hash-based parallel production systems, CPPS (Clustered Parallel Production Systems), based on the RETE algorithm, for distributed memory parallel computers (multicomputers) to reduce this bottleneck. CPPS was effective in speeding up the match phase but still left room for optimization. In this paper, we introduce software cache techniques for the memory nodes in CPPS as one such optimization and implement them on a multicomputer, the nCUBE2. The benchmark results show that CPPS with the software cache is about 2-fold faster than the original and more than 7-fold faster than the simple hash method proposed by Acharya et al. for a large-scale problem. The speed-up can be attributed to decreased communication costs.
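
    The flavor of the optimization can be sketched as a direct-mapped software cache placed in front of a memory node: a match-phase lookup first probes the local cache and only performs remote communication on a miss. The table size, hash, key type, and the remote_fetch() stub below are illustrative assumptions, not the CPPS design.

      /* Direct-mapped software cache in front of a remote memory node.  On a hit
       * the value is returned locally; on a miss it is fetched remotely (the
       * expensive communication this cache tries to avoid) and installed. */
      #include <stdint.h>

      #define CACHE_LINES 1024                 /* assumed cache size */

      typedef struct {
          uint64_t key;                        /* assumed: key 0 means "empty" */
          uint64_t value;
      } cache_line;

      static cache_line cache[CACHE_LINES];

      /* Stand-in for the actual message exchange with the owning memory node. */
      extern uint64_t remote_fetch(uint64_t key);

      uint64_t cached_lookup(uint64_t key)
      {
          cache_line *line = &cache[key % CACHE_LINES];
          if (line->key == key)
              return line->value;              /* hit: no communication */
          line->key   = key;                   /* miss: fetch remotely and install */
          line->value = remote_fetch(key);
          return line->value;
      }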

  • Minimizing the Data Transfer in Evaluating an Expression in a Distributed-Memory Parallel-Processing System

    Hiroshi OHTA  Kousuke SAKODA  Koichiro ISHIHARA  

     
    PAPER-Computer Systems

    Vol: E77-D No:3  Page(s): 288-298

    In a distributed-memory parallel-processing system, the overhead of data transfer among the processors is so large that it is important to reduce it. We consider the data transfers involved in evaluating an expression over data distributed among the processors. We propose algorithms that assign the operators in the expression to processors so as to minimize the number or the cost of data transfers, on the condition that the allocation of data to the processors is given. The basic algorithm is given first, followed by some variations.
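
    A minimal sketch of the underlying optimization is given below: for an expression tree whose leaves are data items with fixed processor locations, a bottom-up dynamic program chooses a processor for every operator so that the number of operand transfers is minimized. This only illustrates the problem being solved; it is not the paper's specific algorithms, and the tree layout and names are assumptions.

      /* Bottom-up DP over a binary expression tree.  cost[p] for a node is the
       * minimum number of transfers within its subtree when the node is executed
       * on processor p (for a leaf: when its value is made available on p).
       * Moving any value between processors costs one transfer. */
      #include <limits.h>
      #include <stdlib.h>

      typedef struct node {
          struct node *left, *right;   /* both NULL for leaves */
          int owner;                   /* leaves only: processor holding the data */
      } node;

      static int *solve(const node *v, int nproc)
      {
          int *cost = malloc(nproc * sizeof *cost);
          if (!v->left) {                                   /* leaf: fixed location */
              for (int p = 0; p < nproc; ++p)
                  cost[p] = (p == v->owner) ? 0 : 1;        /* 1 transfer to move it */
              return cost;
          }
          int *cl = solve(v->left, nproc), *cr = solve(v->right, nproc);
          for (int p = 0; p < nproc; ++p) {
              /* cheapest way to get each operand's value onto processor p */
              int best_l = INT_MAX, best_r = INT_MAX;
              for (int q = 0; q < nproc; ++q) {
                  int l = cl[q] + (q != p), r = cr[q] + (q != p);
                  if (l < best_l) best_l = l;
                  if (r < best_r) best_r = r;
              }
              cost[p] = best_l + best_r;
          }
          free(cl); free(cr);
          return cost;
      }

      /* Minimum number of transfers to evaluate the whole expression; the final
       * result may end up on any processor. */
      int min_transfers(const node *root, int nproc)
      {
          int *cost = solve(root, nproc), best = INT_MAX;
          for (int p = 0; p < nproc; ++p)
              if (cost[p] < best) best = cost[p];
          free(cost);
          return best;
      }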
