Keyword Search Result

[Keyword] matrix multiplication (8 hits)

  • Algorithms for Evaluating the Matrix Polynomial I+A+A^2+…+A^{N-1} with Reduced Number of Matrix Multiplications

    Kotaro MATSUMOTO  Kazuyoshi TAKAGI  Naofumi TAKAGI  

    PAPER-Algorithms and Data Structures

    E101-A No:2

    The problem of evaluating the matrix polynomial I+A+A^2+…+A^{N-1} with a reduced number of matrix multiplications has long been considered. Several algorithms have been proposed for this problem, which find a procedure requiring O(log N) matrix multiplications for a given N. Among them, the hybrid algorithm based on the double-base representation of N, i.e., using mixed radices 2 and 3, proposed by Dimitrov and Cooklev is the most efficient. They suggested that the use of higher radices would not yield more efficient algorithms. In this paper, we show that we can derive more efficient algorithms by using higher radices, and propose several efficient algorithms.
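
To make the O(log N) claim concrete, here is a minimal sketch of the plain radix-2 (binary) approach, not the double-base or higher-radix algorithms the abstract refers to. It relies on the identities S_{2k} = S_k + A^k·S_k and S_{2k+1} = S_{2k} + A^{2k}, where S_k = I+A+…+A^{k-1}. The names `power_sum`, `mat_mul`, and `mat_add` are illustrative assumptions, and the sketch makes no attempt to trim multiplications that are not strictly needed.

```python
def mat_mul(X, Y):
    """Classical product of two matrices given as lists of rows."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]


def mat_add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def power_sum(A, N):
    """Return (S, mults): S = I + A + ... + A^(N-1) and the number of
    matrix multiplications used. Assumes A is square and N >= 1."""
    n = len(A)
    S = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # S_1 = I
    P = [row[:] for row in A]                                       # P_1 = A^1
    mults = 0
    # Scan the bits of N after the leading 1, most significant first.
    for bit in bin(N)[3:]:
        # Doubling step: S_{2k} = S_k + A^k * S_k, and A^{2k} = A^k * A^k.
        S = mat_add(S, mat_mul(P, S))
        P = mat_mul(P, P)
        mults += 2
        if bit == "1":
            # Increment step: S_{2k+1} = S_{2k} + A^{2k} needs no product;
            # only the running power needs one more multiplication.
            S = mat_add(S, P)
            P = mat_mul(P, A)
            mults += 1
    return S, mults
```

Each bit of N costs at most three multiplications in this sketch, so the total grows with log N rather than N. The algorithms in the abstract aim to lower that count further.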

  • A Cloud-Friendly Communication-Optimal Implementation for Strassen's Matrix Multiplication Algorithm

    Jie ZHOU  Feng YU  

    PAPER-Fundamentals of Information Systems

    E98-D No:11

    Due to its on-demand and pay-as-you-go properties, cloud computing has become an attractive alternative for HPC applications. However, communication-intensive applications with complex communication patterns still cannot be performed efficiently on cloud platforms, which are equipped with MapReduce technologies such as Hadoop and Spark. In particular, one major obstacle is that MapReduce's simple programming model cannot explicitly manipulate data transfers between compute nodes. Another obstacle is the cloud's relatively poor network performance compared with traditional HPC platforms. The traditional Strassen's algorithm for square matrix multiplication has a recursive and complex pattern on the HPC platform. Therefore, it cannot be directly applied to the cloud platform. In this paper, we demonstrate how to make Strassen's algorithm with complex communication patterns “cloud-friendly”. By reorganizing Strassen's algorithm in an iterative pattern, we completely separate its computations and communications, making it fit the MapReduce programming model. By adopting a novel data/task parallel strategy, we solve Strassen's data dependency problems, making it well balanced. This is the first instance of Strassen's algorithm in MapReduce-style systems, which also matches Strassen's communication lower bound. Further experimental results show that it achieves a speedup ranging from 1.03× to 2.50× over the classical Θ(n^3) algorithm. We believe the principle can be applied to many other complex scientific applications.
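
For reference, the recursive pattern the abstract describes can be sketched as follows. This is the textbook recursive form of Strassen's algorithm with seven sub-products per level, not the iterative MapReduce reorganization proposed in the paper. The function names and the `cutoff` parameter are illustrative assumptions, and the sketch assumes square matrices whose size is a power of two.

```python
def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def classical(X, Y):
    """Classical cubic-time product, used as the recursion base case."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def split(M):
    """Split a square matrix into its four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, cutoff=1):
    """Multiply square matrices whose size is a power of two (assumption)."""
    n = len(A)
    if n <= cutoff:
        return classical(A, B)
    a11, a12, a21, a22 = split(A)
    b11, b12, b21, b22 = split(B)
    # Seven recursive products instead of the eight a naive block scheme needs.
    m1 = strassen(add(a11, a22), add(b11, b22), cutoff)
    m2 = strassen(add(a21, a22), b11, cutoff)
    m3 = strassen(a11, sub(b12, b22), cutoff)
    m4 = strassen(a22, sub(b21, b11), cutoff)
    m5 = strassen(add(a11, a12), b22, cutoff)
    m6 = strassen(sub(a21, a11), add(b11, b12), cutoff)
    m7 = strassen(sub(a12, a22), add(b21, b22), cutoff)
    # Combine the products into the four quadrants of the result.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

Each level needs the sums and differences of quadrants before its products can start and the products before the quadrants can be combined, which is the data dependency the abstract refers to.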

  • The Optimal Architecture Design of Two-Dimension Matrix Multiplication Jumping Systolic Array

    Yun YANG  Shinji KIMURA  


    E91-A No:4

    This paper proposes an efficient systolic array construction method for optimal planar systolic design of matrix multiplication. By adjusting the connection network among the systolic array's processing elements (PEs), the input/output data jump within the systolic array as the multiplication operation requires. Various 2-D systolic array topologies, such as square topology and hexagonal topology, have been studied to construct an appropriate systolic array configuration and realize high-performance matrix multiplication. Based on the traditional Kung-Leiserson systolic architecture, the proposed "Jumping Systolic Array (JSA)" algorithm can increase the matrix multiplication speed with fewer processing elements and few attached data registers. New systolic arrays, such as the square jumping array, the redundant dummy latency jumping hexagonal array, and the compact parallel flow jumping hexagonal array, are also proposed to improve the efficiency of concurrent system operation. Experimental results show that the JSA algorithm can realize fully concurrent operation and outperform other systolic architectures in specific systolic array system characteristics, such as bandwidth, matrix complexity, or expansion capability.

  • RSFQ Baseband Digital Signal Processing

    Anna Yurievna HERR  


    E91-C No:3

    The ultra-fast switching speed of superconducting digital circuits enables the realization of Digital Signal Processors with performance unattainable by any other technology. Based on rapid single flux quantum (RSFQ) logic, these integrated circuits are capable of delivering high computation capacity up to 30 GOPS on a single processor and very short latency of 0.1 ns. There are two main applications of such hardware for practical telecommunication systems: filters for superconducting ADCs operating with digital RF data and recursive filters at baseband. The latter of these allows functions such as multiuser detection for 3G WCDMA, equalization and channel precoding for 4G OFDM MIMO, and general blind detection. The performance gain is an increase in the cell capacity, quality of service, and transmitted data rate. The current status of the development of the RSFQ baseband DSP is discussed. Major components with operating speed of 30 GHz have been developed. Designs, test results, and future development of the complete systems including cryopackaging and CMOS interface are reviewed.

  • Toward Incremental Parallelization Using Navigational Programming

    Lei PAN  Wenhui ZHANG  Arthur ASUNCION  Ming Kin LAI  Michael B. DILLENCOURT  Lubomir F. BIC  Laurence T. YANG  

    PAPER-Parallel/Distributed Programming Models, Paradigms and Tools

    E89-D No:2

    The Navigational Programming (NavP) methodology is based on the principle of self-migrating computations. It is a truly incremental methodology for developing parallel programs: each step represents a functioning program, and each intermediate program is an improvement over its predecessor. The transformations are mechanical and straightforward to apply. We illustrate our methodology in the context of matrix multiplication, showing how the transformations lead from a sequential program to a fully parallel program. The NavP methodology is conducive to new ways of thinking that lead to ease of programming and high performance. Even though our parallel algorithm was derived using a sequence of mechanical transformations, it displays certain performance advantages over the classical handcrafted Gentleman's Algorithm.

  • A Super-Programming Technique for Large Sparse Matrix Multiplication on PC Clusters

    Dejiang JIN  Sotirios G. ZIAVRAS  

    PAPER-Scientific and Engineering Computing with Applications

    E87-D No:7

    The multiplication of large sparse matrices is a basic operation in many scientific and engineering applications. There exist some high-performance library routines for this operation. They are often optimized based on the target architecture. For a parallel environment, it is essential to partition the entire operation into well balanced tasks and assign them to individual processing elements. Most of the existing techniques partition the given matrices based on some kind of workload estimation. For irregular sparse matrices on PC clusters, however, the workloads may not be well estimated in advance. Any approach other than run-time dynamic partitioning may degrade performance. In this paper, we apply our super-programming approach to parallel large matrix multiplication on PC clusters. In our approach, tasks are partitioned into super-instructions that are dynamically assigned to member computer nodes. Thus, the load balancing logic is separated from the computing logic; the former is taken over by the runtime environment. Our super-programming approach facilitates ease of program development and targets high efficiency in dynamic load balancing. Workloads can be balanced effectively and the optimization overhead is small. The results prove the viability of our approach.

  • Algorithms for Matrix Multiplication and the FFT on a Processor Array with Separable Buses

    Takashi MAEBA  Mitsuyoshi SUGAYA  Shoji TATSUMI  Ken'ichi ABE  


    E86-D No:1

    This letter presents parallel algorithms for matrix multiplication and the fast Fourier transform (FFT), which are significant problems arising in engineering and scientific applications. The proposed algorithms are designed on a 3-dimensional processor array with separable buses (PASb). We show that a PASb consisting of N × N × h processors can compute matrix multiplication of size N × N and the FFT of size N in O(N/h+log N) time, respectively. In order to examine ease of hardware implementation, we also evaluate the VLSI complexity of the algorithms. A result obtained achieves an optimal bound on area-time complexity when h=O(N/log N).

  • Generalized Mesh-Connected Computers with Hyperbus Broadcasting for a Computer Network*

    Shi-Jinn HORNG  

    PAPER-Interconnection Networks

    E79-D No:8

    The mesh-connected computers with hyperbus broadcasting are an extension of the mesh-connected computers with multiple broadcasting. Instead of using local buses, we use global buses to connect processors. Such a strategy efficiently reduces the time complexity of the semigroup problem from O(N) to O(log N). Also, the matrix multiplication and the transitive closure problems are solved in O(log N) and O(log^2 N) time, respectively. Then, based on these operations, several interesting problems such as the connected recognition problem, the articulation problem, the dominator problem, the bridge problem, the sorting problem, the minimum spanning tree problem and the bipartite graph recognition problem can be solved in the order of polylogarithmic time.
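
One standard way to see how the two bounds relate is that the transitive closure of an N-vertex graph can be obtained with about log N Boolean matrix squarings, so an O(log N)-time multiplication gives O(log^2 N) overall. The abstract does not say that the paper proceeds this way. The sequential sketch below only illustrates the repeated-squaring idea, and the names `bool_mat_mul` and `transitive_closure` are assumptions.

```python
def bool_mat_mul(X, Y):
    """Boolean matrix product: OR over k of (X[i][k] AND Y[k][j])."""
    n = len(X)
    return [[any(X[i][k] and Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(adj):
    """Return (R, squarings) for an adjacency matrix adj (0/1 or bool).

    R[i][j] is True when j is reachable from i by a path of length >= 0
    (the reflexive-transitive closure)."""
    n = len(adj)
    # Start from (I OR A): reachability by paths of length at most 1.
    R = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    reach, squarings = 1, 0
    # Each squaring doubles the covered path length; simple paths in an
    # n-vertex graph have at most n - 1 edges.
    while reach < n - 1:
        R = bool_mat_mul(R, R)
        reach *= 2
        squarings += 1
    return R, squarings
```

For a directed path on four vertices, two squarings suffice, since they cover paths of up to four edges.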
