Author Search Result

[Author] Shuai MU(4hit)

1-4hit
  • Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs

    Shuai MU  Dongdong LI  Yubei CHEN  Yangdong DENG  Zhihua WANG  

     
    PAPER-Computer System

    Vol: E96-D  No: 10  Page(s): 2194-2207

    By exploiting data-level parallelism, Graphics Processing Units (GPUs) have become a high-throughput, general-purpose computing platform. Many real-world applications, especially those following a stream processing pattern, however, feature interleaved task-pipelined and data parallelism. Current GPUs are ill-equipped for such applications due to the insufficient usage of computing resources and/or the excessive off-chip memory traffic. In this paper, we focus on microarchitectural enhancements to enable task-pipelined execution of data-parallel kernels on GPUs. We propose an efficient adaptive dynamic scheduling mechanism and a moderately modified L2 design. With minor hardware overhead, our techniques orchestrate both task-pipeline and data parallelism in a unified manner. Simulation results derived by a cycle-accurate simulator on real-world applications show that the proposed GPU microarchitecture improves the computing throughput by 18% and reduces the overall accesses to off-chip GPU memory by 13%.

  • Toward Concurrent Lock-Free Queues on GPUs

    Xiangyu ZHANG  Yangdong DENG  Shuai MU  

     
    LETTER-Fundamentals of Information Systems

    Vol: E97-D  No: 7  Page(s): 1901-1904

    General-purpose computing on GPUs (GPGPU) has become a popular computing model for high-performance, data-intensive applications. Accordingly, there is a strong need to develop highly efficient data structures to ease the development of GPGPU applications. In this work, we propose an efficient concurrent queue data structure for GPU computing. The GPU-based, provably correct, lock-free FIFO queue allows a massive number of concurrent producers and consumers. Warp-centric en-queue and de-queue procedures are introduced to better match the underlying Single-Instruction, Multiple-Thread execution model of modern GPUs. It outperforms the best previous GPU queues by up to 40-fold. The correctness of the proposed queue operations is formally validated by linearizability criteria.

  • Shared Latent Embedding Learning for Multi-View Subspace Clustering

    Zhaohu LIU  Peng SONG  Jinshuai MU  Wenming ZHENG  

     
    LETTER-Artificial Intelligence, Data Mining

    Publicized: 2023/10/17  Vol: E107-D  No: 1  Page(s): 148-152

    Most existing multi-view subspace clustering approaches only capture the inter-view similarities between different views and ignore the optimal local geometric structure of the original data. To this end, in this letter, we put forward a novel method named shared latent embedding learning for multi-view subspace clustering (SLE-MSC), which can efficiently capture a better latent space. To be specific, we introduce a pseudo-label constraint to capture the intra-view similarities within each view. Meanwhile, we utilize a novel optimal graph Laplacian to learn the consistent latent representation, in which the common manifold is considered as the optimal manifold to obtain a more reasonable local geometric structure. Comprehensive experimental results indicate the superiority and effectiveness of the proposed method.

  • Performance Optimization for Sparse AtAx in Parallel on Multicore CPU

    Yuan TAO  Yangdong DENG  Shuai MU  Zhenzhong ZHANG  Mingfa ZHU  Limin XIAO  Li RUAN  

     
    LETTER-Fundamentals of Information Systems

    Vol: E97-D  No: 2  Page(s): 315-318

    The sparse matrix operation, y ← y+AtAx, where A is a sparse matrix and x and y are dense vectors, is a widely used computing pattern in High Performance Computing (HPC) applications. The pattern poses a challenge to efficient solutions because both a matrix and its transposed version are involved. An efficient sparse matrix format, Compressed Sparse Blocks (CSB), has been proposed to provide nearly the same performance for both Ax and Atx. We develop a multithreaded implementation for the CSB format and apply it to solve y ← y+AtAx. Experiments show that our technique outperforms the Compressed Sparse Row (CSR) based solution in POSKI by up to 2.5-fold on over 70% of benchmarking matrices.
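
    As a rough illustration of the operation named in this abstract, the following minimal sketch computes y ← y+AtAx in two passes over a matrix stored in CSR form. It is not the CSB-based multithreaded method of the paper; it is a sequential, pure-Python sketch, and the function names (`csr_matvec`, `csr_rmatvec_add`, `atax_update`) and the small test matrix are assumptions made only for this example.

```python
# Sequential sketch of y <- y + A^T (A x) with A in CSR form.
# Assumption: this is NOT the paper's CSB implementation; names are hypothetical.

def csr_matvec(n_rows, indptr, indices, data, x):
    """Return t = A x, where A is given by the CSR arrays."""
    t = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            t[i] += data[k] * x[indices[k]]
    return t


def csr_rmatvec_add(n_rows, indptr, indices, data, t, y):
    """Update y in place with y += A^T t by scattering each row of A into y."""
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            y[indices[k]] += data[k] * t[i]
    return y


def atax_update(n_rows, indptr, indices, data, x, y):
    """Compute y <- y + A^T A x without forming A^T or A^T A explicitly."""
    t = csr_matvec(n_rows, indptr, indices, data, x)
    return csr_rmatvec_add(n_rows, indptr, indices, data, t, y)


# A = [[1, 0, 2],
#      [0, 3, 0]]   (2 rows, 3 columns)
indptr = [0, 2, 3]
indices = [0, 2, 1]
data = [1.0, 2.0, 3.0]

x = [1.0, 1.0, 1.0]
y = [1.0, 0.0, 0.0]
result = atax_update(2, indptr, indices, data, x, y)
print(result)  # [4.0, 9.0, 6.0]
```

    The second pass walks the rows of A but writes to column positions of y. This scattered access is one way to see why the transposed product is awkward for a row-oriented format, which is the difficulty the abstract attributes to the pattern.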
