IEICE globals.ieice.org Site

Author Search Result

[Author] Shuai MU (4 hits)

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs
Shuai MU, Dongdong LI, Yubei CHEN, Yangdong DENG, Zhihua WANG

PAPER-Computer System

Vol: E96-D  No: 10  Page(s): 2194-2207

By exploiting data-level parallelism, Graphics Processing Units (GPUs) have become a high-throughput, general-purpose computing platform. However, many real-world applications, especially those following a stream processing pattern, feature interleaved task-pipelined and data parallelism. Current GPUs are ill-equipped for such applications because of insufficient usage of computing resources and/or excessive off-chip memory traffic. In this paper, we focus on microarchitectural enhancements to enable task-pipelined execution of data-parallel kernels on GPUs. We propose an efficient adaptive dynamic scheduling mechanism and a moderately modified L2 design. With minor hardware overhead, our techniques orchestrate both task-pipelined and data parallelism in a unified manner. Results from a cycle-accurate simulator on real-world applications show that the proposed GPU microarchitecture improves computing throughput by 18% and reduces overall accesses to off-chip GPU memory by 13%.

Toward Concurrent Lock-Free Queues on GPUs
Xiangyu ZHANG, Yangdong DENG, Shuai MU

LETTER-Fundamentals of Information Systems

Vol: E97-D  No: 7  Page(s): 1901-1904

General-purpose computing on GPUs (GPGPU) has become a popular computing model for high-performance, data-intensive applications. Accordingly, there is a strong need for highly efficient data structures that ease the development of GPGPU applications. In this work, we propose an efficient concurrent queue data structure for GPU computing. The GPU-based, provably correct, lock-free FIFO queue allows a massive number of concurrent producers and consumers. Warp-centric enqueue and dequeue procedures are introduced to better match the underlying Single-Instruction, Multiple-Thread execution model of modern GPUs. The queue outperforms the best previous GPU queues by up to 40-fold. The correctness of the proposed queue operations is formally validated against the linearizability criterion.

Shared Latent Embedding Learning for Multi-View Subspace Clustering
Zhaohu LIU, Peng SONG, Jinshuai MU, Wenming ZHENG

LETTER-Artificial Intelligence, Data Mining

Publicized: 2023/10/17  Vol: E107-D  No: 1  Page(s): 148-152

Most existing multi-view subspace clustering approaches capture only the inter-view similarities and ignore the optimal local geometric structure of the original data. To address this, in this letter we put forward a novel method named shared latent embedding learning for multi-view subspace clustering (SLE-MSC), which can efficiently capture a better latent space. Specifically, we introduce a pseudo-label constraint to capture the intra-view similarities within each view. Meanwhile, we utilize a novel optimal graph Laplacian to learn the consistent latent representation, in which the common manifold is considered as the optimal manifold to obtain a more reasonable local geometric structure. Comprehensive experimental results indicate the superiority and effectiveness of the proposed method.

Performance Optimization for Sparse A^tAx in Parallel on Multicore CPU
Yuan TAO, Yangdong DENG, Shuai MU, Zhenzhong ZHANG, Mingfa ZHU, Limin XIAO, Li RUAN

LETTER-Fundamentals of Information Systems

Vol: E97-D  No: 2  Page(s): 315-318

The sparse matrix operation y ← y + A^tAx, where A is a sparse matrix and x and y are dense vectors, is a widely used computing pattern in High Performance Computing (HPC) applications. The pattern poses a challenge to efficient solutions because both a matrix and its transpose are involved. An efficient sparse matrix format, Compressed Sparse Blocks (CSB), has been proposed to provide nearly the same performance for both Ax and A^tx. We develop a multithreaded implementation for the CSB format and apply it to solve y ← y + A^tAx. Experiments show that our technique outperforms the Compressed Sparse Row (CSR) based solution in POSKI by up to 2.5-fold on over 70% of benchmarking matrices.
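
To make the operation in the abstract above concrete, here is a minimal, single-threaded sketch of y ← y + A^tAx over a plain CSR matrix. It is not the CSB-based multithreaded method the letter describes; it only illustrates why both A and its transpose are involved. The function name `ata_x_update` and the argument names are hypothetical, chosen for this illustration.

```python
def ata_x_update(row_ptr, col_idx, vals, x, y):
    """Update y in place with y + A^t (A x), where A is stored in CSR form.

    row_ptr, col_idx, vals are the usual CSR arrays (assumed layout):
    row i owns the entries vals[row_ptr[i]:row_ptr[i + 1]].
    """
    n_rows = len(row_ptr) - 1
    for i in range(n_rows):
        start, end = row_ptr[i], row_ptr[i + 1]
        # First pass over row i: t_i = (A x)_i, a sparse dot product.
        t_i = 0.0
        for k in range(start, end):
            t_i += vals[k] * x[col_idx[k]]
        # Second pass over the same row: row i of A is column i of A^t,
        # so t_i is scattered into y at the row's column positions.
        for k in range(start, end):
            y[col_idx[k]] += vals[k] * t_i
    return y


# A = [[1, 0, 2],
#      [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
y = ata_x_update(row_ptr, col_idx, vals, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
print(y)  # [3.0, 9.0, 6.0]
```

Fusing the two passes per row avoids materializing A^t or the intermediate vector Ax, at the cost of scattered writes into y, which is the part that complicates parallel execution.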
