IEICE globals.ieice.org Site

Author Search Result

[Author] Shouyi YIN(20hit)


Performance Comparison of Multipath Routing Algorithms for TCP Traffic
Guangyi LIU Shouyi YIN Xiaokang LIN

LETTER-Network

Vol:
E86-B No:10
Page(s):
3144-3146
Multipath routing is a major problem for TCP traffic in traffic engineering. To address it, hash functions such as CRC-16 are usually applied over the source and destination address segments in packet headers. Through simulations and a performance comparison of several multipath algorithms, it is found that the high network utilization achieved by using hash functions comes at the expense of low fairness among coexisting TCP flows. It is also shown that packet size has a significant influence on performance.
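As a rough illustration of the hash-based splitting this abstract describes, the sketch below maps a flow to one of several paths by hashing its source and destination addresses. It is not the paper's implementation: the function name `select_path`, the use of Python's `binascii.crc_hqx` (a 16-bit CRC-CCITT checksum standing in for CRC-16), and the modulo mapping onto paths are all assumptions made for illustration.

```python
import binascii
import ipaddress


def select_path(src: str, dst: str, num_paths: int) -> int:
    """Pick a path index for a flow from its source/destination addresses.

    Hypothetical helper: the hash variant and the modulo mapping are
    illustrative assumptions, not taken from the paper.
    """
    if num_paths <= 0:
        raise ValueError("num_paths must be positive")
    # Concatenate the packed (binary) source and destination addresses.
    key = ipaddress.ip_address(src).packed + ipaddress.ip_address(dst).packed
    # 16-bit CRC (CRC-CCITT) over the address pair, starting from 0.
    checksum = binascii.crc_hqx(key, 0)
    # Fold the checksum onto the available paths.
    return checksum % num_paths


if __name__ == "__main__":
    for flow in [("10.0.0.1", "10.0.1.1"), ("10.0.0.2", "10.0.1.1")]:
        print(flow, "-> path", select_path(*flow, num_paths=4))
```

Because every packet of a flow carries the same address pair, all of its packets follow the same path, so packets within a flow are not reordered. The mapping ignores how much traffic each flow carries, which is one plausible source of the unfairness the abstract reports.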
Hybrid Wired/Wireless On-Chip Network Design for Application-Specific SoC
Shouyi YIN Yang HU Zhen ZHANG Leibo LIU Shaojun WEI

PAPER

Vol:
E95-C No:4
Page(s):
495-505
A hybrid wired/wireless on-chip network is a promising communication architecture for multi-/many-core SoCs. For application-specific SoC design, it is important to design a dedicated on-chip network architecture that matches the characteristics of the application. In this paper, we propose a heuristic wireless link allocation algorithm for creating a hybrid on-chip network architecture. The algorithm can eliminate performance bottlenecks by replacing multi-hop wired paths with high-bandwidth, single-hop, long-range wireless links. Simulation results show that the hybrid on-chip network designed by our algorithm significantly improves performance in terms of both communication delay and energy consumption.
Low-Power Loop Parallelization onto CGRA Utilizing Variable Dual V_DD
Bing XU Shouyi YIN Leibo LIU Shaojun WEI

PAPER-Architecture

Publicized:
2014/11/19
Vol:
E98-D No:2
Page(s):
243-251
Coarse-Grained Reconfigurable Architectures (CGRAs) are a promising platform owing to their high performance and low cost. Researchers have developed efficient compilers for mapping compute-intensive applications onto CGRAs using modulo scheduling. To generate the loop kernel, every stage of the kernel is forced to have the same execution time, which is determined by the critical PE. Hence, non-critical PEs can lower their supply voltage according to their slack time. The variable Dual-VDD CGRA incorporates this feature to reduce power consumption. Previous work mainly focuses on calculating a globally optimal VDDL using an overall optimization method, which does not fully exploit the flexibility of the architecture. In this brief, we adopt a variable optimal VDDL for each stage of the kernel according to that stage's pattern, instead of a fixed, simulated globally optimal VDDL. Experiments show that our proposed heuristic approach reduces power by 27.6% on average without decreasing performance. The compilation time is also acceptable.
Compiler Framework for Reconfigurable Computing Architecture
Chongyong YIN Shouyi YIN Leibo LIU Shaojun WEI

BRIEF PAPER

Vol:
E92-C No:10
Page(s):
1284-1290
The compiler is the most important supporting tool for facilitating the use of a reconfigurable computing architecture (RCA). In this paper, a template-based compiler framework is proposed. This compiler can synthesize executables for the RCA directly from native high-level programming language source code. It supports generating run-time dynamic configuration context, and it can generate both full and partial configuration contexts. Experimental results show that the executables generated by the proposed compiler achieve better execution performance and smaller configuration context size than those of previous compilers. Moreover, this compiler does not require the programmer to have any extra knowledge of the RCA hardware architecture.
Affine Transformations for Communication and Reconfiguration Optimization of Mapping Loop Nests on CGRAs
Shouyi YIN Dajiang LIU Leibo LIU Shaojun WEI

PAPER-Design Methodology

Vol:
E96-D No:8
Page(s):
1582-1591
A coarse-grained reconfigurable architecture (CGRA) is typically a hybrid architecture composed of a reconfigurable processing unit (RPU) and a host microprocessor. Computation-intensive kernels (e.g., loop nests) are often mapped onto RPUs to speed up program execution. Thus, optimizing the mapping of loop nests is very important for improving CGRA performance. Processing element (PE) utilization rate, communication volume, and reconfiguration cost are three crucial factors in RPU performance. Loop transformations can greatly affect these three factors and are therefore significant when mapping loops onto RPUs. In this paper, a joint loop transformation approach for RPUs is proposed, in which the PE utilization rate, communication cost, and reconfiguration cost are considered jointly. Our approach could be integrated into compilers for CGRAs to improve operating performance. Compared with the communication-minimal approach, experimental results show that our scheme improves execution time by 5.8% and 13.6% on motion estimation (ME) and partial differential equation (PDE) solver kernels, respectively. The run-time complexity is also acceptable for practical cases.
The Implementation of Texture-Based Video Up-Scaling on Coarse-Grained Reconfigurable Architecture
Rui SHI Shouyi YIN Leibo LIU Qiongbing LIU Shuang LIANG Shaojun WEI

PAPER-Application

Publicized:
2014/11/19
Vol:
E98-D No:2
Page(s):
276-287
Video up-scaling is a hot topic in the TV display area. As an important branch of video up-scaling, the Texture-Based Video Up-scaling (TBVU) method shows great potential for hardware implementation. Coarse-Grained Reconfigurable Architecture (CGRA) is a very promising processor architecture: it is a parallel computing platform that provides the high performance of hardware, the high flexibility of software, and dynamic reconfiguration ability. In this paper, we propose an implementation of TBVU on CGRA. We fully exploit the characteristics of TBVU and use several techniques to reduce memory I/O operations and total execution time. Experimental results show that our work greatly reduces I/O operations and execution time compared with non-optimized implementations. We also compare our work with other platforms and find a great advantage in execution time and resource utilization rate.
Mapping Multi-Level Loop Nests onto CGRAs Using Polyhedral Optimizations
Dajiang LIU Shouyi YIN Leibo LIU Shaojun WEI

PAPER

Vol:
E98-A No:7
Page(s):
1419-1430
The coarse-grained reconfigurable architecture (CGRA) is a promising computing platform that provides both high performance and high power efficiency. The computation-intensive portions of an application (e.g., loop nests) are often mapped onto a CGRA for acceleration. However, mapping loop nests onto a CGRA efficiently is quite a challenge due to the special characteristics of the CGRA. To optimize the mapping of loop nests onto a CGRA, this paper makes three contributions: i) establishing a precise performance model of mapping loop nests onto a CGRA, ii) formulating loop nest mapping as a nonlinear optimization problem based on the polyhedral model, and iii) extracting an efficient heuristic algorithm and building a complete flow for mapping loop nests onto a CGRA (PolyMAP). Experimental results on most kernels of PolyBench and on real-life applications show that our proposed approach improves the performance of the kernels by 27% on average compared with state-of-the-art methods. The runtime complexity of our approach is also acceptable.
The Organization of On-Chip Data Memory in One Coarse-Grained Reconfigurable Architecture
Yansheng WANG Leibo LIU Shouyi YIN Min ZHU Peng CAO Jun YANG Shaojun WEI

PAPER-VLSI Design Technology and CAD

Vol:
E96-A No:11
Page(s):
2218-2229
RCP (Reconfigurable Computing Processor) is intended to fill the gap between ASIC and GPP (General Purpose Processor): it achieves much higher energy efficiency than a GPP while being much more flexible than an ASIC. In this paper, an organization of on-chip data memory called LIBODM (LIfetime Based On-chip Data Memory) is proposed to reduce the data reference delay and the on-chip data memory size in RCP. In the LIBODM, the allocation of data is based on data dependency. Data with low data dependency are stored off-chip to save storage cost, while data with high data dependency are stored on-chip to reduce the reference delay. In addition, the on-chip data are classified into two types based on the lifetime of the data. Short-lifetime data are preferably stored in a FIFO, which naturally increases the reuse ratio of memory space. Long-lifetime data are preferably stored in RAM so that they can be referenced several times. The LIBODM has been verified in a CGRA (Coarse Grained Reconfigurable Architecture) called RPU (Reconfigurable Processing Unit), and two RPUs have been integrated in an RCP, REMUS_HP (High Performance version of Reconfigurable MUlti-media System), focused on video decoding. Thanks to the LIBODM, high performance can still be achieved although the on-chip data memory in REMUS_HP is small. Compared with XPP and ADRES, the on-chip data memory size of REMUS_HP at the same performance level is only 23.9% and 14.8%, respectively. REMUS_HP is implemented on 48.9 mm² of silicon with TSMC 65 nm technology. Simulation shows that 1920*1088 @30fps can be achieved for H.264 high-profile decoding at a 200 MHz working frequency. Compared with the high performance version of XPP, the performance is boosted by 150% and the energy efficiency by 17.59x.
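The placement rule in this abstract (dependency decides on-chip versus off-chip, lifetime decides FIFO versus RAM) can be sketched as a small decision function. This is only an illustration: the names `Placement` and `place`, the integer measures of dependency and lifetime, and both threshold values are hypothetical and do not come from the paper.

```python
from enum import Enum


class Placement(Enum):
    OFF_CHIP = "off-chip memory"
    FIFO = "on-chip FIFO"
    RAM = "on-chip RAM"


def place(dependency: int, lifetime: int,
          dep_threshold: int = 2, life_threshold: int = 16) -> Placement:
    """Choose where a data item lives, following the rule in the abstract.

    dependency: how many later operations depend on the item (assumed unit).
    lifetime:   how long the item stays live, in cycles (assumed unit).
    Both thresholds are hypothetical values chosen for illustration.
    """
    # Low-dependency data go off-chip to save on-chip storage.
    if dependency < dep_threshold:
        return Placement.OFF_CHIP
    # High-dependency, short-lived data go to a FIFO so space is reused quickly.
    if lifetime <= life_threshold:
        return Placement.FIFO
    # High-dependency, long-lived data go to RAM for repeated references.
    return Placement.RAM


if __name__ == "__main__":
    for dep, life in [(0, 100), (5, 4), (5, 100)]:
        print(f"dependency={dep}, lifetime={life} -> {place(dep, life).value}")
```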
A Cycle-Accurate Simulator for a Reconfigurable Multi-Media System
Min ZHU Leibo LIU Shouyi YIN Chongyong YIN Shaojun WEI

PAPER

Vol:
E93-D No:12
Page(s):
3202-3210
This paper introduces a cycle-accurate Simulator for a dynamically REconfigurable MUlti-media System, called SimREMUS. SimREMUS can be used either at the transaction level, which allows the modeling and simulation of higher-level hardware and embedded software, or at the register transfer level, if the dynamic system behavior needs to be observed at the signal level. Trade-offs among a set of criteria that are frequently used to characterize the design of a reconfigurable computing system, such as granularity, programmability, configurability, and the architecture of processing elements and route modules, can be quickly evaluated. Moreover, a complete tool chain for SimREMUS, including a compiler and a debugger, has been developed. SimREMUS can simulate 270 k cycles per second for a million-gate SoC (System-on-a-Chip) and produced one H.264 1080p frame in 15 minutes, which might take days on VCS (platform: CPU: E5200 @ 2.5 GHz, RAM: 2.0 GB). Simulation showed that 1080p@30 fps of H.264 High Profile @ Level 4 can be achieved at a 200 MHz working frequency on the VLSI architecture of REMUS.
Mapping Optimization of Affine Loop Nests for Reconfigurable Computing Architecture
Dajiang LIU Shouyi YIN Chongyong YIN Leibo LIU Shaojun WEI

PAPER-Computer Architecture

Vol:
E95-D No:12
Page(s):
2898-2907
A reconfigurable computing system is a class of parallel architecture that computes in hardware to increase performance while retaining much of the flexibility of a software solution. This architecture is particularly suitable for running regular, compute-intensive tasks, and most compute-intensive tasks spend most of their running time in nested loops. The polyhedron model is a powerful tool for applying suitable transformations to such nested loops. In this paper, a number of issues are addressed toward the goal of optimizing affine loop nests for a reconfigurable cell array (RCA): an approach that makes the most use of processing elements (PEs) while minimizing the communication volume through loop transformation in the polyhedron model, determination of the tiling form by intra-statement dependence analysis, and determination of the tiling size from the tiling form and the RCA size. Experimental results on a number of kernels demonstrate the effectiveness of the mapping optimization approaches developed. Compared with the DFG-based optimization approach, the execution performance of 1-D Jacobi and matrix multiplication is improved by 28% and 48.47%, respectively. Lastly, the run-time complexity is acceptable for practical cases.
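For readers unfamiliar with loop tiling, the sketch below shows the general transformation on matrix multiplication, one of the kernels named in this abstract. It is a generic textbook tiling in plain Python, not the paper's polyhedral method: the function name `matmul_tiled` and the idea of passing a single square `tile` size (for instance, one chosen to match a PE array dimension) are assumptions for illustration.

```python
def matmul_tiled(a, b, tile):
    """Multiply matrices a (n x m) and b (m x p) using square loop tiles.

    `tile` is a hypothetical tile edge length; the result is identical to
    an untiled triple loop, only the iteration order changes.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    # Outer loops walk over tiles of the iteration space.
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # Inner loops process one tile; min() handles ragged edges.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + tile, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_tiled(a, b, tile=2))
```

Each tile touches only a small block of `a`, `b`, and `c`, which is why the tile size is naturally tied to the amount of local compute and storage available.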
Configuration Context Reduction for Coarse-Grained Reconfigurable Architecture
Shouyi YIN Chongyong YIN Leibo LIU Min ZHU Shaojun WEI

PAPER-Design Methodology

Vol:
E95-D No:2
Page(s):
335-344
Coarse-grained reconfigurable architecture (CGRA) combines the performance of application-specific integrated circuits (ASICs) with the flexibility of general-purpose processors (GPPs), which makes it a promising solution for embedded systems. With the increasing complexity of reconfigurable resources (processing elements, routing cells, I/O blocks, etc.), the reconfiguration cost is becoming the performance bottleneck. The major reconfiguration cost comes from the frequent memory read/write operations needed to transfer the configuration context from main memory to the context buffer. To improve overall performance, it is critical to reduce the amount of configuration context. In this paper, we propose a configuration context reduction method for CGRA. The proposed method exploits the structural correlation of the computation tasks mapped onto the CGRA and reduces the redundancies in the configuration context. Experimental results show that the proposed method can, on average, reduce the configuration context size by up to 71% and speed up execution by up to 68%. The proposed method does not depend on any architectural feature and can be applied to a CGRA with an arbitrary architecture.
Concurrent Detection and Recognition of Individual Object Based on Colour and p-SIFT Features
Jienan ZHANG Shouyi YIN Peng OUYANG Leibo LIU Shaojun WEI

PAPER

Vol:
E96-A No:6
Page(s):
1357-1365
In this paper, we propose a method that uses the features of an individual object to locate and recognize that object concurrently in a static image, using multi-feature fusion based on a sample library of multiple objects. The method is motivated by the observation that much previous work focuses on category recognition and takes advantage of the common characteristics of a specific category to detect its presence. However, these algorithms cease to be effective when searching for individual objects, rather than categories, against a complex background. To solve this problem, we abandon the concept of category and propose an effective way to use the features of an individual object directly as clues for detection and recognition. In our system, we introduce a multi-feature fusion method based on the colour histogram and the prominent SIFT (p-SIFT) feature to improve detection and recognition accuracy. The p-SIFT feature, which we propose, is an improved SIFT feature obtained by further feature extraction of correlation information based on a Feature Matrix; it aims at low computational complexity with a good matching rate. In the object detection process, we abandon conventional methods and instead make full use of multiple features, starting with a simple but effective step: using the colour feature to reduce the number of patches of interest (POI). Our method is evaluated on several publicly available datasets, including the Pascal VOC 2005 dataset, Objects101, and datasets provided by Achanta et al.
CropNET: A Wireless Multimedia Sensor Network for Agricultural Monitoring
Shouyi YIN Zhongfu SUN Leibo LIU Shaojun WEI

LETTER

Vol:
E93-B No:8
Page(s):
2073-2076
Motivated by the needs of modern agriculture, in this paper we present CropNET, a wireless multimedia sensor network system for agriculture monitoring. Both hardware and software designs of CropNET are tailored for sensing in wide farmland without human supervision. We have carried out multiple rounds of deployments. The evaluation results show that CropNET performs well and facilitates modern agriculture.
Multi-Battery Scheduling for Battery-Powered DVS Systems
Peng OUYANG Shouyi YIN Leibo LIU Shaojun WEI

PAPER-Energy in Electronics Communications

Vol:
E95-B No:7
Page(s):
2278-2285
More and more mobile devices adopt multiple batteries and a dynamic voltage scaling (DVS) policy to reduce energy consumption and extend battery runtime. However, because the nonlinear characteristics of the multi-battery are not considered, the practical efficiency is not good enough. To reduce energy consumption and extend battery runtime, this paper proposes an approach based on the battery characteristics that co-optimizes multi-battery scheduling and dynamic voltage scaling on multi-battery powered systems. In this work, considering the nonlinear discharging characteristics of existing batteries, we use a Markov process to describe the multi-battery discharging behavior, build a multi-objective optimization model that captures the energy consumption and battery states, and then propose a binary-tree-based algorithm to solve this model. With this method, we obtain an optimal and applicable scheme for multi-battery scheduling and dynamic voltage scaling. Experimental results show that this approach achieves an average improvement in battery runtime of 17.5% over current methods in a physical implementation.
Parallelization of Computing-Intensive Tasks of the H.264 High Profile Decoding Algorithm on a Reconfigurable Multimedia System
Tongsheng GENG Leibo LIU Shouyi YIN Min ZHU Shaojun WEI

PAPER

Vol:
E93-D No:12
Page(s):
3223-3231
This paper proposes approaches to perform HW/SW (Hardware/Software) partitioning and parallelization of computing-intensive tasks of the H.264 HiP (High Profile) decoding algorithm on an embedded coarse-grained reconfigurable multimedia system called REMUS (REconfigurable MUltimedia System). Several techniques, such as MB (Macro-Block) based parallelization and unfixed sub-block operation, are used to speed up the decoding process and satisfy the requirements of real-time, high-quality H.264 applications. Tests show that the execution performance of MC (Motion Compensation), deblocking, and IDCT-IQ (Inverse Discrete Cosine Transform-Inverse Quantization) on REMUS is improved by 60%, 73%, and 88.5% in the typical case and by 60%, 69%, and 88.5% in the worst case, respectively, compared with that on XPP PACT (a commercial reconfigurable processor). Compared with ASIC solutions, the performance of MC is improved by 70% and 74% in the typical and worst cases, respectively, while that of deblocking remains the same. The performance of IDCT-IQ is improved by 17% in both the typical and worst cases. Relying on the proposed techniques, 1080p@30 fps H.264 HiP @ Level 4 decoding can be achieved on REMUS at a 200 MHz working frequency.
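To put the 1080p@30 fps at 200 MHz target in perspective, the short calculation below estimates the average clock-cycle budget per macroblock. It assumes the standard 16×16 H.264 macroblock and the 1920×1088 coded frame size quoted in another abstract in this listing; the function name `cycles_per_macroblock` is hypothetical, and the figure is a back-of-the-envelope average, not a result reported by the paper.

```python
MB_SIZE = 16  # H.264 macroblocks cover 16x16 luma samples


def cycles_per_macroblock(width: int, height: int, fps: int, clock_hz: int) -> int:
    """Average whole clock cycles available per macroblock at a given rate."""
    mbs_per_frame = (width // MB_SIZE) * (height // MB_SIZE)
    mbs_per_second = mbs_per_frame * fps
    return clock_hz // mbs_per_second


if __name__ == "__main__":
    # 120 x 68 = 8160 macroblocks per frame, 244800 per second at 30 fps.
    print(cycles_per_macroblock(1920, 1088, 30, 200_000_000))  # prints 816
```

Roughly 800 cycles must cover every decoding stage for a macroblock, which gives a sense of why the computing-intensive stages are parallelized.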
Battery-Aware Task Mapping for Coarse-Grained Reconfigurable Architecture
Shouyi YIN Rui SHI Leibo LIU Shaojun WEI

PAPER

Vol:
E96-D No:12
Page(s):
2524-2535
Coarse-grained Reconfigurable Architecture (CGRA) is a parallel computing platform that provides both the high performance of hardware and the high flexibility of software. It is becoming a promising platform for embedded and mobile applications. Since embedded and mobile devices are usually battery-powered, improving battery lifetime becomes one of the primary design issues in using CGRAs. In this paper, we propose a battery-aware task-mapping method to optimize energy consumption and improve battery lifetime. The proposed method mainly addresses two problems that arise when mapping applications onto a CGRA: task partitioning and task scheduling. Task partitioning and scheduling are formulated as a joint optimization problem of minimizing energy consumption. The nonlinear effects of real batteries are taken into account in the problem formulation. Using the insights from the problem formulation, we design the task-mapping algorithm. We have used several real-world benchmarks to test the effectiveness of the proposed method. Experimental results show that our method can dramatically lower energy consumption and prolong battery life.
An Inductive-Coupling Interconnected Application-Specific 3D NoC Design
Zhen ZHANG Shouyi YIN Leibo LIU Shaojun WEI

PAPER-High-Level Synthesis and System-Level Design

Vol:
E96-A No:12
Page(s):
2633-2644
TSV-interconnected 3D chips face problems such as high cost, low yield, and large power dissipation. We propose a wireless 3D on-chip network architecture for application-specific SoC design that uses inductive-coupling interconnect instead of TSVs for inter-layer communication. The primary design challenge of an inductive-coupling 3D SoC is allocating wireless links in the 3D on-chip network effectively. We develop a design flow that fully exploits the design space opened up by wireless links while providing flexible trade-offs for the user to choose from. Experimental results show that our design brings great improvement over the uniform design and the Sunfloor algorithm in latency (5% to 20%) and power consumption (10% to 45%).
An Implementation of Multiple-Standard Video Decoder on a Mixed-Grained Reconfigurable Computing Platform
Leibo LIU Dong WANG Yingjie CHEN Min ZHU Shouyi YIN Shaojun WEI

PAPER-Computer System

Publicized:
2016/02/02
Vol:
E99-D No:5
Page(s):
1285-1295
This paper presents the design of a multiple-standard 1080 high definition (HD) video decoder on a mixed-grained reconfigurable computing platform integrating coarse-grained reconfigurable processing units (RPUs) and FPGAs. The proposed RPU, which includes 16×16 multi-functional processing elements (PEs), is used to accelerate compute-intensive tasks in video decoding. A soft-core-based microprocessor array is implemented on the FPGA and used to speed up the dynamic reconfiguration of the RPU. Furthermore, a mailbox-based communication scheme is used to improve the communication efficiency between RPUs and FPGAs. By exploiting dynamic reconfiguration of the RPUs and static reconfiguration of the FPGAs, the proposed platform achieves scalable performance and cost trade-offs to support a variety of video coding standards, including MPEG-2, AVS, H.264, and HEVC. The measured results show that the proposed platform can support H.264 1080 HD video streams at up to 57 frames per second (fps) and HEVC 1080 HD video streams at up to 52 fps at 250 MHz. At the same time, it achieves a 3.6× performance gain over an industrial coarse-grained reconfigurable processor for H.264 decoding and a 6.43× performance boost over a general-purpose-processor-based implementation for HEVC decoding.
Battery-Aware Loop Nests Mapping for CGRAs
Yu PENG Shouyi YIN Leibo LIU Shaojun WEI

PAPER-Architecture

Vol:
E98-D No:2
Page(s):
230-242
Coarse-grained Reconfigurable Architecture (CGRA) is a promising mobile computing platform that provides both high performance and high energy efficiency. In an application, loop nests are usually mapped onto a CGRA for further acceleration, so optimizing the mapping is an important goal in the design of CGRAs. Moreover, since almost all mobile devices are powered by batteries, reducing energy consumption also becomes one of the primary concerns in using CGRAs. This paper makes three contributions: a) proposing an energy consumption model for CGRA; b) formulating the loop nest mapping problem to minimize the battery charge loss; c) extracting an efficient heuristic algorithm called BPMap. Experimental results on most kernels of the benchmarks and on real-life applications show that our methods can improve the performance of the kernels and lower the energy consumption.
Parallelization of Computing-Intensive Tasks of SIFT Algorithm on a Reconfigurable Architecture System
Peng OUYANG Shouyi YIN Hui GAO Leibo LIU Shaojun WEI

PAPER

Vol:
E96-A No:6
Page(s):
1393-1402
The Scale Invariant Feature Transform (SIFT) algorithm is an excellent approach for feature detection. It is characterized by data-intensive computation. Current studies on accelerating the SIFT algorithm fall mainly into three categories: optimizing the parallel parts of the algorithm on general-purpose multi-core processors, designing customized multi-core processors dedicated to SIFT, and implementing it on FPGA platforms. The real-time performance of SIFT has been greatly improved. However, in some solutions, factors such as the input image size, the number of octaves, and the scale factors in the SIFT algorithm are restricted, so the flexibility needed to ensure high execution performance under variable factors should be improved. This paper proposes a reconfigurable solution to this problem. We fully exploit the algorithm and adopt several techniques, such as full parallel execution, block computation, and CORDIC transformation, to improve the execution efficiency on a REconfigurable MUltimedia System called REMUS. Experimental results show that the execution performance of SIFT is improved by 33%, 50%, and 8 times compared with that on the multi-core platform, FPGA, and ASIC, respectively. The dynamic reconfiguration scheme in this work can configure the circuits to meet the computation requirements under different input image sizes, different numbers of octaves, and different scale factors during computation.
