1-20hit |
Guangyi LIU Shouyi YIN Xiaokang LIN
Multipath is a big problem for TCP traffic in traffic engineering. To solve it, hash functions such as CRC-16 are usually applied over source and destination address segments in packet headers. Through simulations and performance comparison of several multipath algorithms, it is found out that high network utilization achieved by using hash functions is at the expense of low fairness among coexisting TCP flows. It is also illustrated that packet size has significant influence on performance.
Shouyi YIN Yang HU Zhen ZHANG Leibo LIU Shaojun WEI
Hybrid wired/wireless on-chip network is a promising communication architecture for multi-/many-core SoC. For application-specific SoC design, it is important to design a dedicated on-chip network architecture according to the application-specific nature. In this paper, we propose a heuristic wireless link allocation algorithm for creating hybrid on-chip network architecture. The algorithm can eliminate the performance bottleneck by replacing multi-hop wired paths by high-bandwidth single-hop long-range wireless links. The simulation results show that the hybrid on-chip network designed by our algorithm improves the performance in terms of both communication delay and energy consumption significantly.
Bing XU Shouyi YIN Leibo LIU Shaojun WEI
Coarse Grained Reconfigurable Architectures (CGRAs) are promising platform based on its high-performance and low cost. Researchers have developed efficient compilers for mapping compute-intensive applications on CGRA using modulo scheduling. In order to generate loop kernel, every stage of kernel are forced to have the same execution time which is determined by the critical PE. Hence non-critical PEs can decrease the supply voltage according to its slack time. The variable Dual-VDD CGRA incorporates this feature to reduce power consumption. Previous work mainly focuses on calculating a global optimal VDDL using overall optimization method that does not fully exploit the flexibility of architecture. In this brief, we adopt variable optimal VDDL in each stage of kernel concerning their pattern respectively instead of the fixed simulated global optimal VDDL. Experiment shows our proposed heuristic approach could reduce the power by 27.6% on average without decreasing performance. The compilation time is also acceptable.
Chongyong YIN Shouyi YIN Leibo LIU Shaojun WEI
Compiler is the most important supporting tool to facilitate the use of reconfigurable computing architecture (RCA). In this paper, a template-based compiler framework is proposed. This compiler can synthesize the executables for RCA from native high-level programming language source code directly. It supports to generate run-time dynamic configuration context. And it is capable to generate both full configuration context and partial configuration context. Experimental results show that the executables generated by the proposed compiler can achieve better execution performance and smaller configuration context size than previous compilers. Moreover, this compiler does not require the programmer to have any extra knowledge about the hardware architecture of RCA.
Shouyi YIN Dajiang LIU Leibo LIU Shaojun WEI
A coarse-grained reconfigurable architecture (CGRA) is typically hybrid architecture, which is composed of a reconfigurable processing unit (RPU) and a host microprocessor. Many computation-intensive kernels (e.g., loop nests) are often mapped onto RPUs to speed up the execution of programs. Thus, mapping optimization of loop nests is very important to improve the performance of CGRA. Processing element (PE) utilization rate, communication volume and reconfiguration cost are three crucial factors for the performance of RPUs. Loop transformations can affect these three performance influencing factors greatly, and would be of much significance when mapping loops onto RPUs. In this paper, a joint loop transformation approach for RPUs is proposed, where the PE utilization rate, communication cost and reconfiguration cost are under a joint consideration. Our approach could be integrated into compilers for CGRAs to improve the operating performance. Compared with the communication-minimal approach, experimental results show that our scheme can improve 5.8% and 13.6% of execution time on motion estimation (ME) and partial differential equation (PDE) solvers kernels, respectively. Also, run-time complexity is acceptable for the practical cases.
Rui SHI Shouyi YIN Leibo LIU Qiongbing LIU Shuang LIANG Shaojun WEI
Video Up-scaling is a hotspot in TV display area; as an important brunch of Video Up-scaling, Texture-Based Video Up-scaling (TBVU) method shows great potential of hardware implementation. Coarse-grained Reconfigurable Architecture (CGRA) is a very promising processor; it is a parallel computing platform which provides high performance of hardware, high flexibility of software, and dynamical reconfiguration ability. In this paper we propose an implementation of TBVU on CGRA. We fully exploit the characters of TBVU and utilize several techniques to reduce memory I/O operation and total execution time. Experimental results show that our work can greatly reduce the I/O operation and the execution time compared with the non-optimized ones. We also compare our work with other platforms and find great advantage in execution time and resource utilization rate.
Dajiang LIU Shouyi YIN Leibo LIU Shaojun WEI
The coarse-grained reconfigurable architecture (CGRA) is a promising computing platform that provides both high performance and high power-efficiency. The computation-intensive portions of an application (e.g. loop nests) are often mapped onto CGRA for acceleration. However, mapping loop nests onto CGRA efficiently is quite a challenge due to the special characteristics of CGRA. To optimize the mapping of loop nests onto CGRA, this paper makes three contributions: i) Establishing a precise performance model of mapping loop nests onto CGRA, ii) Formulating the loop nests mapping as a nonlinear optimization problem based on polyhedral model, iii) Extracting an efficient heuristic algorithm and building a complete flow of mapping loop nests onto CGRA (PolyMAP). Experiment results on most kernels of the PolyBench and real-life applications show that our proposed approach can improve the performance of the kernels by 27% on average, as compared to the state-of-the-art methods. The runtime complexity of our approach is also acceptable.
Yansheng WANG Leibo LIU Shouyi YIN Min ZHU Peng CAO Jun YANG Shaojun WEI
RCP (Reconfigurable Computing Processor) is intended to fill the gap between ASIC and GPP (General Purpose processor), which achieves much higher energy efficiency than GPP, while is much more flexible than ASIC. In this paper, one organization of on-chip data memory called LIBODM (LIfetime Based On-chip Data Memory) is proposed to reduce the reference delay for data and on-chip data memory size in RCP. In the LIBODM, the allocation of data is based on the data dependency. The data with low data dependency are stored off-chip to save the storage costs, while the data with high data dependency are stored on-chip to reduce the reference delay. Besides, in the LIBODM, the on-chip data are classified into two types, and the classification is based on the lifetime of data. For short lifetime data, they are preferred to be stored into FIFO to increase the reuse ratio of memory space naturally. For long lifetime data, they are preferred to be stored into RAM for several time references. The LIBODM has been testified in one CGRA (Coarse Grained Reconfigurable Architecture) called RPU (Reconfigurable Processing Unit), and two RPUs has been integrated in a RCP-REMUS_HP (High Performance version of Reconfigurable MUlti-media System) focused on video decoding. Thanks to the LIBODM, although the size of on-chip data memory in REMUS_HP is small, a high performance can still be achieved. Compared with XPP and ADRES, in REMUS_HP, the on-chip data memory size at same performance level is only 23.9% and 14.8%. REMUS_HP is implemented on a 48.9mm2 silicon with TSMC 65nm technology. Simulation shows that 1920*1088 @30fps can be achieved for H.264 high-profile decoding when exploiting a 200MHz working frequency. Compared with the high performance version of XPP, the performance is 150% boosted, while the energy efficiency is 17.59x boosted.
Min ZHU Leibo LIU Shouyi YIN Chongyong YIN Shaojun WEI
This paper introduces a cycle-accurate Simulator for a dynamically REconfigurable MUlti-media System, called SimREMUS. SimREMUS can either be used at transaction-level, which allows the modeling and simulation of higher-level hardware and embedded software, or at register transfer level, if the dynamic system behavior is desired to be observed at signal level. Trade-offs among a set of criteria that are frequently used to characterize the design of a reconfigurable computing system, such as granularity, programmability, configurability as well as architecture of processing elements and route modules etc., can be quickly evaluated. Moreover, a complete tool chain for SimREMUS, including compiler and debugger, is developed. SimREMUS could simulate 270 k cycles per second for million gates SoC (System-on-a-Chip) and produced one H.264 1080p frame in 15 minutes, which might cost days on VCS (platform: CPU: E5200@ 2.5 Ghz, RAM: 2.0 GB). Simulation showed that 1080p@30 fps of H.264 High Profile@ Level 4 can be achieved when exploiting a 200 MHz working frequency on the VLSI architecture of REMUS.
Dajiang LIU Shouyi YIN Chongyong YIN Leibo LIU Shaojun WEI
Reconfigurable computing system is a class of parallel architecture with the ability of computing in hardware to increase performance, while remaining much of flexibility of a software solution. This architecture is particularly suitable for running regular and compute-intensive tasks, nevertheless, most compute-intensive tasks spend most of their running time in nested loops. Polyhedron model is a powerful tool to give a reasonable transformation on such nested loops. In this paper, a number of issues are addressed towards the goal of optimization of affine loop nests for reconfigurable cell array (RCA), such as approach to make the most use of processing elements (PE) while minimizing the communication volume by loop transformation in polyhedron model, determination of tilling form by the intra-statement dependence analysis and determination of tilling size by the tilling form and the RCA size. Experimental results on a number of kernels demonstrate the effectiveness of the mapping optimization approaches developed. Compared with DFG-based optimization approach, the execution performances of 1-d jacobi and matrix multiplication are improved by 28% and 48.47%. Lastly, the run-time complexity is acceptable for the practical cases.
Shouyi YIN Chongyong YIN Leibo LIU Min ZHU Shaojun WEI
Coarse-grained reconfigurable architecture (CGRA) combines the performance of application-specific integrated circuits (ASICs) and the flexibility of general-purpose processors (GPPs), which is a promising solution for embedded systems. With the increasing complexity of reconfigurable resources (processing elements, routing cells, I/O blocks, etc.), the reconfiguration cost is becoming the performance bottleneck. The major reconfiguration cost comes from the frequent memory-read/write operations for transferring the configuration context from main memory to context buffer. To improve the overall performance, it is critical to reduce the amount of configuration context. In this paper, we propose a configuration context reduction method for CGRA. The proposed method exploits the structure correlation of computation tasks that are mapped onto CGRA and reduce the redundancies in configuration context. Experimental results show that the proposed method can averagely reduce the configuration context size up to 71% and speed up the execution up to 68%. The proposed method does not depend on any architectural feature and can be applied to CGRA with an arbitrary architecture.
Jienan ZHANG Shouyi YIN Peng OUYANG Leibo LIU Shaojun WEI
In this paper we propose a method to use features of an individual object to locate and recognize this object concurrently in a static image with Multi-feature fusion based on multiple objects sample library. This method is proposed based on the observation that lots of previous works focuses on category recognition and takes advantage of common characters of special category to detect the existence of it. However, these algorithms cease to be effective if we search existence of individual objects instead of categories in complex background. To solve this problem, we abandon the concept of category and propose an effective way to use directly features of an individual object as clues to detection and recognition. In our system, we import multi-feature fusion method based on colour histogram and prominent SIFT (p-SIFT) feature to improve detection and recognition accuracy rate. p-SIFT feature is an improved SIFT feature acquired by further feature extraction of correlation information based on Feature Matrix aiming at low computation complexity with good matching rate that is proposed by ourselves. In process of detecting object, we abandon conventional methods and instead take full use of multi-feature to start with a simple but effective way-using colour feature to reduce amounts of patches of interest (POI). Our method is evaluated on several publicly available datasets including Pascal VOC 2005 dataset, Objects101 and datasets provided by Achanta et al.
Shouyi YIN Zhongfu SUN Leibo LIU Shaojun WEI
Motivated by the needs of modern agriculture, in this paper we present CropNET, a wireless multimedia sensor network system for agriculture monitoring. Both hardware and software designs of CropNET are tailored for sensing in wide farmland without human supervision. We have carried out multiple rounds of deployments. The evaluation results show that CropNET performs well and facilitates modern agriculture.
Peng OUYANG Shouyi YIN Leibo LIU Shaojun WEI
More and more mobile devices adopt multi-battery and dynamic voltage scaling policy (DVS) to reduce the energy consumption and extend the battery runtime. However, since the nonlinear characteristics of the multi-battery are not considered, the practical efficiency is not good enough. In order to reduce the energy consumption and extend the battery runtime, this paper proposes an approach based on the battery characteristics to implement the co-optimization of the multi-battery scheduling and dynamic voltage scaling on multi-battery powered systems. In this work, considering the nonlinear discharging characteristics of the existing batteries, we use the Markov process to depict the multi-battery discharging behavior, and build a multi-objective optimal model to denote the energy consumption and battery states, then propose a binary tree based algorithm to solve this model. By means of this method, we get an optimal and applicable scheme about multi-battery scheduling and dynamic voltage scaling. Experimental results show that this approach achieves an average improvement in battery runtime of 17.5% over the current methods in physical implementation.
Tongsheng GENG Leibo LIU Shouyi YIN Min ZHU Shaojun WEI
This paper proposes approaches to perform HW/SW (Hardware/Software) partition and parallelization of computing-intensive tasks of the H.264 HiP (High Profile) decoding algorithm on an embedded coarse-grained reconfigurable multimedia system, called REMUS (REconfigurable MUltimedia System). Several techniques, such as MB (Macro-Block) based parallelization, unfixed sub-block operation etc., are utilized to speed up the decoding process, satisfying the requirements of real-time and high quality H.264 applications. Tests show that the execution performance of MC (Motion Compensation), deblocking, and IDCT-IQ (Inverse Discrete Cosine Transform-Inverse Quantization) on REMUS is improved by 60%, 73%, 88.5% in the typical case and 60%, 69%, 88.5% in the worst case, respectively compared with that on XPP PACT (a commercial reconfigurable processor). Compared with ASIC solutions, the performance of MC is improved by 70%, 74% in the typical and in the worst case, respectively, while those of Deblocking remain the same. As for IDCT_IQ, the performance is improved by 17% no matter in the typical or worst case. Relying on the proposed techniques, 1080p@30 fps of H.264 HiP@ Level 4 decoding could be achieved on REMUS when utilizing a 200 MHz working frequency.
Shouyi YIN Rui SHI Leibo LIU Shaojun WEI
Coarse-grained Reconfigurable Architecture (CGRA) is a parallel computing platform that provides both high performance of hardware and high flexibility of software. It is becoming a promising platform for embedded and mobile applications. Since the embedded and mobile devices are usually battery-powered, improving battery lifetime becomes one of the primary design issues in using CGRAs. In this paper, we propose a battery-aware task-mapping method to optimize energy consumption and improve battery lifetime. The proposed method mainly addresses two problems: task partitioning and task scheduling when mapping applications onto CGRA. The task partitioning and scheduling are formulated as a joint optimization problem of minimizing the energy consumption. The nonlinear effects of real battery are taken into account in problem formulation. Using the insights from the problem formulation, we design the task-mapping algorithm. We have used several real-world benchmarks to test the effectiveness of the proposed method. Experiment results show that our method can dramatically lower the energy consumption and prolong the battery-life.
Zhen ZHANG Shouyi YIN Leibo LIU Shaojun WEI
TSV-interconnected 3D chips face problems such as high cost, low yield and large power dissipation. We propose a wireless 3D on-chip-network architecture for application-specific SoC design, using inductive-coupling interconnect instead of TSV for inter-layer communication. Primary design challenge of inductive-coupling 3D SoC is allocating wireless links in the 3D on-chip network effectively. We develop a design flow fully exploiting the design space brought by wireless links while providing flexible tradeoff for user's choice. Experimental results show that our design brings great improvement over uniform design and Sunfloor algorithm on latency (5% to 20%) and power consumption (10% to 45%).
Leibo LIU Dong WANG Yingjie CHEN Min ZHU Shouyi YIN Shaojun WEI
This paper presents the design of a multiple-standard 1080 high definition (HD) video decoder on a mixed-grained reconfigurable computing platform integrating coarse-grained reconfigurable processing units (RPUs) and FPGAs. The proposed RPU, including 16×16 multi-functional processing elements (PEs), is used to accelerate compute-intensive tasks in the video decoding. A soft-core-based microprocessor array is implemented on the FPGA and adopted to speed-up the dynamic reconfiguration of the RPU. Furthermore, a mail-box-based communication scheme is utilized to improve the communication efficiency between RPUs and FPGAs. By exploiting dynamic reconfiguration of the RPUs and static reconfiguration of the FPGAs, the proposed platform achieves scalable performances and cost trade-offs to support a variety of video coding standards, including MPEG-2, AVS, H.264, and HEVC. The measured results show that the proposed platform can support H.264 1080 HD video streams at up to 57 frames per second (fps) and HEVC 1080 HD video streams at up to 52fps under 250MHz, at the same time, it achieves a 3.6× performance gain over an industrial coarse-grained reconfigurable processor for H.264 decoding, and a 6.43× performance boosts over a general purpose processor based implementation for HEVC decoding.
Yu PENG Shouyi YIN Leibo LIU Shaojun WEI
Coarse-grained Reconfigurable Architecture (CGRA) is a promising mobile computing platform that provides both high performance and high energy efficiency. In an application, loop nests are usually mapped onto CGRA for further acceleration, so optimizing the mapping is an important goal for design of CGRAs. Moreover, obviously almost all of mobile devices are powered by batteries, how to reduce energy consumption also becomes one of primary concerns in using CGRAs. This paper makes three contributions: a) Proposing an energy consumption model for CGRA; b) Formulating loop nests mapping problem to minimize the battery charge loss; c) Extract an efficient heuristic algorithm called BPMap. Experiment results on most kernels of the benchmarks and real-life applications show that our methods can improve the performance of the kernels and lower the energy consumption.
Peng OUYANG Shouyi YIN Hui GAO Leibo LIU Shaojun WEI
Scale Invariant Feature Transform (SIFT) algorithm is a very excellent approach for feature detection. It is characterized by data intensive computation. The current studies of accelerating SIFT algorithm are mainly reflected in three aspects: optimizing the parallel parts of the algorithm based on general-purpose multi-core processors, designing the customized multi-core processor dedicated for SIFT, and implementing it based on the FPGA platform. The real-time performance of SIFT has been highly improved. However, the factors such as the input image size, the number of octaves and scale factors in the SIFT algorithm are restricted for some solutions, the flexibility that ensures the high execution performance under variable factors should be improved. This paper proposes a reconfigurable solution to solve this problem. We fully exploit the algorithm and adopt several techniques, such as full parallel execution, block computation and CORDIC transformation, etc., to improve the execution efficiency on a REconfigurable MUltimedia System called REMUS. Experimental results show that the execution performance of the SIFT is improved by 33%, 50% and 8 times comparing with that executed in the multi-core platform, FPGA and ASIC separately. The scheme of dynamic reconfiguration in this work can configure the circuits to meet the computation requirements under different input image size, different number of octaves and scale factors in the process of computing.