Peng CAO Chao WANG Longxing SHI
The line-based method is one of the most commonly used approaches to hardware implementation of the two-dimensional (2D) discrete wavelet transform (DWT). However, a data buffer is required between the row DWT processor and the column DWT processor to resolve the data flow mismatch, which increases the on-chip memory size and the output latency. Since the incompatible data flow stems from an intrinsic property of the adopted lifting-based algorithm, a decomposed lifting algorithm (DLA) is presented that rearranges the data path of the lifting steps so that image data is processed in raster-scan order in both the row processor and the column processor. Theoretical analysis indicates that DLA outperforms other lifting-based algorithms in precision, in terms of round-off noise and internal word-length. A memory-efficient and high-performance line-based architecture is proposed based on DLA without a data buffer. For an N × M image, only 2N internal memory is required for the 5/3 filter and 4N for the 9/7 filter to perform the 2D DWT, where N and M denote the width and height of the image. Compared with related 2D DWT architectures, the on-chip memory size is reduced significantly under the same arithmetic cost, memory bandwidth and timing constraint. This design was implemented in an SMIC 0.18 µm CMOS logic process with 32 kbits of dual-port RAM and 20 K equivalent 2-input NAND gates in a 1.00 mm × 1.00 mm die, and can process a 512 × 512 image at 100 MHz.
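For readers unfamiliar with lifting, the following is a minimal sketch of one level of the standard reversible 5/3 lifting transform on a 1D signal. It is not the DLA described in the abstract, whose rearranged data path is not given here; the function names and the mirror handling at the borders are assumptions made for illustration.

```python
def dwt53_forward(x):
    """One level of the standard reversible 5/3 lifting DWT (even-length int list)."""
    n = len(x)
    assert n >= 2 and n % 2 == 0
    even, odd = x[0::2], x[1::2]
    half = n // 2
    # Predict step: high-pass d[i] = odd[i] - floor((even[i] + even[i+1]) / 2),
    # mirroring the last even sample at the right border.
    d = [odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1) for i in range(half)]
    # Update step: low-pass s[i] = even[i] + floor((d[i-1] + d[i] + 2) / 4),
    # mirroring the first detail sample at the left border.
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    return s, d


def dwt53_inverse(s, d):
    """Undo dwt53_forward by running the lifting steps in reverse order."""
    half = len(s)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    odd = [d[i] + ((even[i] + even[min(i + 1, half - 1)]) >> 1) for i in range(half)]
    x = [0] * (2 * half)
    x[0::2] = even
    x[1::2] = odd
    return x


if __name__ == "__main__":
    samples = [3, 7, 1, 8, 2, 9, 4, 6]
    low, high = dwt53_forward(samples)
    # The integer lifting steps are exactly invertible.
    print(dwt53_inverse(low, high) == samples)
```

In a line-based 2D design, a transform of this kind is applied along rows and then along columns; the column pass needs neighbouring rows, which is why line memory proportional to the image width N appears in the memory figures quoted above.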
Gugang GAO Peng CAO Jun YANG Longxing SHI
One of the largest challenges for coarse-grained reconfigurable arrays (CGRAs) is how to map applications efficiently. The key issues for mapping are (1) how to reduce the memory bandwidth, (2) how to exploit parallelism in algorithms and (3) how to achieve load balancing and take full advantage of the hardware potential. In this paper, we propose a novel parallelism scheme, called ‘Hybrid partitioning’, for mapping an H.264 high definition (HD) decoder onto REMUS-II, a CGRA system-on-chip (SoC). Combining the good features of data partitioning and task partitioning, our methodology consists of three levels from top to bottom: (1) a hybrid task pipeline based on the slice and macroblock (MB) levels; (2) MB row-level data parallelism; (3) a sub-MB level parallelism method. Further, on the sub-MB level, we propose several mapping strategies, such as hybrid variable block size motion compensation (Hybrid VBSMC) for MC, 2D-wave for intra 4×4, and a parallel processing order for deblocking. With our mapping strategies, we improved the algorithm's performance on REMUS-II. For example, with a luma 16×16 MB, the Hybrid VBSMC achieves 4 times greater performance than VBSMC and 2.2 times greater performance than the fixed 4×4 partition approach. Finally, we achieve 1080p@33fps H.264 high-profile (HiP)@level 4.1 decoding when the working frequency of REMUS-II is 200 MHz. Compared with typical hardware platforms, we achieve better performance, area, and flexibility. For example, our design achieves approximately a 175% performance improvement over the commercial CGRA processor XPP-III while using only 70% of its area.
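The 2D-wave idea mentioned for intra 4×4 can be sketched as a wavefront schedule over the sixteen 4×4 blocks of a luma macroblock. This is an illustrative sketch, not the mapping used on REMUS-II: it assumes, conservatively, that each block depends on its left, top-left, top and top-right neighbours, and the function name `wavefront_schedule` is hypothetical.

```python
def wavefront_schedule(rows=4, cols=4):
    """Group blocks (row, col) into steps that can run in parallel.

    Assumed dependencies: left, top-left, top and top-right neighbours.
    With that pattern, block (r, c) becomes ready at step c + 2 * r.
    """
    steps = {}
    for r in range(rows):
        for c in range(cols):
            steps.setdefault(c + 2 * r, []).append((r, c))
    return [steps[t] for t in sorted(steps)]


if __name__ == "__main__":
    for t, blocks in enumerate(wavefront_schedule()):
        print(t, blocks)
```

Under these assumptions, the sixteen blocks complete in 10 wavefront steps instead of 16 sequential ones, with at most two blocks active per step.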
Hung K. NGUYEN Peng CAO Xue-Xiang WANG Jun YANG Longxing SHI Min ZHU Leibo LIU Shaojun WEI
REMUS-II (REconfigurable MUltimedia System 2) is a coarse-grained dynamically reconfigurable computing system for multimedia and communication baseband processing. This paper proposes a real-time H.264 baseline profile encoder on REMUS-II. First, we propose an overall flow for mapping algorithms onto the REMUS-II platform and illustrate it by implementing the H.264 encoder. Second, parallel and pipelining techniques are used to fully exploit the abundant computing resources of REMUS-II, thus increasing total computing throughput and addressing the high computational complexity of the H.264 encoder. In addition, data-reuse schemes are used to increase the data-reuse ratio and therefore reduce the required data bandwidth. Third, we propose a scheduling scheme to manage run-time reconfiguration of the system. The scheduler is also responsible for synchronizing data communication between tasks and handling conflicts over hardware resources. Experimental results show that the REMUS-MB (REMUS-II version for mobile applications) system can run an H.264/AVC baseline profile encoder in real time. The encoder can encode CIF@30 fps video sequences with two reference frames and a maximum search range of [-16,15]. The implementation can therefore be applied to handheld devices targeted at mobile multimedia applications. The REMUS-MB platform is designed and synthesized using TSMC 65 nm low-power technology. The die size of REMUS-MB is 13.97 mm². REMUS-MB consumes, on average, about 100 mW while working at 166 MHz. To our knowledge, this is the first implementation in the literature of the H.264 encoding algorithm on a coarse-grained dynamically reconfigurable computing system.
Xinning LIU Chen MEI Peng CAO Min ZHU Longxing SHI
This paper proposes a novel sub-architecture to optimize the data flow of REMUS-II (REconfigurable MUltimedia System 2), a dynamically reconfigurable coarse-grained architecture. REMUS-II consists of a µPU (Micro-Processor Unit) and two RPUs (Reconfigurable Processor Units), which are used to speed up control-intensive tasks and data-intensive tasks respectively. The parallel computing capability and flexibility of REMUS-II make it an excellent candidate for processing multimedia applications, which require a large number of memory accesses. In this paper, we specifically optimize the data flow to deal with these performance-limiting and energy-hungry memory accesses in order to meet the bandwidth requirement of parallel computing. The RPU internal memory can work in multiple modes, such as 2D-access mode and transformation mode, according to different multimedia access patterns. This novel design can improve performance by up to 26% compared to traditional on-chip memory. Meanwhile, a block buffer is implemented to optimize the off-chip data flow by reducing off-chip memory accesses, which are reduced by up to 43% compared to direct DDR access. Based on RTL simulation, REMUS-II can achieve 1080p@30 fps for H.264 High Profile@Level 4 and High Level MPEG-2 at a 200 MHz clock frequency. REMUS-II is implemented in 23.7 mm² of silicon on a TSMC 65 nm logic process with a 400 MHz maximum working frequency.
Bo LIU Peng CAO Min ZHU Jun YANG Leibo LIU Shaojun WEI Longxing SHI
This paper presents a novel architecture design to optimize the reconfiguration process of a coarse-grained reconfigurable architecture (CGRA) called Reconfigurable Multimedia System II (REMUS-II). In REMUS-II, the tasks in multimedia applications are divided into two parts: computing-intensive tasks and control-intensive tasks. REMUS-II contains two Reconfigurable Processor Units (RPUs) for accelerating computing-intensive tasks and a Micro-Processor Unit (µPU) for accelerating control-intensive tasks. As a large-scale CGRA, REMUS-II can provide satisfactory solutions in terms of both efficiency and flexibility. This makes REMUS-II well suited to video processing, which demands high flexibility and involves a large number of computation tasks. To meet the high dynamic reconfiguration performance required by multimedia applications, the reconfiguration architecture of REMUS-II must be carefully designed. To this end, a hierarchical configuration storage structure and a 3-stage reconfiguration processing structure are proposed. Furthermore, several optimization methods for configuration reuse are introduced to further improve the performance of the reconfiguration process. These fall into two categories: the multi-target reconfiguration method and the configuration caching strategies. Experimental results show that the proposed reconfiguration architecture improves the performance of the reconfiguration process by 4 times. Based on RTL simulation, REMUS-II can support 1080p@32 fps H.264 HiP@Level4 and 1080p@40 fps High-level MPEG-2 stream decoding at a clock frequency of 200 MHz. The proposed REMUS-II system has been implemented on a TSMC 65 nm process. The die size is 23.7 mm² and the estimated on-chip dynamic power is 620 mW.
Xinning LIU Yuxiang NIU Jun YANG Peng CAO
TTFF (Time-To-First-Fix) is an important indicator of GPS receiver performance and must be reduced as much as possible. Bit synchronization is a precondition of positioning and therefore affects TTFF. Frequency error leads to power loss, which makes it difficult to find the bit edge. Conventional bit synchronization methods work well only when the frequency error is zero or very small, so the bit synchronization process is generally carried out after the pull-in stage, when the carrier loop is already stable. In this paper, a new bit synchronization method based on frequency compensation is proposed. By compensating for the frequency error, the new method reduces the signal power loss that accumulates during coherent integration. The performance of the new method is compared across different frequency error scenarios. The parameters of the proposed method are analyzed and optimized to reduce the computational complexity. Simulation results show that the new method performs well when the frequency error is less than 25 Hz. Test results show that the new method can tolerate dynamic frequency errors, making it possible to move bit synchronization into the pull-in process to reduce the TTFF.
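The power loss that motivates frequency compensation can be illustrated with the standard textbook model in which a constant frequency error Δf over a coherent integration time T attenuates the correlation amplitude by |sin(πΔfT)/(πΔfT)|. The sketch below only evaluates that model; it is not the proposed bit synchronization method, and the function name `coherent_loss_db` and the choice of T = 20 ms (one GPS L1 C/A navigation bit) are assumptions for illustration.

```python
import math


def coherent_loss_db(freq_error_hz, t_coh_s):
    """Power loss in dB from a constant frequency error during coherent integration.

    Uses the sinc attenuation model |sin(pi*df*T) / (pi*df*T)| for the amplitude.
    """
    x = math.pi * freq_error_hz * t_coh_s
    if x == 0:
        return 0.0
    amplitude = abs(math.sin(x) / x)
    if amplitude == 0:
        return float("-inf")
    return 20 * math.log10(amplitude)


if __name__ == "__main__":
    # Loss over one 20 ms navigation bit for a few frequency errors.
    for df in (0, 5, 10, 25):
        print(df, "Hz:", round(coherent_loss_db(df, 0.020), 2), "dB")
```

Under this model, a 25 Hz error over 20 ms already costs close to 4 dB, which is consistent with the abstract's point that uncompensated frequency error makes the bit edge hard to find.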
Meng ZHANG Tinghuan CHEN Xuchao SHI Peng CAO
The development of image acquisition and display technology provides the basis for the popularization of high-resolution images. On the other hand, the available bandwidth is not always sufficient to stream such high-resolution images. Down- and up-sampling, which reduces the data volume of an image and later restores it to high resolution, is one solution for transmitting high-resolution images. In this paper, motivated by the observation that the high-frequency DCT components are sparse in the spatial domain, we propose a scheme that combines the Discrete Cosine Transform (DCT) and Compressed Sensing (CS) to achieve arbitrary-ratio down-sampling. Our proposed scheme makes use of two properties: first, the energy of an image is concentrated in the low-frequency DCT components; second, the high-frequency DCT components are sparse in the spatial domain. The scheme is able to preserve most of the information and avoids estimating the high-frequency components completely blindly. Experimental results show that the proposed down- and up-sampling scheme outperforms several state-of-the-art schemes in terms of peak signal-to-noise ratio (PSNR), structural similarity index measurement (SSIM) and processing time.
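The first property, energy concentration in the low-frequency DCT components, is what makes arbitrary-ratio down-sampling in the DCT domain possible. The following is a minimal 1D sketch of plain DCT truncation under that assumption; it does not include the compressed-sensing part of the proposed scheme, and the function names are hypothetical.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a list of numbers."""
    n = len(x)
    out = []
    for k in range(n):
        acc = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * acc)
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for i in range(n):
        acc = c[0] * math.sqrt(1 / n)
        acc += sum(
            c[k] * math.sqrt(2 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(acc)
    return out


def dct_downsample(x, m):
    """Resample len(x) samples to m <= len(x) by keeping the m lowest DCT terms."""
    n = len(x)
    assert 1 <= m <= n
    gain = math.sqrt(m / n)  # keeps the mean level unchanged
    return idct([gain * c for c in dct(x)[:m]])


if __name__ == "__main__":
    signal = [10, 12, 14, 16, 18, 20, 22, 24]
    # 8 samples down to 3: the ratio does not need to be an integer.
    print([round(v, 2) for v in dct_downsample(signal, 3)])
```

Because the output length m is simply the number of retained coefficients, any ratio m/n is available; the discarded high-frequency terms are what an up-sampler would then have to estimate.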
Yansheng WANG Leibo LIU Shouyi YIN Min ZHU Peng CAO Jun YANG Shaojun WEI
RCP (Reconfigurable Computing Processor) is intended to fill the gap between ASIC and GPP (General Purpose Processor), achieving much higher energy efficiency than a GPP while being much more flexible than an ASIC. In this paper, an on-chip data memory organization called LIBODM (LIfetime Based On-chip Data Memory) is proposed to reduce the data reference delay and the on-chip data memory size in RCP. In LIBODM, the allocation of data is based on data dependency. Data with low data dependency are stored off-chip to save storage cost, while data with high data dependency are stored on-chip to reduce the reference delay. In addition, on-chip data are classified into two types according to their lifetime. Short-lifetime data are preferably stored in a FIFO, which naturally increases the reuse ratio of memory space. Long-lifetime data are preferably stored in RAM so that they can be referenced several times. LIBODM has been verified in a CGRA (Coarse Grained Reconfigurable Architecture) called RPU (Reconfigurable Processing Unit), and two RPUs have been integrated in an RCP, REMUS_HP (High Performance version of Reconfigurable MUlti-media System), which focuses on video decoding. Thanks to LIBODM, high performance can still be achieved although the on-chip data memory in REMUS_HP is small. Compared with XPP and ADRES, the on-chip data memory size of REMUS_HP at the same performance level is only 23.9% and 14.8%, respectively. REMUS_HP is implemented on 48.9 mm² of silicon with TSMC 65 nm technology. Simulation shows that 1920×1088@30 fps can be achieved for H.264 high-profile decoding at a 200 MHz working frequency. Compared with the high-performance version of XPP, performance is improved by 150% and energy efficiency by 17.59×.
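The placement rule described for LIBODM can be summarised as a small decision function. This is only a sketch of the rule as stated in the abstract: the abstract does not say how dependency or lifetime are measured, so the reference count used as a dependency proxy, the lifetime in cycles, both thresholds, and the function name `place_data` are hypothetical.

```python
def place_data(ref_count, lifetime_cycles, min_refs=2, fifo_max_lifetime=64):
    """Choose a storage location following the rule stated in the abstract.

    ref_count: hypothetical proxy for how strongly other data depend on this item.
    lifetime_cycles: hypothetical span between first write and last read.
    Both thresholds are illustrative values, not figures from the paper.
    """
    if ref_count < min_refs:
        return "off-chip"  # low dependency: save on-chip storage
    if lifetime_cycles <= fifo_max_lifetime:
        return "fifo"      # short lifetime: FIFO space is reused naturally
    return "ram"           # long lifetime: keep for repeated references


if __name__ == "__main__":
    print(place_data(ref_count=1, lifetime_cycles=10))
    print(place_data(ref_count=4, lifetime_cycles=16))
    print(place_data(ref_count=4, lifetime_cycles=5000))
```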