Author Search Result

[Author] Peng CAO(8hit)

1-8hit
  • Memory-Efficient and High-Performance Two-Dimensional Discrete Wavelet Transform Architecture Based on Decomposed Lifting Algorithm

    Peng CAO  Chao WANG  Longxing SHI  

     
    PAPER-Digital Signal Processing

      Vol:
    E92-A No:8
      Page(s):
    2000-2008

    The line-based method has been one of the most commonly-used methods of hardware implementation of two-dimensional (2D) discrete wavelet transform (DWT). However, data buffer is required between the row DWT processor and the column DWT processor to solve the data flow mismatch, which increases the on-chip memory size and the output latency. Since the incompatible data flow is induced from the intrinsic property of adopted lifting-based algorithm, a decomposed lifting algorithm (DLA) is presented by rearranging the data path of lifting steps to ensure that image data is processed in raster scan manner in row processor and column processor. Theoretical analysis indicates that the precision issue of DLA outperforms other lifting-based algorithms in terms of round-off noise and internal word-length. A memory-efficient and high-performance line-based architecture is proposed based on DLA without the implementation of data buffer. For an N M image, only 2N internal memory is required for 5/3 filter and 4N of that is required for 9/7 filter to perform 2D DWT, where N and M indicate the width and height of an image. Compared with related 2D DWT architectures, the size of on-chip memory is reduced significantly under the same arithmetic cost, memory bandwidth and timing constraint. This design was implemented in SMIC 0.18 µm CMOS logic fabrication with 32 kbits dual-port RAM and 20 K equivalent 2-input NAND gates in a 1.00 mm 1.00 mm die, which can process 512 512 image under 100 MHz.

  • Parallelism Analysis of H.264 Decoder and Realization on a Coarse-Grained Reconfigurable SoC

    Gugang GAO  Peng CAO  Jun YANG  Longxing SHI  

     
    PAPER-Application

      Vol:
    E96-D No:8
      Page(s):
    1654-1666

    One of the largest challenges for coarse-grained reconfigurable arrays (CGRAs) is how to efficiently map applications. The key issues for mapping are (1) how to reduce the memory bandwidth, (2) how to exploit parallelism in algorithms and (3) how to achieve load balancing and take full advantage of the hardware potential. In this paper, we propose a novel parallelism scheme, called ‘Hybrid partitioning’, for mapping a H.264 high definition (HD) decoder onto REMUS-II, a CGRA system-on-chip (SoC). Combining good features of data partitioning and task partitioning, our methodology mainly consists of three levels from top to bottom: (1) hybrid task pipeline based on slice and macroblock (MB) level; (2) MB row-level data parallelism; (3) sub-MB level parallelism method. Further, on the sub-MB level, we propose a few mapping strategies such as hybrid variable block size motion compensation (Hybrid VBSMC) for MC, 2D-wave for intra 44, parallel processing order for deblocking. With our mapping strategies, we improved the algorithm's performance on REMUS-II. For example, with a luma 1616 MB, the Hybrid VBSMC achieves 4 times greater performance than VBSMC and 2.2 times greater performance than fixed 44 partition approach. Finally, we achieve 1080p@33fps H.264 high-profile (HiP)@level 4.1 decoding when the working frequency of REMUS-II is 200 MHz. Compared with typical hardware platforms, we can achieve better performance, area, and flexibility. For example, our performance achieves approximately 175% improvement than that of a commercial CGRA processor XPP-III while only using 70% of its area.

  • Hardware Software Co-design of H.264 Baseline Encoder on Coarse-Grained Dynamically Reconfigurable Computing System-on-Chip

    Hung K. NGUYEN  Peng CAO  Xue-Xiang WANG  Jun YANG  Longxing SHI  Min ZHU  Leibo LIU  Shaojun WEI  

     
    PAPER-Computer System

      Vol:
    E96-D No:3
      Page(s):
    601-615

    REMUS-II (REconfigurable MUltimedia System 2) is a coarse-grained dynamically reconfigurable computing system for multimedia and communication baseband processing. This paper proposes a real-time H.264 baseline profile encoder on REMUS-II. First, we propose an overall mapping flow for mapping algorithms onto the platform of REMUS-II system and then illustrate it by implementing the H.264 encoder. Second, parallel and pipelining techniques are considered for fully exploiting the abundant computing resources of REMUS-II, thus increasing total computing throughput and solving high computational complexity of H.264 encoder. Besides, some data-reuse schemes are also used to increase data-reuse ratio and therefore reduce the required data bandwidth. Third, we propose a scheduling scheme to manage run-time reconfiguration of the system. The scheduling is also responsible for synchronizing the data communication between tasks and handling conflict between hardware resources. Experimental results prove that the REMUS-MB (REMUS-II version for mobile applications) system can perform a real-time H.264/AVC baseline profile encoder. The encoder can encode CIF@30 fps video sequences with two reference frames and maximum search range of [-16,15]. The implementation, thereby, can be applied to handheld devices targeted at mobile multimedia applications. The platform of REMUS-MB system is designed and synthesized by using TSMC 65 nm low power technology. The die size of REMUS-MB is 13.97 mm2. REMUS-MB consumes, on average, about 100 mW while working at 166 MHz. To my knowledge, in the literature this is the first implementation of H.264 encoding algorithm on a coarse-grained dynamically reconfigurable computing system.

  • Date Flow Optimization of Dynamically Coarse Grain Reconfigurable Architecture for Multimedia Applications

    Xinning LIU  Chen MEI  Peng CAO  Min ZHU  Longxing SHI  

     
    PAPER-Design Methodology

      Vol:
    E95-D No:2
      Page(s):
    374-382

    This paper proposes a novel sub-architecture to optimize the data flow of REMUS-II (REconfigurable MUltimedia System 2), a dynamically coarse grain reconfigurable architecture. REMUS-II consists of a µPU (Micro-Processor Unit) and two RPUs (Reconfigurable Processor Unit), which are used to speeds up control-intensive tasks and data-intensive tasks respectively. The parallel computing capability and flexibility of REMUS-II makes itself an excellent candidate to process multimedia applications, which require a large amount of memory accesses. In this paper, we specifically optimize the data flow to deal with those performance-hazard and energy-hungry memory accessing in order to meet the bandwidth requirement of parallel computing. The RPU internal memory could work in multiple modes, like 2D-access mode and transformation mode, according to different multimedia access patterns. This novel design can improve the performance up to 26% compared to traditional on-chip memory. Meanwhile, the block buffer is implemented to optimize the off-chip data flow through reducing off-chip memory accesses, which reducing up to 43% compared to direct DDR access. Based on RTL simulation, REMUS-II can achieve 1080p@30 fps of H.264 High Profile@ Level 4 and High Level MPEG2 at 200 MHz clock frequency. The REMUS-II is implemented into 23.7 mm2 silicon on TSMC 65 nm logic process with a 400 MHz maximum working frequency.

  • Reconfiguration Process Optimization of Dynamically Coarse Grain Reconfigurable Architecture for Multimedia Applications

    Bo LIU  Peng CAO  Min ZHU  Jun YANG  Leibo LIU  Shaojun WEI  Longxing SHI  

     
    PAPER-Computer System

      Vol:
    E95-D No:7
      Page(s):
    1858-1871

    This paper presents a novel architecture design to optimize the reconfiguration process of a coarse-grained reconfigurable architecture (CGRA) called Reconfigurable Multimedia System II ( REMUS-II ). In REMUS-II, the tasks in multi-media applications are divided into two parts: computing-intensive tasks and control-intensive tasks. Two Reconfigurable Processor Units (RPUs) for accelerating computing-intensive tasks and a Micro-Processor Unit (µPU) for accelerating control-intensive tasks are contained in REMUS-II. As a large-scale CGRA, REMUS-II can provide satisfying solutions in terms of both efficiency and flexibility. This feature makes REMUS-II well-suited for video processing, where higher flexibility requirements are posed and a lot of computation tasks are involved. To meet the high requirement of the dynamic reconfiguration performance for multimedia applications, the reconfiguration architecture of REMUS-II should be well designed. To optimize the reconfiguration architecture of REMUS-II, a hierarchical configuration storage structure and a 3-stage reconfiguration processing structure are proposed. Furthermore, several optimization methods for configuration reusing are also introduced, to further improve the performance of reconfiguration process. The optimization methods include two aspects: the multi-target reconfiguration method and the configuration caching strategies. Experimental results showed that, with the reconfiguration architecture proposed, the performance of reconfiguration process will be improved by 4 times. Based on RTL simulation, REMUS-II can support the 1080p@32 fps of H.264 HiP@Level4 and 1080p@40 fps High-level MPEG-2 stream decoding at the clock frequency of 200 MHz. The proposed REMUS-II system has been implemented on a TSMC 65 nm process. The die size is 23.7 mm2 and the estimated on-chip dynamic power is 620 mW.

  • A GPS Bit Synchronization Method Based on Frequency Compensation

    Xinning LIU  Yuxiang NIU  Jun YANG  Peng CAO  

     
    PAPER-Navigation, Guidance and Control Systems

      Vol:
    E98-B No:4
      Page(s):
    746-753

    TTFF (Time-To-First-Fix) is an important indicator of GPS receiver performance, and must be reduced as much as possible. Bit synchronization is the pre-condition of positioning, which affects TTFF. The frequency error leads to power loss, which makes it difficult to find the bit edge. The conventional bit synchronization methods only work well when there is no or very small frequency error. The bit synchronization process is generally carried out after the pull-in stage, where the carrier loop is already stable. In this paper, a new bit synchronization method based on frequency compensation is proposed. Through compensating the frequency error, the new method reduces the signal power loss caused by the accumulation of coherent integration. The performances of the new method in different frequency error scenarios are compared. The parameters in the proposed method are analyzed and optimized to reduce the computational complexity. Simulation results show that the new method has good performance when the frequency error is less than 25Hz. Test results show that the new method can tolerate dynamic frequency errors, and it is possible to move the bit synchronization to the pull-in process to reduce the TTFF.

  • Image Arbitrary-Ratio Down- and Up-Sampling Scheme Exploiting DCT Low Frequency Components and Sparsity in High Frequency Components

    Meng ZHANG  Tinghuan CHEN  Xuchao SHI  Peng CAO  

     
    PAPER-Image Processing and Video Processing

      Pubricized:
    2015/10/30
      Vol:
    E99-D No:2
      Page(s):
    475-487

    The development of image acquisition technology and display technology provide the base for popularization of high-resolution images. On the other hand, the available bandwidth is not always enough to data stream such high-resolution images. Down- and up-sampling, which decreases the data volume of images and increases back to high-resolution images, is a solution for the transmission of high-resolution images. In this paper, motivated by the observation that the high-frequency DCT components are sparse in the spatial domain, we propose a scheme combined with Discrete Cosine Transform (DCT) and Compressed Sensing (CS) to achieve arbitrary-ratio down-sampling. Our proposed scheme makes use of two properties: First, the energy of a image concentrates on the low-frequency DCT components. Second, the high-frequency DCT components are sparse in the spatial domain. The scheme is able to preserve the most information and avoid absolutely blindly estimating the high-frequency components. Experimental results show that the proposed down- and up-sampling scheme produces better performance compared with some state-of-the-art schemes in terms of peak signal to noise ratio (PSNR), structural similarity index measurement (SSIM) and processing time.

  • The Organization of On-Chip Data Memory in One Coarse-Grained Reconfigurable Architecture

    Yansheng WANG  Leibo LIU  Shouyi YIN  Min ZHU  Peng CAO  Jun YANG  Shaojun WEI  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E96-A No:11
      Page(s):
    2218-2229

    RCP (Reconfigurable Computing Processor) is intended to fill the gap between ASIC and GPP (General Purpose processor), which achieves much higher energy efficiency than GPP, while is much more flexible than ASIC. In this paper, one organization of on-chip data memory called LIBODM (LIfetime Based On-chip Data Memory) is proposed to reduce the reference delay for data and on-chip data memory size in RCP. In the LIBODM, the allocation of data is based on the data dependency. The data with low data dependency are stored off-chip to save the storage costs, while the data with high data dependency are stored on-chip to reduce the reference delay. Besides, in the LIBODM, the on-chip data are classified into two types, and the classification is based on the lifetime of data. For short lifetime data, they are preferred to be stored into FIFO to increase the reuse ratio of memory space naturally. For long lifetime data, they are preferred to be stored into RAM for several time references. The LIBODM has been testified in one CGRA (Coarse Grained Reconfigurable Architecture) called RPU (Reconfigurable Processing Unit), and two RPUs has been integrated in a RCP-REMUS_HP (High Performance version of Reconfigurable MUlti-media System) focused on video decoding. Thanks to the LIBODM, although the size of on-chip data memory in REMUS_HP is small, a high performance can still be achieved. Compared with XPP and ADRES, in REMUS_HP, the on-chip data memory size at same performance level is only 23.9% and 14.8%. REMUS_HP is implemented on a 48.9mm2 silicon with TSMC 65nm technology. Simulation shows that 1920*1088 @30fps can be achieved for H.264 high-profile decoding when exploiting a 200MHz working frequency. Compared with the high performance version of XPP, the performance is 150% boosted, while the energy efficiency is 17.59x boosted.

FlyerIEICE has prepared a flyer regarding multilingual services. Please use the one in your native language.