Author Search Result

[Author] Xiaoyang ZENG(23hit)

1-20hit(23hit)

  • A Fully Programmable Reed-Solomon Decoder on a Multi-Core Processor Platform

    Bei HUANG  Kaidi YOU  Yun CHEN  Zhiyi YU  Xiaoyang ZENG  

     
    PAPER-Computer Architecture

      Vol:
    E95-D No:12
      Page(s):
    2939-2947

    Reed-Solomon (RS) codes are widely used in digital communication and storage systems. Unlike the usual VLSI approaches, this paper presents a high-throughput, fully programmable Reed-Solomon decoder on a multi-core processor. The multi-core processor platform is a 2-dimensional mesh array of Single Instruction Multiple Data (SIMD) cores, and it is well suited for digital communication applications. By fully extracting the parallelizable operations of the RS decoding process, we propose multiple optimization techniques to improve system throughput, including task-level parallelism across cores, data-level parallelism on each SIMD core, minimized memory access, and task mapping that minimizes route length. For RS(255, 239, 8), experimental results show that our 12-core implementation achieves a throughput of 4.35 Gbps, which is much better than several other published implementations. The results suggest that, with our approach, throughput scales linearly with the number of cores.

  • A Cost-Efficient LDPC Decoder for DVB-S2 with the Solution to Address Conflict Issue

    Yan YING  Dan BAO  Zhiyi YU  Xiaoyang ZENG  Yun CHEN  

     
    PAPER-Digital Signal Processing

      Vol:
    E93-A No:8
      Page(s):
    1415-1424

    In this paper, a cost-efficient LDPC decoder for DVB-S2 is presented. Based on the Normalized Min-Sum algorithm and the turbo-decoding message-passing (TDMP) algorithm, a dual line-scan scheduling is proposed to enable hardware reuse. Furthermore, we present a solution to the address conflict issue caused by the structure of the parity-check matrix defined for DVB-S2 LDPC codes. Based on the SMIC 0.13 µm standard CMOS process, the LDPC decoder has an area of 12.51 mm². The operating frequency required to meet the throughput requirement of 135 Mbps with a maximum of 30 iterations is 105 MHz. Compared with the latest published DVB-S2 LDPC decoder, the proposed decoder reduces area cost by 34%.
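    As a rough illustration of the Normalized Min-Sum algorithm named in this abstract, the check-node update can be sketched in software as below. This is a minimal sketch, not the decoder's hardware datapath: the function name and the normalization factor `alpha = 0.75` are assumptions chosen for illustration and do not come from the paper.

```python
def nms_check_node_update(incoming, alpha=0.75):
    """Normalized Min-Sum update for a single check node.

    incoming: variable-to-check LLR messages (at least two values).
    alpha:    normalization factor; 0.75 is an assumed example value.
    Returns the check-to-variable message for every connected edge.
    """
    signs = [-1 if v < 0 else 1 for v in incoming]
    mags = [abs(v) for v in incoming]

    # Product of all signs; the sign for edge i is recovered by
    # multiplying by signs[i] again, because each sign is +1 or -1.
    total_sign = 1
    for s in signs:
        total_sign *= s

    # Only the smallest and second-smallest magnitudes are needed.
    min1_idx = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[min1_idx]
    min2 = min(m for i, m in enumerate(mags) if i != min1_idx)

    outgoing = []
    for i in range(len(incoming)):
        # Exclude edge i itself: use min2 where min1 came from edge i.
        mag = min2 if i == min1_idx else min1
        outgoing.append(alpha * total_sign * signs[i] * mag)
    return outgoing


messages = nms_check_node_update([2.0, -1.0, 4.0, -3.0])
print(messages)
```

    Keeping only the two smallest magnitudes per check node is what makes min-sum variants attractive for area-constrained decoders, since the outgoing messages need not be stored individually.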

  • A Flexible LDPC Decoder Architecture Supporting TPMP and TDMP Decoding Algorithms

    Shuangqu HUANG  Xiaoyang ZENG  Yun CHEN  

     
    PAPER-Application

      Vol:
    E95-D No:2
      Page(s):
    403-412

    In this paper, a programmable and area-efficient decoder architecture supporting two decoding algorithms for Block-LDPC codes is presented. The novel decoder can be configured to decode in either TPMP or TDMP mode according to the Block-LDPC code in use, essentially combining the advantages of the two decoding algorithms. With a regular and scalable data-path, a Reconfigurable Serial Processing Engine (RSPE) is proposed to achieve area efficiency. To verify the proposed architecture, a flexible LDPC decoder fully compliant with IEEE 802.16e is implemented in a 130 nm 1P8M CMOS technology with a total area of 6.3 mm² and a maximum operating frequency of 250 MHz. The chip dissipates 592 mW when operating at 250 MHz with a 1.2 V supply.

  • CCTSS: The Combination of CNN and Transformer with Shared Sublayer for Detection and Classification

    Aorui GOU  Jingjing LIU  Xiaoxiang CHEN  Xiaoyang ZENG  Yibo FAN  

     
    PAPER-Image

      Publicized:
    2023/07/06
      Vol:
    E107-A No:1
      Page(s):
    141-156

    Convolutional Neural Networks (CNNs) and Transformers have achieved remarkable performance in detection and classification tasks. Nevertheless, their feature extraction cannot consider both local and global information, so the detection and classification performance can be further improved. In addition, deep learning networks are becoming increasingly complex, and the amount of computation and storage space they require is increasing significantly. This paper proposes a combination of CNN and Transformer, and designs a local feature enhancement module and a global context modeling module to enhance the cascade network. While the local feature enhancement module increases the range of feature extraction, the global context modeling module captures the global information of the feature maps. To decrease model complexity, a shared sublayer is designed to share weight parameters between adjacent convolutional layers or across convolutional layers, thereby reducing the number of convolutional weight parameters. Moreover, to effectively improve the detection performance of neural networks without increasing network parameters, the optimal transport assignment approach is proposed to resolve the problem of label assignment. The classification loss and regression loss are the summations of the cost between the demander and supplier. The experimental results demonstrate that the proposed Combination of CNN and Transformer with Shared Sublayer (CCTSS) performs better than state-of-the-art methods on various datasets and applications.

  • Design Approach and Implementation of Application Specific Instruction Set Processor for SHA-3 BLAKE Algorithm

    Yuli ZHANG  Jun HAN  Xinqian WENG  Zhongzhu HE  Xiaoyang ZENG  

     
    PAPER-Electronic Circuits

      Vol:
    E95-C No:8
      Page(s):
    1415-1426

    This paper presents an Application Specific Instruction-set Processor (ASIP) for the SHA-3 BLAKE algorithm family, built by adding instruction set extensions (ISE) to a RISC (reduced instruction set computer) processor. Through design space exploration of this ASIP to increase performance and reduce area cost, we accomplish an efficient hardware and software implementation of the BLAKE algorithm. The special instructions and their well-matched hardware function unit speed up the key section of the algorithm, namely the G-functions. In addition, relaxing the timing constraint of the special function unit decreases its hardware cost while keeping the high data throughput of the processor. Evaluation results show that the ASIP achieves 335 Mbps and 176 Mbps for BLAKE-256 and BLAKE-512, respectively. The extra area cost is only 8.06k equivalent gates. The proposed ASIP outperforms several software approaches on various platforms in cycles per byte. In fact, both the high throughput and the low hardware cost achieved by this programmable processor are comparable to those of ASIC implementations.

  • An Attention Nested U-Structure Suitable for Salient Ship Detection in Complex Maritime Environment

    Weina ZHOU  Ying ZHOU  Xiaoyang ZENG  

     
    PAPER-Information Network

      Publicized:
    2022/03/23
      Vol:
    E105-D No:6
      Page(s):
    1164-1171

    Salient ship detection plays an important role in ensuring the safety of maritime transportation and navigation. However, due to the influence of waves, special weather, and illumination on the sea, existing saliency methods are still unable to achieve effective ship detection in a complex marine environment. To solve this problem, this paper proposes a novel saliency method based on an attention nested U-Structure (AU2Net). First, to make up for the shortcomings of the U-shaped structure, the pyramid pooling module (PPM) and global guidance paths (GGPs) are designed to guide the restoration of feature information. Then, attention modules are added to the nested U-shaped structure to further refine the target characteristics. Finally, multi-level features and global context features are integrated through the feature aggregation module (FAM) to improve the ability to locate targets. Experimental results demonstrate that the proposed method achieves up to a 36.75% improvement in F-measure (Favg) compared with other state-of-the-art methods.

  • Obstacle Detection for Unmanned Surface Vehicles by Fusion Refinement Network

    Weina ZHOU  Xinxin HUANG  Xiaoyang ZENG  

     
    PAPER-Information Network

      Publicized:
    2022/05/12
      Vol:
    E105-D No:8
      Page(s):
    1393-1400

    As a kind of marine vehicle, Unmanned Surface Vehicles (USVs) are widely used in military and civilian fields because of their low cost, good concealment, strong mobility and high speed. High-precision obstacle detection plays an important role in USV autonomous navigation, since subsequent path planning depends on it. To further improve obstacle detection performance, we propose an encoder-decoder architecture named Fusion Refinement Network (FRN). The encoder, with its deeper network structure, extracts richer visual features. In particular, a dilated convolution layer is used in the encoder to capture obstacle features over a large range in complex marine environments. The decoder performs multi-path feature fusion. Attention Refinement Modules (ARM) are added to optimize features, and a learnable fusion algorithm called Feature Fusion Module (FFM) is used to fuse visual information. Experimental validation on three different datasets of real marine images shows that FRN is superior to state-of-the-art semantic segmentation networks in performance evaluation. The MIoU and MPA of FRN reach up to 97.01% and 98.37%, respectively. Moreover, FRN maintains high accuracy with only 27.67M parameters, far fewer than the latest obstacle detection network for USVs (WaSR).

  • A Flexible Architecture for TURBO and LDPC Codes

    Yun CHEN  Yuebin HUANG  Chen CHEN  Changsheng ZHOU  Xiaoyang ZENG  

     
    LETTER-High-Level Synthesis and System-Level Design

      Vol:
    E95-A No:12
      Page(s):
    2392-2395

    Turbo codes and LDPC (Low-Density Parity-Check) codes are two of the most powerful error correction codes that can approach the Shannon limit in many communication systems. However, few architectures have been presented that support both LDPC and Turbo codes, especially as ASICs. This paper implements a common architecture that can decode both LDPC and Turbo codes and supports the WiMAX, WiFi, and 3GPP-LTE standards on the same hardware. We describe in detail how memory and logic are shared across the different operation modes. The chip is designed in a 130 nm CMOS technology, and the maximum clock frequency reaches 160 MHz. The maximum throughput is about 104 Mbps at 5.5 iterations for Turbo codes and 136 Mbps at 10 iterations for LDPC codes. Compared with other existing architectures, the design has significant advantages in speed and area.

  • A Micro-Code-Based IME Engine for HEVC and Its Hardware Implementation

    Leilei HUANG  Yibo FAN  Chenhao GU  Xiaoyang ZENG  

     
    PAPER-Integrated Electronics

      Vol:
    E102-C No:10
      Page(s):
    756-765

    The High Efficiency Video Coding (HEVC) standard is becoming one of the most widespread video coding standards in the world. As the successor of the H.264 standard, it aims to provide much better encoding performance. To fulfill this goal, the standard introduces several new notations along with the corresponding computation processes. Among those computation processes, integer motion estimation (IME) is one of the bottlenecks due to the complex partitions of the inter prediction units (PU) and the large search window commonly adopted. Many algorithms have been proposed to address this issue, and they usually rely on a large search window and a large amount of computation. However, the coding effort should be related to the scene. More specifically, for relatively static videos, a small search window along with a simple search scheme should be adopted to reduce time cost and power consumption. In view of this, a micro-code-based IME engine is proposed in this paper, which can be used with search schemes of different complexity. To test the performance, three different search schemes based on this engine are designed and evaluated under HEVC test model (HM) 16.9, achieving a B-D rate increase of 0.55/-0.07/-0.14%. Compared with our previous work, the hardware implementation is optimized to reduce the SRAM bits by 64.2% and the logic gate count by 32.8%. The final design supports 4K×2K video at 139/85/37 fps when running at 500 MHz.

  • A Unified Forward/Inverse Transform Architecture for Multi-Standard Video Codec Design

    Sha SHEN  Weiwei SHEN  Yibo FAN  Xiaoyang ZENG  

     
    PAPER-Digital Signal Processing

      Vol:
    E96-A No:7
      Page(s):
    1534-1542

    This paper describes a unified VLSI architecture which can be applied to the various types of transforms used in MPEG-2/4, H.264, VC-1, AVS and the emerging video coding standard named HEVC (High Efficiency Video Coding). A novel design named configurable butterfly array (CBA) is also proposed to support both the forward transform and the inverse transform in this unified architecture. The Hadamard transform and 4/8-point DCT/IDCT are used in traditional video coding standards, while 16/32-point DCT/IDCT are newly introduced in HEVC. The proposed architecture supports all of these transform types. Two levels of hardware sharing (architecture level and block level) are adopted in this design. At the architecture level, the forward transform shares hardware resources with the inverse transform. At the block level, the hardware for a smaller transform is recursively reused by larger transforms. The multiplications of the 4- or 8-point transforms are implemented with multiplierless MCM (Multiple Constant Multiplication). To reduce hardware overhead, the multiplications of the 16/32-point DCT are implemented with ICM (input-muxed constant multipliers) instead of MCM or regular multipliers. The proposed design is 51% more area efficient than previous work. To the authors' knowledge, this is the first published work to support both forward and inverse 4/8/16/32-point integer transforms for the HEVC standard in a unified architecture.
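    The block-level sharing described above, where a larger transform reuses the hardware of a smaller one, follows from the even/odd butterfly decomposition of the transform matrix. The sketch below illustrates this in software for the HEVC 8-point core transform, whose even-indexed outputs are exactly a 4-point transform of the folded input. It is only an illustration under stated assumptions: the function names are hypothetical, the scaling and rounding shifts of a real codec are omitted, and it does not model the paper's CBA, MCM or ICM circuits.

```python
# HEVC 8-point core transform matrix (integer coefficients, no scaling).
T8 = [
    [64, 64, 64, 64, 64, 64, 64, 64],
    [89, 75, 50, 18, -18, -50, -75, -89],
    [83, 36, -36, -83, -83, -36, 36, 83],
    [75, -18, -89, -50, 50, 89, 18, -75],
    [64, -64, -64, 64, 64, -64, -64, 64],
    [50, -89, 18, 75, -75, -18, 89, -50],
    [36, -83, 83, -36, -36, 83, -83, 36],
    [18, -50, 75, -89, 89, -75, 50, -18],
]

# Odd-row coefficients of T8, applied to the differences x[i] - x[7 - i].
ODD8 = [
    [89, 75, 50, 18],
    [75, -18, -89, -50],
    [50, -89, 18, 75],
    [18, -50, 75, -89],
]


def dct4_butterfly(x):
    """4-point forward core transform using one butterfly stage."""
    e0, e1 = x[0] + x[3], x[1] + x[2]
    o0, o1 = x[0] - x[3], x[1] - x[2]
    return [64 * (e0 + e1), 83 * o0 + 36 * o1,
            64 * (e0 - e1), 36 * o0 - 83 * o1]


def dct8_butterfly(x):
    """8-point forward core transform that reuses the 4-point transform."""
    even_in = [x[i] + x[7 - i] for i in range(4)]
    odd_in = [x[i] - x[7 - i] for i in range(4)]
    even_out = dct4_butterfly(even_in)  # block-level reuse of the 4-point unit
    odd_out = [sum(c * v for c, v in zip(row, odd_in)) for row in ODD8]
    y = [0] * 8
    y[0::2] = even_out  # outputs 0, 2, 4, 6
    y[1::2] = odd_out   # outputs 1, 3, 5, 7
    return y


def dct8_matrix(x):
    """Reference result: direct matrix-vector product with T8."""
    return [sum(c * v for c, v in zip(row, x)) for row in T8]


sample = [1, 2, 3, 4, 5, 6, 7, 8]
print(dct8_butterfly(sample) == dct8_matrix(sample))
```

    The same folding step applies again at 16 and 32 points, which is why a design can nest the smaller transforms inside the larger ones rather than instantiating each size separately.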

  • Efficient Implementation of OFDM Inner Receiver on a Programmable Multi-Core Processor Platform

    Wenhua FAN  Chen CHEN  Yun CHEN  Zhiyi YU  Xiaoyang ZENG  

     
    PAPER

      Vol:
    E95-B No:4
      Page(s):
    1241-1248

    This paper presents an efficient implementation of an OFDM inner receiver on a programmable multi-core processor platform, with CMMB as the application. The platform consists of an array of programmable SIMD processors interconnected in a 2-D mesh network, which provides high performance and is well suited to wireless communication applications. Implemented on one cluster with 8 cores, the receiver includes symbol timing, carrier frequency offset and sampling frequency offset synchronization, channel estimation and equalization. Multiple optimization techniques are explored to improve system throughput, such as task-level parallelism across many cores, data-level parallelism on SIMD cores, minimization of memory access, and task mapping that minimizes route length. In addition, an efficient memory strategy and specific instructions for complex computation increase performance. The simulation results show that the inner receiver can achieve a throughput of up to 120 Mbps when operating at 750 MHz.

  • A 1.5 Gb/s Highly Parallel Turbo Decoder for 3GPP LTE/LTE-Advanced

    Yun CHEN  Xubin CHEN  Zhiyuan GUO  Xiaoyang ZENG  Defeng HUANG  

     
    LETTER-Fundamental Theories for Communications

      Vol:
    E96-B No:5
      Page(s):
    1211-1214

    A highly parallel turbo decoder for 3GPP LTE/LTE-Advanced systems is presented. It consists of 32 radix-4 soft-in/soft-out (SISO) decoders. Each SISO decoder is based on the proposed full-parallel sliding window (SW) schedule. Implemented in a 0.13 µm CMOS technology, the proposed design occupies 12.96 mm² and achieves 1.5 Gb/s while decoding size-6144 blocks with 5.5 iterations. Compared with the conventional SW schedule, the throughput is improved by 30–76% with 19.2% area overhead and negligible energy overhead.

  • A High-Throughput and Compact Hardware Implementation for the Reconstruction Loop in HEVC Intra Encoding

    Yibo FAN  Leilei HUANG  Zheng XIE  Xiaoyang ZENG  

     
    PAPER-Integrated Electronics

      Vol:
    E100-C No:6
      Page(s):
    643-654

    In the newly finalized video coding standard, namely high efficiency video coding (HEVC), new notations like coding unit (CU), prediction unit (PU) and transformation unit (TU) are introduced to improve the coding performance. As a result, the reconstruction loop in intra encoding is heavily burdened with choosing the best partitions or modes for them. To address the bottlenecks in cycle count and hardware cost, this paper proposes a high-throughput and compact implementation of such a reconstruction loop. Here, “high-throughput” means that the design has a fixed throughput of 32 pixels/cycle independent of the TU/PU size (except for 4×4 TUs). “Compact” means that it fully exploits the reusability between the discrete cosine transform (DCT) and the inverse discrete cosine transform (IDCT), as well as that between quantization (Q) and de-quantization (IQ). Besides the contributions made in designing the related hardware, this paper also provides a universal formula to analyze the cycle cost of the reconstruction loop and proposes a parallel-process scheme to further reduce the cycle cost. The design is verified on a Stratix IV FPGA. The basic structure achieves a maximum frequency of 150 MHz at a hardware cost of 64K ALUTs, which can support real-time TU/PU partition decisions for 4K×2K@20fps video.

  • A Scalable and Reconfigurable Fault-Tolerant Distributed Routing Algorithm for NoCs

    Zewen SHI  Xiaoyang ZENG  Zhiyi YU  

     
    PAPER-Computer System

      Vol:
    E94-D No:7
      Page(s):
    1386-1397

    Manufacturing defects in the deep sub-micron VLSI process and aging-induced device problems over the product lifecycle are inevitable, so fault-tolerant routing algorithms are important for providing the required communication in NoCs in spite of failures. The proposed algorithm, referred to as scalable and reconfigurable fault-tolerant distributed routing (RFDR), partitions the system into nine regions using the concept of divide-and-conquer. It is a distributed algorithm in which each router guarantees fault tolerance within its own region, and the system can still be sustained with multiple fault areas. The proposed RFDR has excellent scalability, with hardware cost remaining constant regardless of system size. It is also completely reconfigurable when new nodes fail. Simulations under various synthetic traffic patterns show better performance than the Extended-XY routing algorithm. Moreover, there is almost no hardware overhead compared to Logic-Based Distributed Routing (LBDR), while the fault-tolerance capacity is enhanced. Hardware cost is reduced by 37% compared to Reconfigurable Distributed Scalable Predictable Interconnect Network (R-DSPIN), which supports only a single fault region.

  • An 8×8/4×4 Adaptive Hadamard Transform Based FME VLSI Architecture for 4K×2K H.264/AVC Encoder

    Yibo FAN  Jialiang LIU  Dexue ZHANG  Xiaoyang ZENG  Xinhua CHEN  

     
    PAPER

      Vol:
    E95-C No:4
      Page(s):
    447-455

    Fidelity Range Extension (FRExt) (i.e. High Profile) was added to the H.264/AVC recommendation in the second version. One of the features included in FRExt is the Adaptive Block-size Transform (ABT). To conform to FRExt, a Fractional Motion Estimation (FME) architecture is proposed that supports the 8×8/4×4 adaptive Hadamard Transform (8×8/4×4 AHT). The 8×8/4×4 AHT circuit contributes to higher throughput and encoding performance. To increase the utilization of the SATD (Sum of Absolute Transformed Difference) Generator (SG) per unit time, the proposed architecture employs two 8-pel interpolators (IP) that time-share one SG. The two IPs work in turn to supply data to the SG continuously, which increases data throughput and significantly reduces the number of cycles needed to process one macroblock. Furthermore, the architecture exploits the linearity of the Hadamard Transform to generate the quarter-pel SATD. This helps to shorten the long datapath in the second step of the two-iteration FME algorithm. Finally, experimental results show that the architecture can serve applications with different performance requirements by adjusting the supported modes and the operating frequency. It supports real-time encoding of seven-mode 4K×2K@24 fps or six-mode 4K×2K@30 fps video sequences.
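    The linearity property mentioned in this abstract can be checked numerically: the 2-D Hadamard transform of a sum of residual blocks equals the sum of their transforms, so transformed values at interpolated positions can in principle be derived from already-transformed neighbours. The sketch below assumes a 4×4 block size and unnormalized transforms, uses hypothetical helper names, and ignores the rounding that interpolation introduces. It is not the paper's circuit.

```python
# 4x4 Hadamard matrix (symmetric, entries are +1 or -1).
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def hadamard2d(block):
    """Unnormalized 2-D Hadamard transform H * X * H (H4 is symmetric)."""
    return matmul(matmul(H4, block), H4)


def satd(block):
    """Sum of absolute transformed differences of a 4x4 residual block."""
    return sum(abs(v) for row in hadamard2d(block) for v in row)


def add(a, b):
    """Element-wise sum of two equally sized blocks."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


# Two arbitrary residual blocks (example data only).
res_a = [[3, -1, 0, 2], [1, 4, -2, 0], [0, 0, 5, -3], [2, 1, -1, 1]]
res_b = [[-2, 0, 1, 1], [3, -3, 0, 2], [1, 1, -4, 0], [0, 2, 2, -1]]

# Linearity: transform of the sum equals the sum of the transforms.
print(hadamard2d(add(res_a, res_b)) == add(hadamard2d(res_a), hadamard2d(res_b)))
```

    Note that SATD itself is not linear because of the absolute value, so only the transform stage can be shared in this way. The absolute sum still has to be taken per candidate.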

  • A High Speed Reconfigurable Face Detection Architecture Based on AdaBoost Cascade Algorithm

    Weina ZHOU  Lin DAI  Yao ZOU  Xiaoyang ZENG  Jun HAN  

     
    PAPER-Application

      Vol:
    E95-D No:2
      Page(s):
    383-391

    Face detection has become an independent technology that plays an important role in more and more fields, which makes it necessary and urgent for its architecture to be reconfigurable to meet different demands on detection capability. This paper proposes a face detection architecture that the user can adjust according to the background, the sensor resolution, and the required detection accuracy and speed in different situations. This user-adjustable mode makes reconfiguration simple and efficient, and is especially suitable for portable mobile terminals whose working conditions change frequently. In addition, the architecture can work as an accelerator within a larger and more powerful system integrated with other functional modules. Experimental results show that the reconfiguration of the architecture is effective for face detection, and the synthesis report indicates low area and power consumption.

  • Optimized 2-D SAD Tree Architecture of Integer Motion Estimation for H.264/AVC

    Yibo FAN  Xiaoyang ZENG  Satoshi GOTO  

     
    PAPER

      Vol:
    E94-C No:4
      Page(s):
    411-418

    Integer Motion Estimation (IME) accounts for much of the computation in an H.264/AVC video encoder. The 2-D SAD tree IME architecture provides very high performance for the encoder and has been used in many video codec designs. This paper proposes an optimized hardware design of the 2-D SAD tree IME. First, a new hardware architecture is proposed to reduce on-chip memory size. Second, a new search pattern is proposed to fully use memory bandwidth and reduce external memory access. Third, the data-path is redesigned, and the performance is greatly improved. For comparison with other IME designs, an IME design supporting D1 size at 30 fps with a search range of [32, 32] is implemented. The hardware cost of this design is 118 KGates and 8 Kb of SRAM, and the maximum clock frequency is 200 MHz. Compared to the original 2-D SAD tree IME, our design saves 87.5% of the on-chip memory and achieves 3 times the performance of the original. Our design provides a new way to build a low-cost, high-performance IME for H.264/AVC encoders.
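    For readers unfamiliar with the cost that an IME engine evaluates, the sketch below shows a plain software full search using the sum of absolute differences (SAD) as the matching cost. It is a reference illustration only: the function names, the tiny frame size and the search range are assumptions, and it does not reproduce the paper's 2-D SAD tree, search pattern or memory organization.

```python
def sad(cur, ref, rx, ry, n):
    """SAD between an n x n current block and the reference block at (rx, ry)."""
    return sum(abs(cur[y][x] - ref[ry + y][rx + x])
               for y in range(n) for x in range(n))


def full_search(cur, ref, cx, cy, n, search_range):
    """Exhaustive integer-pel search around (cx, cy).

    Returns (best_sad, dx, dy), or None if no candidate fits in the frame.
    Candidates that fall outside the reference frame are skipped.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur, ref, rx, ry, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Example data: an 8x8 reference frame with a unique value at every pixel,
# and a 2x2 current block copied from reference position (x=3, y=2).
ref_frame = [[y * 8 + x for x in range(8)] for y in range(8)]
cur_block = [row[3:5] for row in ref_frame[2:4]]

# Searching around (2, 2) should find the block one pixel to the right.
print(full_search(cur_block, ref_frame, cx=2, cy=2, n=2, search_range=2))
```

    Every candidate position repeats the same absolute-difference and accumulate pattern, which is why hardware designs parallelize it with adder trees and why memory bandwidth becomes the limiting factor.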

  • An Area-Efficient Reconfigurable LDPC Decoder with Conflict Resolution

    Changsheng ZHOU  Yuebin HUANG  Shuangqu HUANG  Yun CHEN  Xiaoyang ZENG  

     
    PAPER

      Vol:
    E95-C No:4
      Page(s):
    478-486

    Based on the Turbo-Decoding Message-Passing (TDMP) and Normalized Min-Sum (NMS) algorithms, an area-efficient LDPC decoder that supports both structured and unstructured LDPC codes is proposed in this paper. We introduce a solution to the memory access conflict problem caused by the TDMP algorithm. We also arrange the main timing schedule carefully so that the operations of our solution are handled without much additional hardware. To reduce the number of memory bits needed, the extrinsic message storing strategy is also optimized. In addition, the extrinsic message recovery and the accumulate operation are merged. To verify our architecture, an LDPC decoder that supports both the China Multimedia Mobile Broadcasting (CMMB) and Digital Terrestrial/Television Multimedia Broadcasting (DTMB) standards is developed using the SMIC 0.13 µm standard CMOS process. The core area is 4.75 mm² and the maximum operating clock frequency is 200 MHz. The estimated power consumption is 48.4 mW at 25 MHz for CMMB and 130.9 mW at 50 MHz for DTMB with 5 iterations and a 1.2 V supply.

  • A Novel Five-Point Algorithm of Phase Noise Cancellation in DTMB

    Yun CHEN  Xiaoyang ZENG  An PAN  Jing WANG  

     
    LETTER-Digital Signal Processing

      Vol:
    E90-A No:11
      Page(s):
    2608-2611

    A novel five-point algorithm to remove phase noise in the Chinese digital terrestrial media broadcasting system is proposed under the assumption that the bandwidth of the phase noise is narrow. Simulation results demonstrate that the proposed method provides gains of 1-3 dB in AWGN and 1-4 dB in multi-path channels compared with the uncompensated case.

  • Efficient Iterative Frequency Domain Equalization for Single Carrier System with Insufficient Cyclic Prefix

    Chuan WU  Dan BAO  Xiaoyang ZENG  Yun CHEN  

     
    LETTER-Wireless Communication Technologies

      Vol:
    E94-B No:7
      Page(s):
    2174-2177

    In this letter, we present an efficient iterative frequency domain equalization scheme for single-carrier (SC) transmission systems with insufficient cyclic prefix (CP). Based on the minimum mean square error (MMSE) criterion, iterative decision feedback frequency domain equalization (IDF-FDE) combined with cyclic prefix reconstruction (CPR) is derived to mitigate inter-symbol interference (ISI) and inter-carrier interference (ICI). Computer simulation results reveal that the proposed scheme significantly improves the performance of SC systems with insufficient CP compared with previous schemes.
