Author Search Result

[Author] Yuanwu LEI(3hit)

1-3hit
  • Window Memory Layout Scheme for Alternate Row-Wise/Column-Wise Matrix Access

    Lei GUO  Yuhua TANG  Yong DOU  Yuanwu LEI  Meng MA  Jie ZHOU  

     
    PAPER-Computer System

      Vol:
    E96-D No:12
      Page(s):
    2765-2775

    The effective bandwidth of the dynamic random-access memory (DRAM) for the alternate row-wise/column-wise matrix access (AR/CMA) mode, which is a basic characteristic in scientific and engineering applications, is very low. Therefore, we propose the window memory layout scheme (WMLS), which is a matrix layout scheme that does not require transposition, for AR/CMA applications. This scheme maps one row of a logical matrix into a rectangular memory window of the DRAM to balance the bandwidth of the row- and column-wise matrix access and to increase the DRAM IO bandwidth. The optimal window configuration is theoretically analyzed to minimize the total number of no-data-visit operations of the DRAM. Different WMLS implementationsare presented according to the memory structure of field-programmable gata array (FPGA), CPU, and GPU platforms. Experimental results show that the proposed WMLS can significantly improve DRAM bandwidth for AR/CMA applications. achieved speedup factors of 1.6× and 2.0× are achieved for the general-purpose CPU and GPU platforms, respectively. For the FPGA platform, the WMLS DRAM controller is custom. The maximum bandwidth for the AR/CMA mode reaches 5.94 GB/s, which is a 73.6% improvement compared with that of the traditional row-wise access mode. Finally, we apply WMLS scheme for Chirp Scaling SAR application, comparing with the traditional access approach, the maximum speedup factors of 4.73X, 1.33X and 1.56X can be achieved for FPGA, CPU and GPU platform, respectively.

  • FPGA-Specific Custom VLIW Architecture for Arbitrary Precision Floating-Point Arithmetic

    Yuanwu LEI  Yong DOU  Jie ZHOU  

     
    PAPER-Computer System

      Vol:
    E94-D No:11
      Page(s):
    2173-2183

    Many scientific applications require efficient variable-precision floating-point arithmetic. This paper presents a special-purpose Very Large Instruction Word (VLIW) architecture for variable precision floating-point arithmetic (VV-Processor) on FPGA. The proposed processor uses a unified hardware structure, equipped with multiple custom variable-precision arithmetic units, to implement various variable-precision algebraic and transcendental functions. The performance is improved through the explicitly parallel technology of VLIW instruction and by dynamically varying the precision of intermediate computation. We take division and exponential function as examples to illustrate the design of variable-precision elementary algorithms in VV-Processor. Finally, we create a prototype of VV-Processor unit on a Xilinx XC6VLX760-2FF1760 FPGA chip. The experimental results show that one VV-Processor unit, running at 253 MHz, outperforms the approach of a software-based library running on an Intel Core i3 530 CPU at 2.93 GHz by a factor of 5X-37X for basic variable-precision arithmetic operations and elementary functions.

  • Design and Implementation of the Parameterized Multi-Standard High-Throughput Radix-4 Viterbi Decoder on FPGA

    Rongchun LI  Yong DOU  Yuanwu LEI  Shice NI  Song GUO  

     
    PAPER-Fundamental Theories for Communications

      Vol:
    E95-B No:5
      Page(s):
    1602-1611

    This paper presents a parameterized multi-standard adaptive radix-4 Viterbi decoder with high throughput and low complexity. The proposed Viterbi decoder supports constraint lengths ranging from 3-9, code rates in the range of 1/2-1/3, and arbitrary truncation lengths. We present a novel fabric of Add-Compare-Select Unit (ACSU) and methods of unsigned quantization and efficient normalization that shorten the critical path. The decoder achieves a low bit error ratio in multiple standards, such as GPRS, WiMax, LTE, CDMA, and 3G. The proposed decoder is implemented on Xilinx XC5VLX330 device and the frequency achieved is 181.7 MHz. The throughput of the proposed decoder can reach 363 Mbps, which is superior to the other current multi-standard Viterbi decoders or radix-4 Viterbi decoders on the FPGA platform.

FlyerIEICE has prepared a flyer regarding multilingual services. Please use the one in your native language.