Keyword Search Result

[Keyword] accelerator (32 hits)

Showing hits 1-20 (of 32)

  • MITA: Multi-Input Adaptive Activation Function for Accurate Binary Neural Network Hardware

    Peiqi ZHANG  Shinya TAKAMAEDA-YAMAZAKI  

     
    PAPER

      Publicized:
    2023/05/24
      Vol:
    E106-D No:12
      Page(s):
    2006-2014

    Binary Neural Networks (BNNs) binarize neuron and connection values, so their accelerators can be realized with extremely efficient hardware. However, there is a significant accuracy gap between BNNs and networks of wider bit-width. Conventional BNNs binarize feature maps by static, globally-unified thresholds, which makes the produced bipolar image lose local details. This paper proposes a multi-input activation function to enable adaptive thresholding for binarizing feature maps: (a) At the algorithm level, instead of treating each input pixel independently, adaptive thresholding dynamically changes the threshold according to the surrounding pixels of the target pixel. When optimizing weights, adaptive thresholding is equivalent to an accompanying depth-wise convolution inserted between the normal convolution and binarization. The accompanying weights in the depth-wise filters are ternarized and optimized end-to-end. (b) At the hardware level, adaptive thresholding is realized through a multi-input activation function, which is compatible with common accelerator architectures. Compact activation hardware with only one extra accumulator is devised. Implemented on an FPGA, the proposed method improves the accuracy of the original BNN by 4.1% with only 1.1% extra LUT resources. Compared with state-of-the-art methods, the proposed idea further increases network accuracy by 0.8% on the CIFAR-10 dataset and 0.4% on the ImageNet dataset.
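
    For intuition, the following is a minimal NumPy sketch of the adaptive-thresholding idea: a ternary depth-wise filter over each pixel's neighborhood produces a per-pixel threshold, and binarization compares the feature map against it. The 3×3 window, the edge padding, and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def adaptive_binarize(fmap, ternary_w):
    """Binarize a feature map with per-pixel adaptive thresholds.

    fmap:      (C, H, W) real-valued feature map (output of a normal conv).
    ternary_w: (C, 3, 3) depth-wise filters with values in {-1, 0, +1};
               each filter shifts the threshold according to the 3x3
               neighborhood of the target pixel (assumed window size).
    """
    C, H, W = fmap.shape
    padded = np.pad(fmap, ((0, 0), (1, 1), (1, 1)), mode="edge")
    thresh = np.zeros_like(fmap)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                # Depth-wise convolution over the 3x3 neighborhood
                thresh[c, i, j] = np.sum(padded[c, i:i+3, j:j+3] * ternary_w[c])
    # A pixel becomes +1 only if it exceeds its locally adapted threshold
    return np.where(fmap > thresh, 1, -1).astype(np.int8)
```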

  • Implementation of Fully-Pipelined CNN Inference Accelerator on FPGA and HBM2 Platform

    Van-Cam NGUYEN  Yasuhiko NAKASHIMA  

     
    PAPER-Computer System

      Publicized:
    2023/03/17
      Vol:
    E106-D No:6
      Page(s):
    1117-1129

    Deep convolutional neural network (CNN) inference accelerators on the field-programmable gate array (FPGA) platform have been widely adopted due to their low power consumption and high performance. In this paper, we develop the following techniques to improve performance and power efficiency. First, we use high bandwidth memory (HBM) to expand the bandwidth of data transmission between the off-chip memory and the accelerator. Second, a fully-pipelined design, consisting of pipelined inter-layer computation and a pipelined computation engine, is implemented to reduce idle time between layers. Third, a multi-core architecture with shared dual buffers is designed to reduce off-chip memory accesses and maximize throughput. We implemented the proposed accelerator on the Xilinx Alveo U280 platform in hand-written Verilog HDL, rather than the high-level synthesis used in previous works, and verified the system with the VGG-16 model. With a similar accelerator architecture, the experimental results demonstrate that the memory bandwidth of HBM is 13.2× higher than that of DDR4. In terms of throughput, our accelerator is 1.9×/1.65×/11.9× better than an FPGA+HBM2-based accelerator/a GPGPU at a low batch size of 4/a CPU at a low batch size of 4. Compared with previous DDR+FPGA/DDR+GPGPU/DDR+CPU based accelerators in terms of power efficiency, our proposed system provides a 1.4-1.7×/1.7-12.6×/6.6-37.1× improvement with a large-scale CNN model.
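
    The shared dual-buffer (ping-pong) technique behind the pipelined design can be sketched generically: while the compute engine works on one buffer, the next tile is fetched from off-chip memory into the other. This is a hedged software analogue under assumed interfaces; load_tile and compute_tile are hypothetical stand-ins, not the paper's modules.

```python
import threading

def pipelined_inference(num_tiles, load_tile, compute_tile):
    """Ping-pong (dual) buffering: overlap loading tile i+1 with computing tile i."""
    buffers = [None, None]               # two shared on-chip buffers
    buffers[0] = load_tile(0)            # prime the first buffer
    results = []
    for i in range(num_tiles):
        prefetch = None
        if i + 1 < num_tiles:
            def fetch(idx=i + 1):        # bind idx at definition time
                buffers[idx % 2] = load_tile(idx)
            prefetch = threading.Thread(target=fetch)
            prefetch.start()             # fetch next tile in the background
        results.append(compute_tile(buffers[i % 2]))
        if prefetch is not None:
            prefetch.join()              # next buffer is ready before reuse
    return results
```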

  • Multilayer Perceptron Training Accelerator Using Systolic Array

    Takeshi SENOO  Akira JINGUJI  Ryosuke KURAMOCHI  Hiroki NAKAHARA  

     
    PAPER

      Publicized:
    2022/07/21
      Vol:
    E105-D No:12
      Page(s):
    2048-2056

    The multilayer perceptron (MLP) is a basic neural network model used in practical industrial applications, such as network intrusion detection (NID) systems, and as a building block in newer models, such as gMLP. Currently, there is demand for fast training in NID and other areas, but training with numerous GPUs raises problems of power consumption and long training times. Many of the latest deep neural network (DNN) models and MLPs are trained with the backpropagation algorithm, which transmits an error gradient from the output layer to the input layer; in this sequential computation, the next input cannot be processed until the weights of all layers have been updated from the last layer. This is known as backward locking. In this study, a weight parameter update mechanism with time delays is proposed that accommodates the weight update delay and allows simultaneous forward and backward computation. To this end, a one-dimensional systolic array structure was designed on a Xilinx Alveo U50 FPGA card in which each layer of the MLP is assigned to a processing element (PE). The time-delay backpropagation algorithm executes all layers in parallel and transfers data between layers in a pipeline. The design is 3 times faster than an Intel Core i9 CPU and 2.5 times faster than an NVIDIA RTX 3090 GPU, and its processing speed per unit of power consumption is 11.5 times that of the CPU and 21.4 times that of the GPU. From these results, it is concluded that a training accelerator on an FPGA can achieve high speed and high energy efficiency.
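
    The time-delay weight update that breaks backward locking can be sketched as follows: each layer queues incoming gradients and applies them only after a fixed number of steps, so forward and backward passes of different inputs can overlap in the pipeline. This toy model (scalar weights, delay >= 1) is an assumption-laden illustration, not the paper's PE design.

```python
from collections import deque

class DelayedUpdateLayer:
    """Toy layer applying weight gradients with a fixed time delay (>= 1),
    mimicking backward-unlocked pipelined training: the forward pass of a
    new input can start before older gradients have been applied."""

    def __init__(self, w, delay, lr=0.01):
        self.w, self.lr = w, lr
        self.pending = deque(maxlen=delay)    # gradients still in flight

    def forward(self, x):
        return self.w * x                     # stand-in for a real layer

    def backward(self, grad):
        # Apply the gradient that has waited `delay` steps, queue the new one
        if len(self.pending) == self.pending.maxlen:
            self.w -= self.lr * self.pending.popleft()
        self.pending.append(grad)
        return grad * self.w                  # error passed to previous layer
```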

  • RVCar: An FPGA-Based Simple and Open-Source Mini Motor Car System with a RISC-V Soft Processor

    Takuto KANAMORI  Takashi ODAN  Kazuki HIROHATA  Kenji KISE  

     
    PAPER

      Publicized:
    2022/08/09
      Vol:
    E105-D No:12
      Page(s):
    1999-2007

    Deep Neural Networks (DNNs) are widely used for computer vision tasks, such as image classification, object detection, and segmentation. DNN accelerators on FPGAs, especially for Convolutional Neural Networks (CNNs), are a hot topic, and more research and education are needed to boost this field. A starting point is required to make it easy for new entrants to join. We believe that FPGA-based Autonomous Driving (AD) motor cars are suitable for this purpose because DNN accelerators can be used for image processing with low latency. In this paper, we propose an FPGA-based, simple, and open-source mini motor car system named RVCar with a RISC-V soft processor and a CNN accelerator. RVCar is suitable for new entrants who want to learn how to implement a CNN accelerator and the surrounding system. The motor car consists of a Xilinx Nexys A7 board and simple parts. All modules except the CNN accelerator are implemented in Verilog HDL and SystemVerilog. The CNN accelerator is converted from a PyTorch model by our tool; it is written in C++, synthesizable by Vitis HLS, and serves as an easy-to-customize baseline for new entrants. The AD algorithms are implemented on FreeRTOS and executed on the RISC-V soft processor, which helps users develop them efficiently. We conduct a case study of a simple AD task we define; although the task is simple, it is difficult to achieve without image recognition. We confirm that RVCar can recognize objects and make correct decisions based on the results.

  • A Low-Latency Inference of Randomly Wired Convolutional Neural Networks on an FPGA

    Ryosuke KURAMOCHI  Hiroki NAKAHARA  

     
    PAPER

      Publicized:
    2021/06/24
      Vol:
    E104-D No:12
      Page(s):
    2068-2077

    Convolutional neural networks (CNNs) are widely used for image processing tasks in both embedded systems and data centers. In data centers, high accuracy and low latency are desired for various tasks such as image processing of streaming videos. We propose an FPGA-based low-latency CNN inference engine for randomly wired convolutional neural networks (RWCNNs), whose layer structures are based on random graph models. Because RWCNNs have several convolution layers with no direct dependencies between them, our architecture can process them efficiently in a pipeline. Each layer needs the calculation results of multiple layers as its input, so we use an FPGA with HBM2 to enable parallel access to the input data over multiple HBM2 channels. We schedule the order of execution of the layers to improve pipeline efficiency, build a conflict graph from the scheduling results, and then allocate the calculation results of each layer to the HBM2 channels by coloring the graph. Because the pipeline execution must be properly controlled, we developed a tool that automatically generates the hardware functions. We implemented the proposed architecture on the Alveo U50 FPGA and investigated the trade-off between latency and recognition accuracy for the ImageNet classification task by comparing inference performance for different input image sizes. Compared with a conventional accelerator for ResNet-50, our accelerator reduces latency by a factor of 2.21. We also obtained 12.6 and 4.93 times better efficiency than a CPU and a GPU, respectively. Thus, our accelerator for RWCNNs is suitable for low-latency inference.
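
    The channel-allocation step can be illustrated with a standard greedy graph coloring: layers whose output buffers are alive at the same time form an edge in the conflict graph and must not share an HBM2 channel. The sketch below is a generic coloring routine under assumed data structures, not the authors' tool.

```python
def assign_channels(num_layers, conflicts):
    """Greedy graph coloring: map each layer's output buffer to an HBM2
    channel so that no two buffers that are alive simultaneously (an edge
    in the conflict graph) share a channel.

    conflicts: iterable of (i, j) pairs of conflicting layers.
    Returns {layer: channel}; the number of distinct colors used is the
    number of channels required.
    """
    adj = {v: set() for v in range(num_layers)}
    for i, j in conflicts:
        adj[i].add(j)
        adj[j].add(i)
    channel = {}
    for v in range(num_layers):                 # visit in schedule order
        used = {channel[u] for u in adj[v] if u in channel}
        channel[v] = next(c for c in range(num_layers) if c not in used)
    return channel

# Example: layers 0 and 1 alive together, 1 and 2 alive together
print(assign_channels(3, [(0, 1), (1, 2)]))     # {0: 0, 1: 1, 2: 0}
```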

  • FCA-BNN: Flexible and Configurable Accelerator for Binarized Neural Networks on FPGA

    Jiabao GAO  Yuchen YAO  Zhengjie LI  Jinmei LAI  

     
    PAPER-Biocybernetics, Neurocomputing

      Publicized:
    2021/05/19
      Vol:
    E104-D No:8
      Page(s):
    1367-1377

    A series of Binarized Neural Networks (BNNs) achieve acceptable accuracy in image classification tasks and excellent performance on field-programmable gate arrays (FPGAs). Nevertheless, we observe that existing BNN designs are quite time-consuming to retarget to a different BNN or to accelerate a new one. Therefore, this paper presents FCA-BNN, a flexible and configurable accelerator that employs a layer-level configurable technique to seamlessly execute each layer of the target BNN. First, to save resources and improve energy efficiency, hardware-oriented optimal formulas are introduced to design an energy-efficient computing array for padded-convolution and fully-connected layers of different sizes. Moreover, to accelerate target BNNs efficiently, we exploit an analytical model to explore the optimal design parameters for FCA-BNN. Finally, our proposed mapping flow switches the target network simply by entering the network order, and accelerates a new network by compiling and loading the corresponding instructions, without generating or loading a new bitstream. Evaluations on three major BNN structures show that the difference between the inference accuracy of FCA-BNN and that of a GPU is just 0.07%, 0.31%, and 0.4% for LFC, VGG-like, and CIFAR-10 AlexNet, respectively. Furthermore, our energy efficiency reaches 0.8× that of existing customized FPGA accelerators for LFC and 2.6× for VGG-like. For CIFAR-10 AlexNet, FCA-BNN is 188.2× and 60.6× more energy-efficient than a CPU and a GPU, respectively. To the best of our knowledge, FCA-BNN is the most efficient design for retargeting and accelerating new BNNs while maintaining competitive performance.
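
    At the heart of any BNN computing array is the XNOR-popcount dot product, which replaces multiply-accumulate on ±1 values. The following minimal sketch shows the standard kernel; the bit encoding (+1 as 1, -1 as 0) is the usual convention, though the paper's exact data path is not specified here.

```python
def bnn_dot(x_bits, w_bits, n):
    """Binary dot product via XNOR + popcount, the core BNN operation.

    x_bits, w_bits: n-bit integers whose bits encode +1 (1) / -1 (0).
    A bitwise XNOR marks matching positions; the dot product is then
    (+1 per match, -1 per mismatch) = 2 * popcount - n.
    """
    mask = (1 << n) - 1
    matches = bin((~(x_bits ^ w_bits)) & mask).count("1")
    return 2 * matches - n

# Example: x = (+1,-1,+1,+1), w = (+1,+1,+1,-1) -> dot product 0
print(bnn_dot(0b1011, 0b1110, 4))
```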

  • Weight Compression MAC Accelerator for Effective Inference of Deep Learning Open Access

    Asuka MAKI  Daisuke MIYASHITA  Shinichi SASAKI  Kengo NAKATA  Fumihiko TACHIBANA  Tomoya SUZUKI  Jun DEGUCHI  Ryuichi FUJIMOTO  

     
    PAPER-Integrated Electronics

      Publicized:
    2020/05/15
      Vol:
    E103-C No:10
      Page(s):
    514-523

    Many studies of deep neural networks have reported inference accelerators for improved energy efficiency. We propose methods for further improving energy efficiency while maintaining recognition accuracy, developed by co-designing a filter-by-filter quantization scheme with variable bit precision and a hardware architecture that fully supports it. Filter-wise quantization reduces the average bit precision of the weights, so inference execution time and energy consumption are reduced in proportion to the total number of computations multiplied by that average bit precision. Hardware utilization is also improved by a bit-parallel architecture suited to the granularly quantized bit precision of the weights. We implement the proposed architecture on an FPGA and demonstrate that the execution cycles are reduced to 1/5.3 for ResNet-50 on ImageNet in comparison with a conventional method, while maintaining recognition accuracy.
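
    A hedged sketch of filter-by-filter quantization: each filter independently receives the smallest bit width whose quantization error stays within a tolerance, so the average bit precision (and hence, per the abstract, execution time and energy) drops. The uniform quantizer, the candidate bit widths, and the error tolerance are illustrative assumptions, not the paper's scheme.

```python
import numpy as np

def quantize_filterwise(filters, candidate_bits=(1, 2, 4, 8), tol=0.05):
    """Assign each filter the smallest bit width whose uniform-quantization
    error (relative L2) stays under `tol`; cost scales with the average
    bit width actually used."""
    out, widths = [], []
    for f in filters:                          # f: one filter's weights
        scale = np.max(np.abs(f)) or 1.0
        for b in candidate_bits:               # try narrow widths first
            q_levels = 2 ** (b - 1) - 1 or 1   # signed uniform grid
            q = np.round(f / scale * q_levels) / q_levels * scale
            if np.linalg.norm(f - q) <= tol * np.linalg.norm(f):
                break                          # this width is accurate enough
        out.append(q)
        widths.append(b)
    return out, sum(widths) / len(widths)      # quantized filters, avg bits

filters = [np.random.randn(3, 3) for _ in range(8)]
_, avg_bits = quantize_filterwise(filters)
print("average bit precision:", avg_bits)
```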

  • A High Performance FPGA-Based Sorting Accelerator with a Data Compression Mechanism

    Ryohei KOBAYASHI  Kenji KISE  

     
    PAPER-Computer System

      Publicized:
    2017/01/30
      Vol:
    E100-D No:5
      Page(s):
    1003-1015

    Sorting is an extremely important computation kernel that has been accelerated in many fields, such as databases, image processing, and genome analysis. With the advent of the Internet of Things (IoT) era driven by advances in mobile technology, a sorting method is needed that works in any environment: not only high-performance systems such as servers, but also machines with low computational performance such as embedded systems. In this paper, we present an FPGA-based sorting accelerator combining a sorting network and a merge sorter tree, customizable by tuning design parameters. The proposed FPGA accelerator sorts data sent from a host PC via the PCIe bus and sends the fully sorted sequence back. We also present a detailed analytical model that accurately estimates the sorting performance. With this model, designers can know in advance how fast a sorting hardware design will be and can implement the one that best fulfills their cost and performance constraints. Our experiments show that the proposed hardware achieves up to 19.5× the sorting performance of an Intel Core i7-3770K operating at 3.50GHz when sorting 256M 32-bit integer elements. However, this result is limited by insufficient memory bandwidth. To overcome this problem, we propose a data compression mechanism; the experimental results show that the sorting hardware with it achieves almost 90% of the estimated performance, while the hardware without it achieves only about 60%. To allow every designer to easily and freely use this accelerator, the RTL source code is released as open-source hardware.
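
    The sorting network + merge sorter tree combination can be pictured in software: a sorting network sorts small fixed-size blocks, and a tree of 2-input merge units repeatedly merges the sorted runs into one sequence. The sketch below uses heapq.merge as a stand-in for a streaming hardware merge unit.

```python
import heapq

def merge_sorter_tree(runs):
    """Software analogue of a merge sorter tree: repeatedly 2:1-merge
    sorted runs until one fully sorted sequence remains. In hardware,
    each tree level is a streaming 2-input merge unit."""
    while len(runs) > 1:
        merged = [list(heapq.merge(runs[i], runs[i + 1]))
                  for i in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:            # odd run passes through to the next level
            merged.append(runs[-1])
        runs = merged
    return runs[0]

# Leaves would come from a sorting network that sorts small blocks in parallel
blocks = [sorted(b) for b in ([5, 1], [7, 2], [6, 3], [8, 4])]
print(merge_sorter_tree(blocks))     # [1, 2, 3, 4, 5, 6, 7, 8]
```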

  • Initial Value Problem Formulation TDBEM with 4-D Domain Decomposition Method and Application to Wake Fields Analysis

    Hideki KAWAGUCHI  Thomas WEILAND  

     
    PAPER

      Vol:
    E100-C No:1
      Page(s):
    37-44

    The Time Domain Boundary Element Method (TDBEM) has advantages in the analysis of transient electromagnetic fields (wake fields) induced by a charged particle beam with a curved trajectory in a particle accelerator. On the other hand, the TDBEM requires far more memory and computation time than the Finite Difference Time Domain (FDTD) method or the Finite Integration Technique (FIT). This paper presents a comparison of the FDTD method with a 4-D domain-decomposition TDBEM based on an initial value problem formulation for a curved-trajectory electron beam, and an application to a full-model simulation of the bunch compressor section of a high-energy particle accelerator.

  • Performance Evaluation of a 3D-Stencil Library for Distributed Memory Array Accelerators

    Yoshikazu INAGAKI  Shinya TAKAMAEDA-YAMAZAKI  Jun YAO  Yasuhiko NAKASHIMA  

     
    PAPER-Architecture

      Publicized:
    2015/09/15
      Vol:
    E98-D No:12
      Page(s):
    2141-2149

    The Energy-aware Multi-mode Accelerator eXtension (EMAX) [24],[25] is equipped with distributed single-port local memories and ring-formed interconnections. The accelerator is designed to achieve extremely high throughput for scientific computations, big data, and image processing, as well as low power consumption. However, before mapping algorithms onto the accelerator, application developers require sufficient knowledge of the hardware organization and the specially designed instructions. They also need significant effort to tune code for execution efficiency when no well-designed compiler or library is available. To address this problem, we focus on library support for stencil (nearest-neighbor) computations, a class of algorithms commonly used in many partial differential equation (PDE) solvers. In this research, we address the following topics: (1) the system configuration, features, and mnemonics of EMAX; (2) instruction mapping techniques that reduce the amount of data to be read from main memory; (3) performance evaluation of the library for PDE solvers. Because the library can reuse local data across outer-loop iterations and map many instructions by unrolling the outer loops, the amount of data read from main memory is reduced to as little as 1/7 of that of hand-tuned code. In addition, the stencil library reduced execution time by 23% compared with a general-purpose processor.
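
    As a concrete instance of the stencil class the library targets, here is a 7-point 3-D nearest-neighbor sweep typical of PDE solvers; the coefficient and grid size are arbitrary assumptions. The outer time loop is where the library's local-data reuse across iterations pays off.

```python
import numpy as np

def stencil_7pt(u, c=1.0 / 6.0):
    """One 7-point 3-D stencil sweep (nearest neighbors): each interior
    point becomes a weighted sum of its six face-adjacent neighbors."""
    v = u.copy()
    v[1:-1, 1:-1, 1:-1] = c * (
        u[:-2, 1:-1, 1:-1] + u[2:, 1:-1, 1:-1] +   # z neighbors
        u[1:-1, :-2, 1:-1] + u[1:-1, 2:, 1:-1] +   # y neighbors
        u[1:-1, 1:-1, :-2] + u[1:-1, 1:-1, 2:])    # x neighbors
    return v

u = np.random.rand(8, 8, 8)
for _ in range(10):      # outer loop: where local data can be reused
    u = stencil_7pt(u)
```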

  • Data-Transfer-Aware Design of an FPGA-Based Heterogeneous Multicore Platform with Custom Accelerators

    Yasuhiro TAKEI  Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E98-A No:12
      Page(s):
    2658-2669

    For an FPGA-based heterogeneous multicore platform, we present a design methodology that reduces the total processing time by taking data transfer into account. The reconfigurability of recent FPGAs with hard CPU cores allows us to realize a single-chip heterogeneous processor optimized for a given application. The major problem in designing such heterogeneous processors is the data transfer between CPU cores and accelerator cores. The total processing time, including data transfers, is modeled considering the overlap of computation time and data-transfer time, and the optimal design parameters are then searched for.
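
    One plausible form of such an overlap-aware time model (an assumption for illustration; the abstract does not reproduce the paper's exact model): with full overlap, each stage costs the maximum of its computation and data-transfer time rather than their sum.

```python
def total_time(comp, xfer, overlap=True):
    """Estimate total processing time for a staged CPU+accelerator task.
    comp[i]/xfer[i]: computation and data-transfer times of stage i.
    With overlap, the transfer for stage i+1 hides behind the
    computation of stage i."""
    if not overlap:
        return sum(comp) + sum(xfer)
    # The first transfer cannot be hidden; afterwards the slower of the
    # concurrent computation and transfer dominates each stage
    return xfer[0] + sum(max(c, x) for c, x in zip(comp, xfer[1:] + [0]))

comp, xfer = [4.0, 6.0, 5.0], [3.0, 2.0, 4.0]
print(total_time(comp, xfer, overlap=False))  # 24.0
print(total_time(comp, xfer, overlap=True))   # 3 + 4 + 6 + 5 = 18.0
```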

  • New Directions for a Japanese Academic Backbone Network Open Access

    Shigeo URUSHIDANI  Shunji ABE  Kenjiro YAMANAKA  Kento AIDA  Shigetoshi YOKOYAMA  Hiroshi YAMADA  Motonori NAKAMURA  Kensuke FUKUDA  Michihiro KOIBUCHI  Shigeki YAMADA  

     
    INVITED PAPER

      Publicized:
    2014/12/11
      Vol:
    E98-D No:3
      Page(s):
    546-556

    This paper describes the architectural design and related services of a new Japanese academic backbone network, called SINET5, which will be launched in April 2016. The network will cover all 47 prefectures with 100-Gigabit Ethernet technology and connect each pair of prefectures with minimized latency. This will enable users to leverage evolving cloud-computing power as well as draw on a high-performance platform for data-intensive applications. The transmission layer will form a fully meshed, SDN-friendly, and reliable network. The services will evolve to be more dynamic and cloud-oriented in response to user demands. Cyber-security measures for the backbone network and tools for performance acceleration and visualization are also discussed.

  • Design and Implementation of a High Performance Crypto Coprocessor

    Shice NI  Yong DOU  Kai CHEN  Jie ZHOU  

     
    LETTER-Algorithms and Data Structures

      Vol:
    E97-A No:4
      Page(s):
    989-990

    This letter proposes a novel high-performance crypto coprocessor that relies on reconfigurable cryptographic blocks. We implement a prototype of the coprocessor on a Xilinx FPGA chip, adopting a pipelining technique to realize data parallelism. The results show that the coprocessor, running at 189MHz, outperforms a software-based SSL implementation.

  • A Reconfigurable Data-Path Accelerator Based on Single Flux Quantum Circuits Open Access

    Hiroshi KATAOKA  Hiroaki HONDA  Farhad MEHDIPOUR  Nobuyuki YOSHIKAWA  Akira FUJIMAKI  Hiroyuki AKAIKE  Naofumi TAKAGI  Kazuaki MURAKAMI  

     
    INVITED PAPER

      Vol:
    E97-C No:3
      Page(s):
    141-148

    The single flux quantum (SFQ) is expected to be a next-generation high-speed and low-power technology in the field of logic circuits. CMOS, the dominant technology for conventional processors, cannot simply be replaced with SFQ technology due to the difficulty of implementing feedback loops and conditional branches in SFQ circuits. This paper investigates the applicability of a reconfigurable data-path (RDP) accelerator based on SFQ circuits. The authors introduce detailed specifications of the SFQ-RDP architecture and compare its performance and power/performance ratio with those of a graphics processing unit (GPU). The results show up to 1600 times higher efficiency in terms of Flops/W (floating-point operations per second per watt) for some high-performance computing applications.

  • High-Speed Fully-Adaptable CRC Accelerators

    Amila AKAGIC  Hideharu AMANO  

     
    PAPER-Computer System

      Vol:
    E96-D No:6
      Page(s):
    1299-1308

    Cyclic Redundancy Check (CRC) is a well-known error detection scheme used to detect corruption of digital content in digital networks and storage devices. Since it is a compute-intensive process that adversely affects performance, hardware acceleration using FPGAs has been tried and satisfactory performance has been achieved. However, the recent extended usage of networks and storage systems requires various capabilities for various CRC standards. Traditional hardware designs based on the LFSR (Linear Feedback Shift Register) tend to have a fixed structure without such flexibility. Here, a fully-adaptable CRC accelerator based on a table-based algorithm is proposed. The table-based algorithm is a flexible method commonly used in software implementations. It has rarely been implemented in hardware, since its operational speed was believed to be insufficient. However, with a pipelined structure and efficient use of the memory modules in FPGAs, table-based fixed CRC accelerators turn out to achieve better performance than traditional implementations. Based on this implementation, a fully-adaptable CRC accelerator is proposed that eliminates the need for many non-adaptable CRC implementations. The accelerator can process an arbitrary number of input data and generate the CRC for any known CRC standard with a generator polynomial of up to 65 bits at run time. Further, we modify the table generation algorithm to decrease its space complexity from O(nm) to O(n). On a Xilinx Virtex-6 LX550T board, the fully-adaptable accelerators occupy 1-2% of the area and produce a maximum of 289.8 Gbps at 283.1 MHz if BRAM is deployed, or 1.6-14% of the area for 418 Gbps at 408.9 MHz if the tables are implemented in logic. The proposed architecture enables further throughput expansion by increasing the number of input bits M processed at a time.
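
    For reference, the classic table-based CRC algorithm the accelerator builds on looks as follows in software: a 256-entry table turns CRC computation into one lookup per input byte, for any generator polynomial. This sketch assumes a non-reflected, MSB-first convention and a polynomial width of at least 8 bits; the paper's run-time table generation and 65-bit support are beyond this illustration.

```python
def make_crc_table(poly, width):
    """Byte-wise CRC table for a generator polynomial (MSB-first,
    non-reflected, width >= 8). 256 entries of width-bit remainders."""
    top = 1 << (width - 1)
    mask = (1 << width) - 1
    table = []
    for byte in range(256):
        crc = byte << (width - 8)
        for _ in range(8):
            # Shift out one bit; XOR in the polynomial if it was a 1
            crc = ((crc << 1) ^ poly) if crc & top else (crc << 1)
        table.append(crc & mask)
    return table

def crc(data, table, width, init=0):
    """Table-driven CRC: one table lookup per input byte."""
    mask = (1 << width) - 1
    reg = init
    for b in data:
        idx = ((reg >> (width - 8)) ^ b) & 0xFF
        reg = ((reg << 8) ^ table[idx]) & mask
    return reg

table = make_crc_table(0x04C11DB7, 32)            # CRC-32 polynomial
print(hex(crc(b"123456789", table, 32)))          # CRC under this convention
```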

  • Design and Implementation of a Handshake Join Architecture on FPGA

    Yasin OGE  Takefumi MIYOSHI  Hideyuki KAWASHIMA  Tsutomu YOSHINAGA  

     
    PAPER-Computer Architecture

      Vol:
    E95-D No:12
      Page(s):
    2919-2927

    A novel design is proposed to implement highly parallel stream join operators on a field-programmable gate array (FPGA), by adapting the handshake join algorithm for hardware implementation. The proposed design is evaluated in terms of hardware resource usage, maximum clock frequency, and performance. Experimental results indicate that the proposed implementation can handle considerably high input rates, especially at low match rates. Results of simulations conducted to optimize the size of the buffers in the join and merge units give new intuition regarding static and adaptive buffer tuning in handshake join.
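
    A minimal software analogue of the window-based stream join that handshake join parallelizes: each arriving tuple is probed against the opposite stream's window. In handshake join proper, the two streams flow in opposite directions through a chain of join cores that each hold window segments; the single-core sketch below only shows the join semantics.

```python
from collections import deque

def windowed_stream_join(left, right, w, match):
    """Probe each new tuple against the opposite stream's last `w`
    tuples (sliding windows). Handshake join distributes these windows
    over a chain of FPGA cores, with the streams flowing through in
    opposite directions."""
    lwin, rwin = deque(maxlen=w), deque(maxlen=w)
    out = []
    for l, r in zip(left, right):       # tuples assumed to arrive alternately
        out += [(l, x) for x in rwin if match(l, x)]
        lwin.append(l)
        out += [(x, r) for x in lwin if match(x, r)]
        rwin.append(r)
    return out

pairs = windowed_stream_join(range(8), range(8), w=4, match=lambda a, b: a == b)
print(pairs)   # equality join over 4-tuple windows
```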

  • Acceleration of Block Matching on a Low-Power Heterogeneous Multi-Core Processor Based on DTU Data-Transfer with Data Re-Allocation

    Yoshitaka HIRAMATSU  Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Toru NOJIRI  Kunio UCHIYAMA  Michitaka KAMEYAMA  

     
    PAPER-Integrated Electronics

      Vol:
    E95-C No:12
      Page(s):
    1872-1882

    Long data-transfer times among different cores are a major problem in heterogeneous multi-core processors. This paper presents a method to accelerate data transfers by exploiting data-transfer units (DTUs) together with complex memory allocation. We used block matching, which is very common in image processing, to evaluate our technique. The proposed method reduces the data-transfer time by more than 42% compared with earlier works that use CPU-based data transfers. Moreover, the total processing time is only 15 ms for a VGA image with 16×16-pixel blocks.
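
    The evaluated kernel, full-search block matching with a sum-of-absolute-differences (SAD) cost, can be sketched as follows; the 16×16 block size matches the abstract, while the search range and interfaces are assumptions.

```python
import numpy as np

def block_match(ref, cur, bx, by, bsize=16, srange=8):
    """Find the motion vector of the block at (bx, by) in `cur` by full
    search over `ref`, minimizing the sum of absolute differences."""
    block = cur[by:by + bsize, bx:bx + bsize].astype(np.int32)
    best, best_mv = None, (0, 0)
    for dy in range(-srange, srange + 1):
        for dx in range(-srange, srange + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bsize > ref.shape[0] or x + bsize > ref.shape[1]:
                continue                       # candidate falls off the frame
            cand = ref[y:y + bsize, x:x + bsize].astype(np.int32)
            sad = np.abs(cand - block).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dx, dy)
    return best_mv, best

ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
cur = np.roll(ref, (2, 3), axis=(0, 1))        # shift content down 2, right 3
print(block_match(ref, cur, 24, 24))           # recovers ((-3, -2), 0)
```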

  • A Processor Accelerator for Software Decoding of Reed-Solomon Codes

    Kazuhito ITO  Keisuke NASU  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E95-A No:5
      Page(s):
    884-893

    Decoding of Reed-Solomon (RS) codes requires many arithmetic operations in the Galois field. While software decoding of RS codes has the advantage of flexibly supporting RS codes of variable parameters, it is slower than dedicated hardware RS decoders because Galois-field arithmetic on an ordinary processor requires many instruction steps. To achieve fast software decoding of RS codes, it is effective to accelerate Galois operations by both dedicated circuitry and parallel processing. In this paper, an accelerator attached to the base processor is proposed that speeds up the software decoding of RS codes by parallel execution of Galois operations.
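
    The Galois-field operation being accelerated is typified by GF(2^8) multiplication: in software it takes a loop of conditional XORs and shifts (many instruction steps), whereas dedicated circuitry evaluates the whole product at once. The polynomial 0x11D below is a common choice for RS codes, assumed here for illustration.

```python
def gf256_mul(a, b, poly=0x11D):
    """Multiply in GF(2^8) with the common primitive polynomial
    x^8+x^4+x^3+x^2+1 (0x11D). Each iteration is a conditional XOR plus
    shifts and a reduction; a hardware unit collapses all eight."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a               # accumulate the current partial product
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly            # reduce modulo the field polynomial
    return r

# Decoding steps such as syndrome evaluation are built from many such
# multiplies, which the accelerator executes in parallel
print(hex(gf256_mul(0x57, 0x83)))   # 0x31 for this polynomial
```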

  • An Energy Efficient Sensor Network Processor with Latency-Aware Adaptive Compression

    Yongpan LIU  Shuangchen LI  Jue WANG  Beihua YING  Huazhong YANG  

     
    PAPER-Integrated Electronics

      Vol:
    E94-C No:7
      Page(s):
    1220-1228

    This paper proposes a novel platform for sensor nodes that addresses the energy and latency challenges. It consists of a processor, an adaptive compression module, and several compression accelerators. We fabricated the proposed chip in a 0.18µm HJTC CMOS technology. Compared with a software-based solution, the hardware-assisted compression reduces energy by over 98% and latency by 212%. We also balance the energy and latency metrics using an adaptive module: according to its scheduling algorithm, the module tunes the state of the compression accelerator as well as the sampling frequency of the online sensor. For example, given a 9µs constraint for a 1-byte operation, it reduces latency by 34% with an energy overhead of less than 5%.

  • A Processor Accelerator for Software Decoding of BCH Codes

    Kazuhito ITO  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E93-A No:7
      Page(s):
    1329-1337

    The BCH code is one of the well-known error correction codes, and its decoding contains many operations in the Galois field. These operations require many instruction steps, or a large memory area for look-up tables, on ordinary processors. While dedicated hardware BCH decoders achieve higher decoding speed than software, the advantage of software decoding is its flexibility to decode BCH codes of variable parameters. In this paper, an auxiliary circuit to be embedded in a pipelined processor is proposed that accelerates the software decoding of various BCH codes.
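
    The look-up-table route the abstract mentions is typically the log/antilog method: a GF multiply becomes two table lookups plus an addition modulo 255, trading memory for instruction steps; the proposed embedded circuit avoids both costs. A sketch under the same assumed polynomial as in the RS entry above:

```python
def build_gf_tables(poly=0x11D, order=255):
    """Log/antilog tables for GF(2^8): exp[i] = alpha^i and
    log[exp[i]] = i, with alpha = 2 primitive for this polynomial."""
    exp = [0] * (order + 1)
    log = [0] * 256
    x = 1
    for i in range(order):
        exp[i] = x
        log[x] = i
        x <<= 1
        if x & 0x100:
            x ^= poly            # reduce modulo the field polynomial
    exp[order] = exp[0]          # convenience wrap-around entry
    return exp, log

def gf_mul_table(a, b, exp, log):
    """GF multiply as two table lookups plus an addition mod 255 --
    the memory-for-instructions trade-off the abstract describes."""
    if a == 0 or b == 0:
        return 0
    return exp[(log[a] + log[b]) % 255]

exp, log = build_gf_tables()
print(hex(gf_mul_table(0x57, 0x83, exp, log)))   # 0x31, same as bitwise form
```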

