Yibo FAN Leilei HUANG Kewei CHEN Xiaoyang ZENG
Neural networks have been among the most useful techniques in speech recognition, language translation and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural network (RNN), has been widely implemented on CPUs and GPUs. However, those software implementations offer poor parallelism, while the existing hardware implementations lack configurability. To fill this gap, a highly configurable 7.62 GOP/s hardware implementation of LSTM is proposed in this paper. To achieve this goal, the work flow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the complexity of the structure; and the data type, logistic sigmoid (σ) function and hyperbolic tangent (tanh) function are carefully optimized to balance hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s @ 238 MHz on an XCZU6EG FPGA, using only 3K look-up tables (LUTs). Compared with an implementation on an Intel Xeon E5-2620 CPU @ 2.10 GHz, this work achieves about a 90× speedup for small networks and a 25× speedup for large ones. Its resource consumption is also much lower than that of state-of-the-art works.
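To make concrete where the σ and tanh functions mentioned in the abstract above enter the computation, the following is a minimal sketch of one standard LSTM time step. The function names, the gate keys `"i"`, `"f"`, `"o"`, `"g"`, and the piecewise-linear `hard_sigmoid` and `hard_tanh` stand-ins are assumptions made for illustration; the abstract does not state which approximation or data type the paper uses.

```python
import math


def sigmoid(x):
    """Exact logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_sigmoid(x):
    """Piecewise-linear stand-in (assumption, not the paper's choice)."""
    return min(1.0, max(0.0, 0.25 * x + 0.5))


def hard_tanh(x):
    """Piecewise-linear stand-in for tanh (assumption)."""
    return min(1.0, max(-1.0, x))


def lstm_step(x, h, c, W, b, act_sigma=sigmoid, act_tanh=math.tanh):
    """One LSTM time step.

    x, h, c are lists of floats; W[name] is a list of weight rows applied
    to the concatenation [x, h]; b[name] is the matching bias list.
    """
    z = x + h  # list concatenation of input and previous hidden state

    def affine(name):
        return [sum(w * v for w, v in zip(row, z)) + bias
                for row, bias in zip(W[name], b[name])]

    i = [act_sigma(v) for v in affine("i")]  # input gate
    f = [act_sigma(v) for v in affine("f")]  # forget gate
    o = [act_sigma(v) for v in affine("o")]  # output gate
    g = [act_tanh(v) for v in affine("g")]   # candidate cell value

    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * act_tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new
```

Passing `act_sigma=hard_sigmoid, act_tanh=hard_tanh` swaps in the cheaper activations, which is one simple way to explore the trade-off between hardware cost and accuracy.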
Mizuki YAMADA Keigo TAKEUCHI Kiyoyuki KOIKE
We propose hardware-aware sum-product (SP) decoding for low-density parity-check codes. To simplify an implementation using a fixed-point number representation, we transform SP decoding in the logarithm domain to that in the decision domain. A polynomial approximation is proposed to implement an update rule of the proposed SP decoding efficiently. Numerical simulations show that the approximate SP decoding achieves almost the same performance as the exact SP decoding when an appropriate degree in the polynomial approximation is used, that it improves the convergence properties of SP and normalized min-sum decoding in the high signal-to-noise ratio regime, and that it is robust against quantization errors.
Leilei HUANG Yibo FAN Chenhao GU Xiaoyang ZENG
The High Efficiency Video Coding (HEVC) standard is becoming one of the most widespread video coding standards in the world. As the successor of the H.264 standard, it aims to provide much better encoding performance. To fulfill this goal, the standard introduces several new notations along with the corresponding computation processes. Among those computation processes, integer motion estimation (IME) is one of the bottlenecks because of the complex partitions of the inter prediction units (PUs) and the large search window commonly adopted. Many algorithms have been proposed to address this issue, and they usually emphasize a large search window and a large amount of computation. However, the coding effort should be related to the scene. More specifically, for relatively static videos, a small search window along with a simple search scheme should be adopted to reduce time cost and power consumption. In view of this, a micro-code-based IME engine is proposed in this paper, which can be applied with search schemes of different complexity. To test the performance, three different search schemes based on this engine are designed and evaluated under HEVC test model (HM) 16.9, achieving a B-D rate increase of 0.55/-0.07/-0.14%. Compared with our previous work, the hardware implementation is optimized to reduce the SRAM bits by 64.2% and the logic gate count by 32.8%. The final design can support 4K×2K @139/85/37fps videos @500MHz.
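The cost of integer motion estimation grows with the search window, which is the trade-off the abstract above describes. The sketch below shows a plain full search using the sum of absolute differences (SAD); it is a textbook baseline and not the paper's micro-code-based engine, and the function names and frame layout (lists of rows) are assumptions.

```python
def sad(cur, ref, bx, by, rx, ry, n):
    """Sum of absolute differences between an n x n block of `cur` at
    (bx, by) and an n x n block of `ref` at (rx, ry)."""
    return sum(abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))


def full_search(cur, ref, bx, by, n, search_range):
    """Try every integer displacement within +/- search_range and return
    (best_cost, dx, dy). Candidates outside the reference frame are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

The number of candidates is (2 × search_range + 1)², so halving the range for static scenes cuts the SAD evaluations roughly fourfold.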
Koichi MITSUNARI Yoshinori TAKEUCHI Masaharu IMAI Jaehoon YU
A significant portion of computational resources of embedded systems for visual detection is dedicated to feature extraction, and this severely affects the detection accuracy and processing performance of the system. To solve this problem, we propose a feature descriptor based on histograms of oriented gradients (HOG) consisting of simple linear algebra that can extract equivalent information to the conventional HOG feature descriptor at a low computational cost. In an evaluation, a leading-edge detection algorithm with this decomposed vector HOG (DV-HOG) achieved equivalent or better detection accuracy compared with conventional HOG feature descriptors. A hardware implementation of DV-HOG occupies approximately 14.2 times smaller cell area than that of a conventional HOG implementation.
Akihide NAGAMINE Kanshiro KASHIKI Fumio WATANABE Jiro HIROKAWA
As one function of the wireless distributed network (WDN), which enables flexible wireless networks, dynamic spectrum access is expected to be applied to OFDM systems for superior radio resource management. As a basic technology for such a WDN, our study deals with OFDM signal detection based on its cyclostationary feature. Previous relevant studies mainly relied on software simulations based on the Monte Carlo method. This paper analytically clarifies the relationship between the design parameters of the detector and its detection performance. The detection performance is formulated using multiple design parameters, including the transfer function of the receive filter. A hardware experiment with radio frequency (RF) signals is also carried out using a detector consisting of an RF unit and an FPGA. It is thereby verified that the detection characteristics, represented by the false-alarm and non-detection probabilities calculated by the analytical formula, agree well with those obtained in the hardware experiment. Our analysis and experimental results are useful for designing the parameters of the signal detector to satisfy required performance criteria.
Masaki NAKANISHI Miki MATSUYAMA Yumi YOKOO
Quantum computer simulators play an important role when we evaluate quantum algorithms. Quantum computation can be regarded as parallel computation in some sense, and thus, it is suitable to implement a simulator on hardware that can process a lot of operations in parallel. In this paper, we propose a hardware quantum computer simulator. The proposed simulator is based on the register reordering method that shifts and swaps registers containing probability amplitudes so that the probability amplitudes of target basis states can be quickly selected. This reduces the number of large multiplexers and improves clock frequency. We implement the simulator on an FPGA. Experiments show that the proposed simulator has scalability in terms of the number of quantum bits, and can simulate quantum algorithms faster than software simulators.
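As background for why selecting pairs of probability amplitudes matters, the following sketch shows the pairwise update that a state-vector simulator performs when a single-qubit gate is applied. It is a generic software illustration, not the register reordering method of the paper; the function name and the convention that qubit 0 is the least significant bit are assumptions.

```python
import math


def apply_single_qubit_gate(state, gate, target):
    """Apply a 2x2 `gate` to qubit `target` of a state vector.

    `state` is a list of 2**n complex amplitudes indexed by basis state;
    qubit 0 is taken to be the least significant bit (assumption).
    """
    new_state = list(state)
    stride = 1 << target
    for base in range(len(state)):
        if base & stride:
            continue  # visit each amplitude pair once, from its lower index
        a0, a1 = state[base], state[base | stride]
        new_state[base] = gate[0][0] * a0 + gate[0][1] * a1
        new_state[base | stride] = gate[1][0] * a0 + gate[1][1] * a1
    return new_state


S = 1.0 / math.sqrt(2.0)
HADAMARD = [[S, S], [S, -S]]
```

Every pair update is independent of the others, which is the parallelism a hardware simulator can exploit; the pairs are `stride` entries apart, so the access pattern depends on the target qubit.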
Engin Cemal MENGÜÇ Nurettin ACIR
The Lyapunov stability theory-based adaptive filter (LST-AF) is a robust filtering algorithm in which the tracking error converges quickly and asymptotically to zero. Recently, software implementations of the LST-AF algorithm have been used effectively in engineering applications such as tracking, prediction, noise cancellation and system identification. A hardware implementation therefore becomes necessary in many cases where real-time processing is needed. In this paper, an implementation of the LST-AF algorithm on a Field Programmable Gate Array (FPGA) is realized, to our knowledge, for the first time. The proposed FPGA implementation is evaluated on two main benchmark problems: i) tracking of an artificial signal and a Henon chaotic signal, and ii) estimation of filter parameters using a system identification model. Experimental results are presented comparatively to assess accuracy, performance and logic occupation. The results show that the proposed hardware implementation not only preserves the capabilities of software versions of the LST-AF algorithm but also achieves better performance than they do.
Qingyi GU Abdullah AL NOMAN Tadayoshi AOYAMA Takeshi TAKAKI Idaku ISHII
In this paper, we present a high frame rate (HFR) vision system that can automatically control its exposure time by executing brightness histogram-based image processing in real time at a high frame rate. Our aim is to obtain high-quality HFR images for robust image processing of high-speed phenomena even under dynamically changing illumination, such as lamps flickering at 100 Hz, corresponding to an AC power supply at 50 / 60 Hz. Our vision system can simultaneously calculate a 256-bin brightness histogram for an 8-bit gray image of 512×512 pixels at 2000 fps by implementing a brightness histogram calculation circuit module as parallel hardware logic on an FPGA-based high-speed vision platform. Based on the HFR brightness histogram calculation, our method realizes automatic exposure (AE) control of 512×512 images at 2000 fps using our proposed AE algorithm. The proposed AE algorithm can maximize the number of pixels in the effective range of the brightness histogram, thus excluding much darker and brighter pixels, to improve the dynamic range of the captured image without over- and under-exposure. The effectiveness of our HFR system with AE control is evaluated according to experimental results for several scenes with illumination flickering at 100 Hz, which is too fast for the human eye to see.
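The quantity that the AE algorithm in the abstract above maximizes, the number of pixels in the effective range of the brightness histogram, can be sketched as follows. The range limits `low=16` and `high=239`, the function names, and the idea of picking among a few candidate exposure times are assumptions for illustration; the abstract does not give the paper's thresholds or control law.

```python
def brightness_histogram(pixels):
    """256-bin histogram of 8-bit gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    return hist


def effective_pixel_count(hist, low=16, high=239):
    """Pixels whose brightness lies in [low, high], i.e. neither
    too dark nor too bright (limits are hypothetical)."""
    return sum(hist[low:high + 1])


def choose_exposure(frames_by_exposure, low=16, high=239):
    """Given {exposure_time: pixel list}, return the exposure time whose
    frame has the most pixels inside the effective range."""
    return max(
        frames_by_exposure,
        key=lambda t: effective_pixel_count(
            brightness_histogram(frames_by_exposure[t]), low, high),
    )
```

In the hardware system the histogram is computed by a parallel circuit at 2000 fps; this sequential loop only mirrors the arithmetic.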
Yuto NAKANO Kazuhide FUKUSHIMA Shinsaku KIYOMOTO Tsukasa ISHIGURO Yutaka MIYAKE Toshiaki TANAKA Kouichi SAKURAI
KCipher-2 is a word-oriented stream cipher and an ISO/IEC 18033 standard. It is listed as a CRYPTREC cryptographic algorithm for Japanese governmental use. It consists of two feedback shift registers and a non-linear function. The size of each register in KCipher-2 is 32 bits and the non-linear function mainly applies 32-bit operations. Therefore, it can be efficiently implemented in software. SNOW-family stream ciphers are also word-oriented stream ciphers, and their high performance has already been demonstrated. We propose optimised implementations of KCipher-2 and compare their performance to that of the SNOW-family and other eSTREAM portfolios. The fastest algorithm is SNOW 2.0 and KCipher-2 is the second fastest despite the complicated irregular clocking mechanism. However, KCipher-2 is the fastest of the feasible algorithms, as SNOW 2.0 has been shown to have a security flaw. We also optimise the hardware implementation for the Virtex-5 field-programmable gate array (FPGA) and show two implementations. The first implementation is a rather straightforward optimisation and achieves 16,153 Mbps with 732 slices. In the second implementation, we duplicate the non-linear function using the structural advantage of KCipher-2 and achieve 17,354 Mbps with 813 slices. Our implementation of KCipher-2 is around three times faster than those of the SNOW-family, and its efficiency, evaluated as “Throughput/Area (Mbps/slice)”, is 3.6 times better than that of SNOW 2.0 and 8.5 times better than that of SNOW 3G. These syntheses are performed using Xilinx ISE version 12.4.
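The "Throughput/Area (Mbps/slice)" metric can be computed directly from the two figures quoted above. The short calculation below uses only the numbers stated in the abstract; the function and variable names are illustrative.

```python
def efficiency(throughput_mbps, slices):
    """Throughput/Area in Mbps per slice."""
    return throughput_mbps / slices


# Figures quoted in the abstract for the two Virtex-5 implementations.
first = efficiency(16153, 732)   # straightforward optimisation
second = efficiency(17354, 813)  # duplicated non-linear function

print(f"first:  {first:.2f} Mbps/slice")
print(f"second: {second:.2f} Mbps/slice")
print(f"throughput ratio: {17354 / 16153:.3f}, area ratio: {813 / 732:.3f}")
```

By this arithmetic the second implementation raises throughput by about 7% while using about 11% more slices, so its Mbps/slice figure is slightly lower than that of the first.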
Takahiro SUZUKI Takeshi IKENAGA
Scale-Invariant Feature Transform (SIFT) has lately attracted attention in computer vision as a robust keypoint detection algorithm that is invariant to scale, rotation and illumination changes. However, its computational complexity is too high for practical real-time applications. This paper proposes a low-complexity keypoint extraction algorithm based on the SIFT descriptor and the use of a database, together with its real-time hardware implementation for Full-HD resolution video. The proposed algorithm computes the SIFT descriptor on keypoints obtained by corner detection and selects a scale from the database. The keypoint detection and descriptor computation modules can be parallelized in hardware because, in the proposed algorithm, these modules do not depend on each other, in contrast with SIFT, which computes a scale. The processing time of descriptor computation in this hardware is independent of the number of keypoints because descriptor generation uses a pixel-based pipeline structure. Evaluation results show that the proposed algorithm in software is 12 times faster than SIFT. Moreover, the proposed hardware on FPGA is 427 times faster than SIFT and 61 times faster than the proposed algorithm in software. The proposed hardware performs keypoint extraction and matching at 60 fps for Full-HD video.
Kazuhiko MITSUYAMA Tetsuomi IKEDA Tomoaki OHTSUKI
Multiple-input multiple-output (MIMO) systems with antenna selection are practical in that they can alleviate the computational complexity at the receiver and achieve good reception performance. Channel correlation, not just carrier-to-noise ratio (CNR), has a great impact on reception performance in MIMO channels. We propose a practical receive antenna subset selection algorithm with reduced complexity that uses the condition number of the partial channel matrix and a predetermined CNR threshold. This paper describes the algorithm and its performance evaluation by both computer simulation and indoor experiments using a prototype receiver and received signals obtained in an actual mobile outdoor experiment. The results confirm that our proposed method provides good bit error rate performance by setting the CNR threshold properly.
This paper proposes an enhanced feature detection method for the OFDM signals of digital TV (DTV) standards, namely Digital Video Broadcasting-Terrestrial (DVB-T) and Integrated Services Digital Broadcasting-Terrestrial (ISDB-T). The proposed method exploits a property of the time-domain sliding correlation between DTV signals and the pilots that are inserted into OFDM symbols. Some correlation outputs are much larger than the remaining outputs and are called correlation peaks here, and the distance between their positions in the correlation output sequence stays constant regardless of the received DTV timing. The proposed method then derives a sensing test statistic with improved SNR by aggregating the correlation peaks based on their positions. The performance of the proposed method is evaluated by both computer simulation and hardware implementation. Simulation results for DVB-T detection verify that, compared to the optimal conventional sensing method, the proposed method achieves superior sensing performance. It reduces sampling time by about 25% for the same sensing performance while increasing computational complexity by around 0.0001%. Hardware results further verify that the proposed method is able to accurately detect ISDB-T at a low SNR of -14.5 dB by using samples from 8 OFDM symbol durations.
Jun GAO Minxuan ZHANG Zuocheng XING Chaochao FENG
This paper proposes a Reduced Explicitly Parallel Instruction Computing Processor (REPICP), which is an independently designed, 64-bit, general-purpose microprocessor. The REPICP, based on the EPIC architecture, overcomes the disadvantages of hardware-based superscalar and software-based Very Long Instruction Word (VLIW) designs and utilizes the cooperation of compiler and hardware to enhance Instruction-Level Parallelism (ILP). In the REPICP, we propose the Optimized Lock-Step execution Model (OLSM) and an instruction control pipeline method. We also propose reduced innovative methods to optimize the design. The REPICP is fabricated in the Artisan 0.13 µm Nominal 1P8M process with 57 M transistors. The die size of the REPICP is 100 mm² (10×10), and the processor consumes only 12 W when running at 300 MHz.
Chunyi SONG Mohammad Azizur RAHMAN Hiroshi HARADA
This paper proposes a sensing method for TV signals of the DVB-T standard to realize effective TV White Space (TVWS) communication. In the TVWS technology trial organized by the Infocomm Development Authority (iDA) of Singapore, the requirement on sensing level and sensing time is to detect a DVB-T signal at a level of -120 dBm over an 8 MHz channel with a sensing time below 1 second. To fulfill such a strict sensing requirement, we propose a smart sensing method that combines feature detection and energy detection (CFED) and uses dynamic threshold selection (DTS) based on a threshold table to improve sensing robustness to noise uncertainty. The DTS-based CFED (DTS-CFED) is evaluated by computer simulations and is also implemented in a hardware sensing prototype. The results show that the DTS-CFED achieves a detection probability above 0.9 for a target false alarm probability of 0.1 for DVB-T signals at a level of -120 dBm over an 8 MHz channel with a sensing time of 0.1 second.
Sanghoon KWAK Jinwook KIM Dongsoo HAR
The intra-prediction unit is an essential part of an H.264 codec, since it reduces the amount of data to be encoded by predicting pixel values (luminance and chrominance) from their neighboring blocks. A dedicated hardware implementation of the intra-prediction unit is required for real-time encoding and decoding of high-resolution video data. To develop a cost-effective intra-prediction unit, this paper proposes a novel architecture for the intra-predictor generator, the core part of the intra-prediction unit. The proposed intra-predictor generator enables the intra-prediction unit to achieve a significant reduction in clock cycles with approximately the same gate count, as compared to Huang's work [3].
Rong CHEN Xun FAN Youyun XU Haibin ZHANG
Iterative receivers, which perform MMSE detection and decoding iteratively, can provide significant performance improvement compared with noniterative methods. However, due to its high computational cost and numerical instability, conventional MMSE detection using a priori information cannot be implemented in hardware. In this letter, we propose a new iterative receiver that is division-free and numerically stable; we then analyze the results of a fixed-point simulation and present the hardware implementation architecture.
This paper proposes a new watermarking algorithm and its hardware implementation, in which the watermarking process and an image compression process can operate in conjunction and in parallel without degrading the performance of the compression process. The goal of the proposed watermarking scheme is to provide a basis for asserting ownership and for authenticating the integrity of the watermark-embedded image by detecting errors and their positions without the original image (blind watermarking). Our watermarking scheme replaces one or several bit-plane(s) of the DC subband with the watermark after 2DDWT (2-Dimensional Discrete Wavelet Transform) decomposition, which is the basic transformation in DWT-based image compression such as JPEG2000. If more than one bit-plane is involved, the position at which each watermark bit is embedded is randomly selected among the bit-planes by a random number generated with an LFSR (Linear Feedback Shift Register). Experimental results showed that, for all the considered attacks except high compression by JPEG, the error ratios in the watermarks extracted by our algorithm were below 3% and the extracted watermarks were unambiguously recognizable in all cases. The FPGA implementation operated stably at a clock frequency of 82 MHz. This hardware was merged into a DWT-based image compression codec that runs in real time at a clock frequency of 66 MHz, which resulted in real-time operation of the codec and watermarking together at 66 MHz. The watermarking scheme used 4,037 LABs (24%) of the hardware resources of an APEX20KC EP20K400CF672-7 from Altera.
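The LFSR-driven choice of bit-plane described above can be sketched as follows. The 16-bit LFSR with taps 16, 14, 13 and 11, the seed value, the candidate planes, and the treatment of DC-subband coefficients as non-negative integers are all assumptions for illustration; the abstract does not specify the paper's LFSR or plane indices.

```python
def lfsr16(seed):
    """16-bit Fibonacci LFSR with taps 16, 14, 13, 11, a well-known
    maximal-length configuration (assumed here, not taken from the paper)."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("LFSR seed must be non-zero")
    while True:
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state


def embed(coeffs, bits, planes, seed):
    """Write each watermark bit into one bit-plane of the matching
    coefficient; the plane is picked from `planes` by the LFSR."""
    rng = lfsr16(seed)
    out = []
    for c, b in zip(coeffs, bits):
        plane = planes[next(rng) % len(planes)]
        out.append((c & ~(1 << plane)) | (b << plane))
    return out


def extract(coeffs, planes, seed):
    """Blind extraction: regenerate the same plane sequence from the seed
    and read the bits back, without needing the original image."""
    rng = lfsr16(seed)
    return [(c >> planes[next(rng) % len(planes)]) & 1 for c in coeffs]
```

Because the extractor only needs the seed and the list of candidate planes, the scheme stays blind: the original coefficients are never consulted.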
In this paper, a low-complexity and high-resolution algorithm for estimating the magnitude of complex numbers is presented. Starting from a review of the prior art, the new algorithm has been derived to improve precision without any penalty in hardware complexity. As a case example, a semi-custom VLSI implementation for 10-bit 2's complement input data has been performed. An improvement of nearly one order of magnitude in mean square error and mean error has been demonstrated for a hardware complexity increase of roughly 34% with respect to previously presented solutions.
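For context on the kind of prior art such algorithms build on, the sketch below shows the classic "alpha max plus beta min" magnitude approximation, which avoids the square root. It is not the algorithm proposed in the paper; the coefficients α = 1 and β = 1/2 are one common shift-friendly choice, and the function names are assumptions.

```python
import math


def magnitude_estimate(i, q, alpha=1.0, beta=0.5):
    """Approximate |i + jq| as alpha*max(|i|,|q|) + beta*min(|i|,|q|)."""
    a, b = abs(i), abs(q)
    return alpha * max(a, b) + beta * min(a, b)


def relative_error(i, q, alpha=1.0, beta=0.5):
    """Signed relative error of the estimate against the exact magnitude."""
    exact = math.hypot(i, q)
    return (magnitude_estimate(i, q, alpha, beta) - exact) / exact
```

With α = 1 and β = 1/2 the estimate never falls below the true magnitude, and its worst-case overshoot is √1.25 − 1, about 11.8%, reached when the smaller component is half the larger one. Reducing this kind of error at comparable hardware cost is the goal the abstract describes.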
Kei SAKAGUCHI Jun-ichi TAKADA Kiyomichi ARAKI
The implementation of a Multi-Input Multi-Output (MIMO) channel sounder is considered, taking hardware cost and realtime measurement into account. A notable difference between MIMO and conventional Single-Input Multi-Output (SIMO) channel sounding is that the MIMO sounder needs some kind of multiplexing to distinguish the transmitting antennas. We compared three types of multiplexing (TDM, FDM, and CDM) for sounding purposes, and chose an FDM-based technique to achieve cost effectiveness and realtime measurement. In the framework of FDM, we have proposed an algorithm to estimate MIMO channel parameters. Furthermore, the proposed algorithm was implemented in hardware, and its validity was evaluated through measurements in an anechoic chamber.
Kazuo TANADA Hiroshi KUBO Atsushi IWASE Makoto MIYAKE
This paper proposes an adaptive list-output Viterbi equalizer (LVE) with fast compare-select operation, in order to achieve a good trade-off between bit error rate (BER) performance and processing speed. An LVE, which keeps several survivors for each state, has good BER performance in the presence of wide-spread intersymbol interference. However, the LVE suffers from large processing delay due to its sorting-based compare-select operation. The proposed adaptive LVE greatly reduces its processing delay, because it simplifies compare-select operation. In addition, computer simulation shows that the proposed LVE causes only slight BER performance degradation due to its simplification of compare-select operation. Thus, the proposed LVE achieves better BER performance than decision-feedback sequence estimation (DFSE) without an increase in processing delay.