Katsumi DOSAKA Daisuke OGAWA Takahito KUSUMOTO Masayuki MIYAMA Yoshio MATSUDA
Architecture of a low power Ternary Content Addressable Memory (TCAM) is proposed. The TCAM is a powerful engine for search and sort processing, but it has two serious problems, large power consumption and large power line noise. To solve these problems, we have developed a charge recycling scheme for match lines and search lines. A combination of the newly introduced PMOS CAM cell together with the conventional NMOS CAM cell realizes match line charge recycling. A checkerboard arrangement of the NMOS and the PMOS cell array enables search line charge recycling. By using these technologies, the power consumption of the TCAM can be reduced to 50% of conventional designs, and as a result, the power line noise is also reduced. An experimental chip has been fabricated in 180-nm 6-metal process. The power consumption of this chip is 6.3 fJ/bit/search, which is half of the conventional scheme.
Harufusa KONDOH Hiromi NOTANI Tsutomu YOSHIMURA Hiroshi SHIBATA Yoshio MATSUDA
A new approach which implements a simple, high-speed phase detector with precharge logic will be presented. The minimum detectable phase difference is 40 psec, which is less than a half of conventional detectors. A current mode ring oscillator with a complementary-input bias generator has also been developed to enhance the dynamic range of the VCO under a low supply voltage. A fully CMOS PLL was designed using 0.5-µm technology. By virtue of this simple, fast detector, the wide operation range of 250 MHz at 1.5 V to 622 MHz at 3.0 V was achieved by simulation.
Yu SUZUKI Masato ITO Satoshi KANDA Kousuke IMAMURA Yoshio MATSUDA Tetsuya MATSUMURA
This paper describes the design and implementation of a real-time optical flow processor using a single field-programmable gate array (FPGA) chip. By introducing the modified initial flow generation method, the successive over-relaxation (SOR) method for both layers, the optimization of the reciprocal operation method, and the image division method, it is now possible to both reduce hardware requirements and improve flow accuracy. Additionally, by introducing a pipeline structure to this processor, high-throughput hardware implementation could be achieved. Total logic cell (LC) amounts and processer memory capacity are reduced by about 8% and 16%, respectively, compared to our previous hierarchical optical flow estimation (HOE) processor. The results of our evaluation confirm that this processor can perform 30 fps wide extended graphics array (WXGA) 175.7MHz real-time optical flow processing with a single FPGA.
Harufusa KONDOH Hideaki YAMANAKA Masahiko ISHIWAKI Yoshio MATSUDA Masao NAKAYA
A new approach to implement queues for controlling ATM switch LSI is presented. In many conventional architecture, external FIFOs are provided for each output link and used to manage the address of the buffer in an ATM switch. We reduce the number of FIFOs by using a self-timed queue with a search circuit that finds the earliest entry for each output link. Using this architecture, number of the FIFOs is reduced to 1/N, where N is the switch size. Delay priority and multicasting can be supported without doubling the number of the queues. This new queue can also be utilized as an ATM switch by itself. Evaluation chip was fabricated using 0.5-µm CMOS process technology. Inter-stage transfer speed over 500 MHz and cycle time over 125 MHz was obtained. This performance is enough for a 622-Mbps 1616 ATM Switch.
Reo AOKI Kousuke IMAMURA Akihiro HIRANO Yoshio MATSUDA
Recently, Super-resolution convolutional neural network (SRCNN) is widely known as a state of the art method for achieving single-image super resolution. However, performance problems such as jaggy and ringing artifacts exist in SRCNN. Moreover, in order to realize a real-time upconverting system for high-resolution video streams such as 4K/8K 60 fps, problems such as processing delay and implementation cost remain. In the present paper, we propose high-performance super-resolution via patch-based deep neural network (SR-PDNN) rather than a convolutional neural network (CNN). Despite the very simple end-to-end learning system, the SR-PDNN achieves higher performance than the conventional CNN-based approach. In addition, this system is suitable for ultra-low-delay video processing by hardware implementation using an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
Harufusa KONDOH Hiromi NOTANI Hideaki YAMANAKA Keiichi HIGASHITANI Hirotaka SAITO Isamu HAYASHI Yoshio MATSUDA Kazuyoshi OSHIMA Masao NAKAYA
A new shared multibuffer architecture for high-speed ATM (Asynchronous Transfer Mode) switch LSIs is described. Multiple buffer memories are located between two crosspoint switches. By controlling the input-side crosspoint switch so as to equalize the utilization rate of each buffer memory, these multiple buffer memories can be recognized as a single large shared buffer memory. High utilization efficiency of buffer memory can thus be achieved, and the cell loss ratio is minimized. By accessing the buffer memories in parallel via crosspoint switches, the time required to access the buffer memories is greatly reduced. This feature enables high-speed operation of the switch. The shared multibuffer architecture was implemented in a switch LSI using 0.8-µm BiCMOS process technology. Experimental results revealed that this chip can operate at more than 125 MHz. Bit-sliced eight switch LSIs operating at 78 MHz construct a 622-Mb/s 88 ATM switching system with a buffer size of 1,024 ATM cells. Power consumption of the switch LSI was 3 W.
Hiromi NOTANI Masayuki KOYAMA Ryuji MANO Hiroshi MAKINO Yoshio MATSUDA Osamu TOMISAWA Shuhei IWADE
A 64-bit 100-MHz multimedia DSP core has been designed using 0.15-µ m CMOS technology. An improved Auto-Backgate-Controlled MT-CMOS (ABC-MT-CMOS) circuit with a charge pump is adopted to suppress the standby leakage current. The dynamic active current of whole chip was simulated to optimize the size of a switch for a power supply control. The DSP core chip, which integrates 300-kgate Logic, 64-kbyte SRAM and charge pump circuit, has 8-µ A standby leakage current. The reduction rate is 1/250.
Yasunobu NAKASE Hiroyuki KONO Yoshio MATSUDA Hisanori HAMANO
Cursor RAMs have been composed of two memory planes. A cursor pattern is stored in these planes with 2-bit data depth. While the pixel port requires data from both planes at the same time, the MPU port accesses either one of the planes at a time. Since the address space is defined differently between the ports, conventional cursor RAMs could not have dealt with these different access ways at real time. This paper proposes a dual port cursor RAM with a dynamic data alignment architecture. The architecture processes the different access ways at real time, and reduces a large amount of control circuitry. Conventional cursor RAMs have been organized with a single port memory because dual port memory cells have been large. We have applied the port swap architecture which has reduced the cell size. The control block is further simplified because the controller no longer emulate a dual port memory. The cursor RAM with these architectures is fabricated with a double metal 0. 5 µm CMOS process technology. The active area is 1. 51. 6 mm2 including a couple of shift registers and a control block. It operates up to 263 MHz at the supply voltage of 3. 3 V.
Masanori HAYASHIKOSHI Hiroaki TANIZAKI Yasumitsu MURAI Takaharu TSUJI Kiyoshi KAWABATA Koji NII Hideyuki NODA Hiroyuki KONDO Yoshio MATSUDA Hideto HIDAKA
A 1-Transistor 4-Magnetic Tunnel Junction (1T-4MTJ) memory cell has been proposed for field type of Magnetic Random Access Memory (MRAM). Proposed 1T-4MTJ memory cell array is achieved 44% higher density than that of conventional 1T-1MTJ thanks to the common access transistor structure in a 4-bit memory cell. A self-reference sensing scheme which can read out with write-back in four clock cycles has been also proposed. Furthermore, we add to estimate with considering sense amplifier variation and show 1T-4MTJ cell configuration is the best solution in IoT applications. A 1-Mbit MRAM test chip is designed and fabricated successfully using 130-nm CMOS process. By applying 1T-4MTJ high density cell and partially embedded wordline driver peripheral into the cell array, the 1-Mbit macro size is 4.04 mm2 which is 35.7% smaller than the conventional one. Measured data shows that the read access is 55 ns at 1.5 V typical supply voltage and 25C. Combining with conventional high-speed 1T-1MTJ caches and proposed high-density 1T-4MTJ user memories is an effective on-chip hierarchical non-volatile memory solution, being implemented for low-power MCUs and SoCs of IoT applications.
Hidehiro TAKATA Rei AKIYAMA Tadao YAMANAKA Haruyuki OHKUMA Yasue SUETSUGU Toshihiro KANAOKA Satoshi KUMAKI Kazuya ISHIHARA Atsuo HANAMI Tetsuya MATSUMURA Tetsuya WATANABE Yoshihide AJIOKA Yoshio MATSUDA Syuhei IWADE
An on-chip, 64-Mb, embedded, DRAM MPEG-2 encoder LSI with a multimedia processor has been developed. To implement this large-scale and high-speed LSI, we have developed the hierarchical skew control of multi-clocks, with timing verification, in which cross-talk noise is considered, and simple measures taken against the IR drop in the power lines through decoupling capacitors. As a result, the target performance of 263 MHz at 1.5 V has been successfully attained and verified, the cross-talk noise has been considered, and, in addition, it has become possible to restrain the IR drop to 166 mV in the 162 MHz operation block.
Hideaki YAMANAKA Hirotaka SAITO Hirotoshi YAMADA Harufusa KONDOH Hiromi NOTANI Yoshio MATSUDA Kazuyoshi OSHIMA
A new ATM switch architecture, named shared multibuffering, features great advantages on memory access speed for a large switch, and overall size of buffer memories to achieve excellent cell-loss performance. We have developed a 622-Mb/s 88 shared multibuffer ATM switch with multicast functions and hierarchical queueing functions to accommodate 156-Mb/s, 622-Mb/s and 2.4-Gb/s interfaces. Implementation of the shared multibuffer ATM switch is described with respect to the four sorts of 0.8-µm BiCMOS LSIs and ATM switch boards. The switch board/type-1, with C1-LSI, allows to accommodate effectively 156-Mb/s and 622-Mb/s interfaces, which is suitable for an ATM access system. The switch board/type-2, with C2-LSI, can provide multicast functions and accommodate a 2.4-Gb/s interface. By using four switch boards, it is possible to apply them to a 2.4-Gb/s ATM loop system.
Mikio ASAKURA Yoshio MATSUDA Katsuhiro TSUKAMOTO Kazuyasu FUJISHIMA Tsutomu YOSHIHARA
This letter reports a charge collection experiment of alpha-particle-induced carriers in the cell arrays of the 1 Mb DRAM. It is indicated that this experiment is effective to estimate the soft error rate of VLSI memories with various kinds of structures.
Masayuki MIYAMA Yuusuke INOIE Takafumi KASUGA Ryouichi INADA Masashi NAKAO Yoshio MATSUDA
This paper describes a 158 MS/s JPEG 2000 codec with an embedded block coder (EBC) based on bit-plane and pass-parallel architecture. The EBC contains bit-plane coders (BPCs) corresponding to each bit-plane in a code-block. The upper and the lower bit-plane coding overlap in time with a 1-stripe and 1-column gap. The bit-modeling passes in the bit-plane coding also overlap in time with the same gap. These methods increase throughput by 30 times in comparison with the conventional. In addition, the methods support not only vertically causal mode, but also regular mode, which enhances the image quality. Furthermore, speculative decoding is adopted to increase throughput. This codec LSI was designed using 0.18 µm process. The core area is 4.74.7 mm2 and the frequency is 160 MHz. A system including the codec enables image transmission of PC desktop with 8 ms delay.
Kousuke IMAMURA Ryota HONDA Yoshifumi KAWAMURA Naoki MIURA Masami URANO Satoshi SHIGEMATSU Tetsuya MATSUMURA Yoshio MATSUDA
The development of an extremely efficient packet inspection algorithm for lookup engines is important in order to realize high throughput and to lower energy dissipation. In this paper, we propose a new lookup engine based on a combination of a mismatch detection circuit and a linked-list hash table. The engine has an automatic rule registration and deletion function; the results are that it is only necessary to input rules, and the various tables included in the circuits, such as the Mismatch Table, Index Table, and Rule Table, will be automatically configured using the embedded hardware. This function utilizes a match/mismatch assessment for normal packet inspection operations. An experimental chip was fabricated using 40-nm 8-metal CMOS process technology. The chip operates at a frequency of 100MHz under a power supply voltage of VDD =1.1V. A throughput of 100Mpacket/s (=51.2Gb/s) is obtained at an operating frequency of 100MHz, which is three times greater than the throughput of 33Mpacket/s obtained with a conventional lookup engine without a mismatch detection circuit. The measured energy dissipation was a 1.58pJ/b·Search.
Hideto HIDAKA Yoshio MATSUDA Kazuyasu FUJISHIMA
A new DRAM bitline architecture, called Divided/Pausing Bitline Sensing Scheme (DIPS), is proposed for DRAM core of 64 Mbit level and beyond. This architecture eliminates the inter-bitline coupling noise and realizes a high speed sensing operation.
Tsukasa OOISHI Masaki TSUKUDE Kazutani ARIMOTO Yoshio MATSUDA Kazuyasu FUJISHIMA
We propose an advanced hyper parallel testing method which improves the line-mode test method by adding data inversion registers which we call the Advanced Line-mode Test (ALT). This testing method has the same testing capability as the conventional bit-by-bit and multi-bit test method (MBT), because it enables the application of a high sensitive and practical test patterns under the hyper parallel condition. The testing time for fixed data patterns are reduced by 1/1900 (all-0/1, checker board, and etc.). Moreover, the ALT can be applicable to the continuous patterns (march, walking, and etc.). The ALT improved from the line-mode test with registers and comparators (LTR) is able to applicable to the most test patterns and to reduce the testing time remarkably, and is suitable for the ULSI memories.
Yuichiro MURACHI Yuki FUKUYAMA Ryo YAMAMOTO Junichi MIYAKOSHI Hiroshi KAWAGUCHI Hajime ISHIHARA Masayuki MIYAMA Yoshio MATSUDA Masahiko YOSHIMOTO
This paper describes an optical-flow processor core for real-time video recognition. The processor is based on the Pyramidal Lucas and Kanade (PLK) algorithm. It features a smaller chip area, higher pixel rate, and higher accuracy than conventional optical-flow processors. Introduction of search range limitation and the Carman filter to the original PLK algorithm improve the optical-flow accuracy, and reduce the processor hardware cost. Furthermore, window interleaving and window overlap methods reduces the necessary clock frequency of the processor by 70%, allowing low-power characteristics. We first verified the PLK algorithm and architecture with a proto-typed FPGA implementation. Then, we designed a VLSI processor that can handle a VGA 30-fps image sequence at a clock frequency of 332 MHz. The core size and power consumption are estimated at 3.503.00 mm2 and 600 mW, respectively, in a 90-nm process technology.
Yoshiki YUNBE Masayuki MIYAMA Yoshio MATSUDA
This paper describes an affine motion estimation processor for real-time video segmentation. The processor estimates the dominant motion of a target region with affine parameters. The processor is based on the Pseudo-M-estimator algorithm. Introduction of an image division method and a binary weight method to the original algorithm reduces data traffic and hardware costs. A pixel sampling method is proposed that reduces the clock frequency by 50%. The pixel pipeline architecture and a frame overlap method double throughput. The processor was prototyped on an FPGA; its function and performance were subsequently verified. It was also implemented as an ASIC. The core size is 5.05.0 mm2 in 0.18 µm process, standard cell technology. The ASIC can accommodate a VGA 30 fps video with 120 MHz clock frequency.
Yoshifumi KAWAMURA Naoya OKADA Yoshio MATSUDA Tetsuya MATSUMURA Hiroshi MAKINO Kazutami ARIMOTO
A Field Programmable Sequencer and Memory (FPSM), which is a programmable unit exclusively optimized for peripherals on a micro controller unit, is proposed. The FPSM functions as not only the peripherals but also the standard built-in memory. The FPSM provides easier programmability with a smaller area overhead, especially when compared with the FPGA. The FPSM is implemented on the FPGA and the programmability and performance for basic peripherals such as the 8 bit counter and 8 bit accuracy Pulse Width Modulation are emulated on the FPGA. Furthermore, the FPSM core with a 4K bit SRAM is fabricated in 0.18µm 5 metal CMOS process technology. The FPSM is an half the area of FPGA, its power consumption is less than one-fifth.
Hiroaki SUZUKI Hiroyuki KAWAI Hiroshi MAKINO Yoshio MATSUDA
A VLIW (Very Long Instruction Word) architecture with a new code compaction method has been proposed. For a 3D-geometry processor, we consider two types of 2-issue VLIW architectures, the floating-point execution accelerating VLIW (FP-VLIW) and the data-move enhancing VLIW (MV-VLIW) architectures, as expansions of a Single-Streaming Single Instruction, Multiple Data (SS-SIMD) architecture. To solve the code bloat problem which is common to VLIW architectures, the proposed method makes it possible to compact original codes into the VLIW codes by software tools and decompact the VLIW codes by a simple hardware decompactor composed of an instruction swap circuit on a chip. Speeds and code densities of the two VLIWs with the code compaction are compared to the SS-SIMD with the same instruction set and the same building blocks. The FP-VLIW shows the fastest speed performance in the evaluation results of the viewperf CDRS-03 benchmark programs. It is 36% faster than the SS-SIMD used as reference. The proposed compaction method keeps the 95% code density of the SS-SIMD. One test program shows that the code density of the MV-VLIW is higher than that of the SS-SIMD. This result demonstrates that the merit of compacting nops can be greater than the VLIW penalty. The FP-VLIW architecture with the code compaction achieves 1.36 times the speed performance without significant code-density deterioration.