Author Search Result

[Author] Yoshio MATSUDA(22hit)

1-20hit(22hit)

  • A Complete Charge Recycling TCAM with Checkerboard Array Arrangement for Low Power Applications

    Katsumi DOSAKA  Daisuke OGAWA  Takahito KUSUMOTO  Masayuki MIYAMA  Yoshio MATSUDA  

     
    PAPER-Integrated Electronics

      Vol:
    E93-C No:5
      Page(s):
    685-695

    Architecture of a low power Ternary Content Addressable Memory (TCAM) is proposed. The TCAM is a powerful engine for search and sort processing, but it has two serious problems, large power consumption and large power line noise. To solve these problems, we have developed a charge recycling scheme for match lines and search lines. A combination of the newly introduced PMOS CAM cell together with the conventional NMOS CAM cell realizes match line charge recycling. A checkerboard arrangement of the NMOS and the PMOS cell array enables search line charge recycling. By using these technologies, the power consumption of the TCAM can be reduced to 50% of conventional designs, and as a result, the power line noise is also reduced. An experimental chip has been fabricated in 180-nm 6-metal process. The power consumption of this chip is 6.3 fJ/bit/search, which is half of the conventional scheme.

  • A 1.5-V 250-MHz to 3.0-V 622-MHz Operation CMOS Phase-Locked Loop with Precharge Type Phase-Frequency Detector

    Harufusa KONDOH  Hiromi NOTANI  Tsutomu YOSHIMURA  Hiroshi SHIBATA  Yoshio MATSUDA  

     
    PAPER-Digital Circuits

      Vol:
    E78-C No:4
      Page(s):
    381-388

    A new approach which implements a simple, high-speed phase detector with precharge logic will be presented. The minimum detectable phase difference is 40 psec, which is less than a half of conventional detectors. A current mode ring oscillator with a complementary-input bias generator has also been developed to enhance the dynamic range of the VCO under a low supply voltage. A fully CMOS PLL was designed using 0.5-µm technology. By virtue of this simple, fast detector, the wide operation range of 250 MHz at 1.5 V to 622 MHz at 3.0 V was achieved by simulation.

  • Design and Implementation of 176-MHz WXGA 30-fps Real-Time Optical Flow Processor

    Yu SUZUKI  Masato ITO  Satoshi KANDA  Kousuke IMAMURA  Yoshio MATSUDA  Tetsuya MATSUMURA  

     
    PAPER

      Vol:
    E100-A No:12
      Page(s):
    2888-2900

    This paper describes the design and implementation of a real-time optical flow processor using a single field-programmable gate array (FPGA) chip. By introducing the modified initial flow generation method, the successive over-relaxation (SOR) method for both layers, the optimization of the reciprocal operation method, and the image division method, it is now possible to both reduce hardware requirements and improve flow accuracy. Additionally, by introducing a pipeline structure to this processor, high-throughput hardware implementation could be achieved. Total logic cell (LC) amounts and processer memory capacity are reduced by about 8% and 16%, respectively, compared to our previous hierarchical optical flow estimation (HOE) processor. The results of our evaluation confirm that this processor can perform 30 fps wide extended graphics array (WXGA) 175.7MHz real-time optical flow processing with a single FPGA.

  • An Efficient Self-Timed Queue Architecture for ATM Switch LSIs

    Harufusa KONDOH  Hideaki YAMANAKA  Masahiko ISHIWAKI  Yoshio MATSUDA  Masao NAKAYA  

     
    PAPER-Multimedia System LSIs

      Vol:
    E77-C No:12
      Page(s):
    1865-1872

    A new approach to implement queues for controlling ATM switch LSI is presented. In many conventional architecture, external FIFOs are provided for each output link and used to manage the address of the buffer in an ATM switch. We reduce the number of FIFOs by using a self-timed queue with a search circuit that finds the earliest entry for each output link. Using this architecture, number of the FIFOs is reduced to 1/N, where N is the switch size. Delay priority and multicasting can be supported without doubling the number of the queues. This new queue can also be utilized as an ATM switch by itself. Evaluation chip was fabricated using 0.5-µm CMOS process technology. Inter-stage transfer speed over 500 MHz and cycle time over 125 MHz was obtained. This performance is enough for a 622-Mbps 1616 ATM Switch.

  • High-Performance Super-Resolution via Patch-Based Deep Neural Network for Real-Time Implementation

    Reo AOKI  Kousuke IMAMURA  Akihiro HIRANO  Yoshio MATSUDA  

     
    PAPER-Image Processing and Video Processing

      Pubricized:
    2018/08/20
      Vol:
    E101-D No:11
      Page(s):
    2808-2817

    Recently, Super-resolution convolutional neural network (SRCNN) is widely known as a state of the art method for achieving single-image super resolution. However, performance problems such as jaggy and ringing artifacts exist in SRCNN. Moreover, in order to realize a real-time upconverting system for high-resolution video streams such as 4K/8K 60 fps, problems such as processing delay and implementation cost remain. In the present paper, we propose high-performance super-resolution via patch-based deep neural network (SR-PDNN) rather than a convolutional neural network (CNN). Despite the very simple end-to-end learning system, the SR-PDNN achieves higher performance than the conventional CNN-based approach. In addition, this system is suitable for ultra-low-delay video processing by hardware implementation using an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).

  • A Shared Multibuffer Architecture for High-Speed ATM Switch LSIs

    Harufusa KONDOH  Hiromi NOTANI  Hideaki YAMANAKA  Keiichi HIGASHITANI  Hirotaka SAITO  Isamu HAYASHI  Yoshio MATSUDA  Kazuyoshi OSHIMA  Masao NAKAYA  

     
    PAPER-Improved Binary Digital Architectures

      Vol:
    E76-C No:7
      Page(s):
    1094-1101

    A new shared multibuffer architecture for high-speed ATM (Asynchronous Transfer Mode) switch LSIs is described. Multiple buffer memories are located between two crosspoint switches. By controlling the input-side crosspoint switch so as to equalize the utilization rate of each buffer memory, these multiple buffer memories can be recognized as a single large shared buffer memory. High utilization efficiency of buffer memory can thus be achieved, and the cell loss ratio is minimized. By accessing the buffer memories in parallel via crosspoint switches, the time required to access the buffer memories is greatly reduced. This feature enables high-speed operation of the switch. The shared multibuffer architecture was implemented in a switch LSI using 0.8-µm BiCMOS process technology. Experimental results revealed that this chip can operate at more than 125 MHz. Bit-sliced eight switch LSIs operating at 78 MHz construct a 622-Mb/s 88 ATM switching system with a buffer size of 1,024 ATM cells. Power consumption of the switch LSI was 3 W.

  • A Low Standby Current DSP Core Using Improved ABC-MT-CMOS with Charge Pump Circuit

    Hiromi NOTANI  Masayuki KOYAMA  Ryuji MANO  Hiroshi MAKINO  Yoshio MATSUDA  Osamu TOMISAWA  Shuhei IWADE  

     
    PAPER-Circuit Design

      Vol:
    E86-C No:4
      Page(s):
    597-603

    A 64-bit 100-MHz multimedia DSP core has been designed using 0.15-µ m CMOS technology. An improved Auto-Backgate-Controlled MT-CMOS (ABC-MT-CMOS) circuit with a charge pump is adopted to suppress the standby leakage current. The dynamic active current of whole chip was simulated to optimize the size of a switch for a power supply control. The DSP core chip, which integrates 300-kgate Logic, 64-kbyte SRAM and charge pump circuit, has 8-µ A standby leakage current. The reduction rate is 1/250.

  • A 250 MHz Dual Port Cursor RAM Using Dynamic Data Alignment Architecture

    Yasunobu NAKASE  Hiroyuki KONO  Yoshio MATSUDA  Hisanori HAMANO  

     
    PAPER-Electronic Circuits

      Vol:
    E81-C No:11
      Page(s):
    1750-1756

    Cursor RAMs have been composed of two memory planes. A cursor pattern is stored in these planes with 2-bit data depth. While the pixel port requires data from both planes at the same time, the MPU port accesses either one of the planes at a time. Since the address space is defined differently between the ports, conventional cursor RAMs could not have dealt with these different access ways at real time. This paper proposes a dual port cursor RAM with a dynamic data alignment architecture. The architecture processes the different access ways at real time, and reduces a large amount of control circuitry. Conventional cursor RAMs have been organized with a single port memory because dual port memory cells have been large. We have applied the port swap architecture which has reduced the cell size. The control block is further simplified because the controller no longer emulate a dual port memory. The cursor RAM with these architectures is fabricated with a double metal 0. 5 µm CMOS process technology. The active area is 1. 51. 6 mm2 including a couple of shift registers and a control block. It operates up to 263 MHz at the supply voltage of 3. 3 V.

  • A Cost-Effective 1T-4MTJ Embedded MRAM Architecture with Voltage Offset Self-Reference Sensing Scheme for IoT Applications

    Masanori HAYASHIKOSHI  Hiroaki TANIZAKI  Yasumitsu MURAI  Takaharu TSUJI  Kiyoshi KAWABATA  Koji NII  Hideyuki NODA  Hiroyuki KONDO  Yoshio MATSUDA  Hideto HIDAKA  

     
    PAPER

      Vol:
    E102-C No:4
      Page(s):
    287-295

    A 1-Transistor 4-Magnetic Tunnel Junction (1T-4MTJ) memory cell has been proposed for field type of Magnetic Random Access Memory (MRAM). Proposed 1T-4MTJ memory cell array is achieved 44% higher density than that of conventional 1T-1MTJ thanks to the common access transistor structure in a 4-bit memory cell. A self-reference sensing scheme which can read out with write-back in four clock cycles has been also proposed. Furthermore, we add to estimate with considering sense amplifier variation and show 1T-4MTJ cell configuration is the best solution in IoT applications. A 1-Mbit MRAM test chip is designed and fabricated successfully using 130-nm CMOS process. By applying 1T-4MTJ high density cell and partially embedded wordline driver peripheral into the cell array, the 1-Mbit macro size is 4.04 mm2 which is 35.7% smaller than the conventional one. Measured data shows that the read access is 55 ns at 1.5 V typical supply voltage and 25C. Combining with conventional high-speed 1T-1MTJ caches and proposed high-density 1T-4MTJ user memories is an effective on-chip hierarchical non-volatile memory solution, being implemented for low-power MCUs and SoCs of IoT applications.

  • Physical Design Methodology for On-Chip 64-Mb DRAM MPEG-2 Encoding with a Multimedia Processor

    Hidehiro TAKATA  Rei AKIYAMA  Tadao YAMANAKA  Haruyuki OHKUMA  Yasue SUETSUGU  Toshihiro KANAOKA  Satoshi KUMAKI  Kazuya ISHIHARA  Atsuo HANAMI  Tetsuya MATSUMURA  Tetsuya WATANABE  Yoshihide AJIOKA  Yoshio MATSUDA  Syuhei IWADE  

     
    PAPER-Product Designs

      Vol:
    E85-C No:2
      Page(s):
    368-374

    An on-chip, 64-Mb, embedded, DRAM MPEG-2 encoder LSI with a multimedia processor has been developed. To implement this large-scale and high-speed LSI, we have developed the hierarchical skew control of multi-clocks, with timing verification, in which cross-talk noise is considered, and simple measures taken against the IR drop in the power lines through decoupling capacitors. As a result, the target performance of 263 MHz at 1.5 V has been successfully attained and verified, the cross-talk noise has been considered, and, in addition, it has become possible to restrain the IR drop to 166 mV in the 162 MHz operation block.

  • Shared Multibuffer ATM Switches with Hierarchical Queueing and Multicast Functions

    Hideaki YAMANAKA  Hirotaka SAITO  Hirotoshi YAMADA  Harufusa KONDOH  Hiromi NOTANI  Yoshio MATSUDA  Kazuyoshi OSHIMA  

     
    PAPER-Switching and Communication Processing

      Vol:
    E79-B No:8
      Page(s):
    1109-1120

    A new ATM switch architecture, named shared multibuffering, features great advantages on memory access speed for a large switch, and overall size of buffer memories to achieve excellent cell-loss performance. We have developed a 622-Mb/s 88 shared multibuffer ATM switch with multicast functions and hierarchical queueing functions to accommodate 156-Mb/s, 622-Mb/s and 2.4-Gb/s interfaces. Implementation of the shared multibuffer ATM switch is described with respect to the four sorts of 0.8-µm BiCMOS LSIs and ATM switch boards. The switch board/type-1, with C1-LSI, allows to accommodate effectively 156-Mb/s and 622-Mb/s interfaces, which is suitable for an ATM access system. The switch board/type-2, with C2-LSI, can provide multicast functions and accommodate a 2.4-Gb/s interface. By using four switch boards, it is possible to apply them to a 2.4-Gb/s ATM loop system.

  • Mechanism of Bit Line Mode Soft Error for DRAM

    Mikio ASAKURA  Yoshio MATSUDA  Katsuhiro TSUKAMOTO  Kazuyasu FUJISHIMA  Tsutomu YOSHIHARA  

     
    LETTER-Semiconductor Devices

      Vol:
    E70-E No:11
      Page(s):
    1060-1061

    This letter reports a charge collection experiment of alpha-particle-induced carriers in the cell arrays of the 1 Mb DRAM. It is indicated that this experiment is effective to estimate the soft error rate of VLSI memories with various kinds of structures.

  • A 158 MS/s JPEG 2000 Codec with a Bit-Plane and Pass Parallel Embedded Block Coder for Low Delay Image Transmission

    Masayuki MIYAMA  Yuusuke INOIE  Takafumi KASUGA  Ryouichi INADA  Masashi NAKAO  Yoshio MATSUDA  

     
    PAPER

      Vol:
    E91-A No:8
      Page(s):
    2025-2034

    This paper describes a 158 MS/s JPEG 2000 codec with an embedded block coder (EBC) based on bit-plane and pass-parallel architecture. The EBC contains bit-plane coders (BPCs) corresponding to each bit-plane in a code-block. The upper and the lower bit-plane coding overlap in time with a 1-stripe and 1-column gap. The bit-modeling passes in the bit-plane coding also overlap in time with the same gap. These methods increase throughput by 30 times in comparison with the conventional. In addition, the methods support not only vertically causal mode, but also regular mode, which enhances the image quality. Furthermore, speculative decoding is adopted to increase throughput. This codec LSI was designed using 0.18 µm process. The core area is 4.74.7 mm2 and the frequency is 160 MHz. A system including the codec enables image transmission of PC desktop with 8 ms delay.

  • A 100-MHz 51.2-Gb/s Packet Lookup Engine with Automatic Table Update Function

    Kousuke IMAMURA  Ryota HONDA  Yoshifumi KAWAMURA  Naoki MIURA  Masami URANO  Satoshi SHIGEMATSU  Tetsuya MATSUMURA  Yoshio MATSUDA  

     
    PAPER-Communication Theory and Signals

      Vol:
    E100-A No:10
      Page(s):
    2123-2134

    The development of an extremely efficient packet inspection algorithm for lookup engines is important in order to realize high throughput and to lower energy dissipation. In this paper, we propose a new lookup engine based on a combination of a mismatch detection circuit and a linked-list hash table. The engine has an automatic rule registration and deletion function; the results are that it is only necessary to input rules, and the various tables included in the circuits, such as the Mismatch Table, Index Table, and Rule Table, will be automatically configured using the embedded hardware. This function utilizes a match/mismatch assessment for normal packet inspection operations. An experimental chip was fabricated using 40-nm 8-metal CMOS process technology. The chip operates at a frequency of 100MHz under a power supply voltage of VDD =1.1V. A throughput of 100Mpacket/s (=51.2Gb/s) is obtained at an operating frequency of 100MHz, which is three times greater than the throughput of 33Mpacket/s obtained with a conventional lookup engine without a mismatch detection circuit. The measured energy dissipation was a 1.58pJ/b·Search.

  • A Divided/Pausing Bitline Sensing Scheme (DIPS) for ULSI DRAM Core

    Hideto HIDAKA  Yoshio MATSUDA  Kazuyasu FUJISHIMA  

     
    LETTER-Integrated Circuits

      Vol:
    E73-E No:11
      Page(s):
    1852-1854

    A new DRAM bitline architecture, called Divided/Pausing Bitline Sensing Scheme (DIPS), is proposed for DRAM core of 64 Mbit level and beyond. This architecture eliminates the inter-bitline coupling noise and realizes a high speed sensing operation.

  • A Line-Mode Test with Data Register for ULSI Memory Architecture

    Tsukasa OOISHI  Masaki TSUKUDE  Kazutani ARIMOTO  Yoshio MATSUDA  Kazuyasu FUJISHIMA  

     
    PAPER-DRAM

      Vol:
    E76-C No:11
      Page(s):
    1595-1603

    We propose an advanced hyper parallel testing method which improves the line-mode test method by adding data inversion registers which we call the Advanced Line-mode Test (ALT). This testing method has the same testing capability as the conventional bit-by-bit and multi-bit test method (MBT), because it enables the application of a high sensitive and practical test patterns under the hyper parallel condition. The testing time for fixed data patterns are reduced by 1/1900 (all-0/1, checker board, and etc.). Moreover, the ALT can be applicable to the continuous patterns (march, walking, and etc.). The ALT improved from the line-mode test with registers and comparators (LTR) is able to applicable to the most test patterns and to reduce the testing time remarkably, and is suitable for the ULSI memories.

  • A VGA 30-fps Realtime Optical-Flow Processor Core for Moving Picture Recognition

    Yuichiro MURACHI  Yuki FUKUYAMA  Ryo YAMAMOTO  Junichi MIYAKOSHI  Hiroshi KAWAGUCHI  Hajime ISHIHARA  Masayuki MIYAMA  Yoshio MATSUDA  Masahiko YOSHIMOTO  

     
    PAPER

      Vol:
    E91-C No:4
      Page(s):
    457-464

    This paper describes an optical-flow processor core for real-time video recognition. The processor is based on the Pyramidal Lucas and Kanade (PLK) algorithm. It features a smaller chip area, higher pixel rate, and higher accuracy than conventional optical-flow processors. Introduction of search range limitation and the Carman filter to the original PLK algorithm improve the optical-flow accuracy, and reduce the processor hardware cost. Furthermore, window interleaving and window overlap methods reduces the necessary clock frequency of the processor by 70%, allowing low-power characteristics. We first verified the PLK algorithm and architecture with a proto-typed FPGA implementation. Then, we designed a VLSI processor that can handle a VGA 30-fps image sequence at a clock frequency of 332 MHz. The core size and power consumption are estimated at 3.503.00 mm2 and 600 mW, respectively, in a 90-nm process technology.

  • A VGA 30 fps Affine Motion Model Estimation VLSI for Real-Time Video Segmentation

    Yoshiki YUNBE  Masayuki MIYAMA  Yoshio MATSUDA  

     
    PAPER-Computer System

      Vol:
    E93-D No:12
      Page(s):
    3284-3293

    This paper describes an affine motion estimation processor for real-time video segmentation. The processor estimates the dominant motion of a target region with affine parameters. The processor is based on the Pseudo-M-estimator algorithm. Introduction of an image division method and a binary weight method to the original algorithm reduces data traffic and hardware costs. A pixel sampling method is proposed that reduces the clock frequency by 50%. The pixel pipeline architecture and a frame overlap method double throughput. The processor was prototyped on an FPGA; its function and performance were subsequently verified. It was also implemented as an ASIC. The core size is 5.05.0 mm2 in 0.18 µm process, standard cell technology. The ASIC can accommodate a VGA 30 fps video with 120 MHz clock frequency.

  • A Field Programmable Sequencer and Memory with Middle Grained Programmability Optimized for MCU Peripherals

    Yoshifumi KAWAMURA  Naoya OKADA  Yoshio MATSUDA  Tetsuya MATSUMURA  Hiroshi MAKINO  Kazutami ARIMOTO  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E99-A No:5
      Page(s):
    917-928

    A Field Programmable Sequencer and Memory (FPSM), which is a programmable unit exclusively optimized for peripherals on a micro controller unit, is proposed. The FPSM functions as not only the peripherals but also the standard built-in memory. The FPSM provides easier programmability with a smaller area overhead, especially when compared with the FPGA. The FPSM is implemented on the FPGA and the programmability and performance for basic peripherals such as the 8 bit counter and 8 bit accuracy Pulse Width Modulation are emulated on the FPGA. Furthermore, the FPSM core with a 4K bit SRAM is fabricated in 0.18µm 5 metal CMOS process technology. The FPSM is an half the area of FPGA, its power consumption is less than one-fifth.

  • Novel VLIW Code Compaction Method for a 3D Geometry Processor

    Hiroaki SUZUKI  Hiroyuki KAWAI  Hiroshi MAKINO  Yoshio MATSUDA  

     
    PAPER-Digital Signal Processing

      Vol:
    E84-A No:11
      Page(s):
    2885-2893

    A VLIW (Very Long Instruction Word) architecture with a new code compaction method has been proposed. For a 3D-geometry processor, we consider two types of 2-issue VLIW architectures, the floating-point execution accelerating VLIW (FP-VLIW) and the data-move enhancing VLIW (MV-VLIW) architectures, as expansions of a Single-Streaming Single Instruction, Multiple Data (SS-SIMD) architecture. To solve the code bloat problem which is common to VLIW architectures, the proposed method makes it possible to compact original codes into the VLIW codes by software tools and decompact the VLIW codes by a simple hardware decompactor composed of an instruction swap circuit on a chip. Speeds and code densities of the two VLIWs with the code compaction are compared to the SS-SIMD with the same instruction set and the same building blocks. The FP-VLIW shows the fastest speed performance in the evaluation results of the viewperf CDRS-03 benchmark programs. It is 36% faster than the SS-SIMD used as reference. The proposed compaction method keeps the 95% code density of the SS-SIMD. One test program shows that the code density of the MV-VLIW is higher than that of the SS-SIMD. This result demonstrates that the merit of compacting nops can be greater than the VLIW penalty. The FP-VLIW architecture with the code compaction achieves 1.36 times the speed performance without significant code-density deterioration.

1-20hit(22hit)

FlyerIEICE has prepared a flyer regarding multilingual services. Please use the one in your native language.