Akira FUJIMAKI Daiki HASEGAWA Yuto TAKESHITA Feng LI Taro YAMASHITA Masamitsu TANAKA
Yihao WANG Jianguo XI Chengwei XIE
Feng TIAN Zhongyuan ZHOU Guihua WANG Lixiang WANG
Yukihiro SUZUKI Mana SAKAMOTO Taiyou NAGASHIMA Yosuke MIZUNO Heeyoung LEE
Yo KUMANO Tetsuya IIZUKA
Wisansaya JAIKEANDEE Chutiparn LERTVACHIRAPAIBOON Dechnarong PIMALAI Kazunari SHINBO Keizo KATO Akira BABA
Satomitsu Imai Shoya Ishii Nanako Itaya
Satomitsu Imai Takekusu Muraoka Kaito Tsujioka
Takahide Mizuno Hirokazu Ikeda Hiroki Senshu Toru Nakura Kazuhiro Umetani Akihiro Konishi Akihito Ogawa Kaito Kasai Kosuke Kawahara
Yongshan Hu Rong Jin Yukai Lin Shunmin Wu Tianting Zhao Yidong Yuan
Kewen He Kazuya Kobayashi
Tong Zhang Kazuya Kobayashi
Yuxuan PAN Dongzhu LI Mototsugu HAMADA Atsutake KOSUGE
Shigeyuki Miyajima Hirotaka Terai Shigehito Miki
Xiaoshu CHENG Yiwen WANG Hongfei LOU Weiran DING Ping LI
Akito MORITA Hirotsugu OKUNO
Chunlu WANG Yutaka MASUDA Tohru ISHIHARA
Dai TAGUCHI Takaaki MANAKA Mitsumasa IWAMOTO
Kento KOBAYASHI Riku IMAEDA Masahiro MORIMOTO Shigeki NAKA
Yoshinao MIZUGAKI Kenta SATO Hiroshi SHIMADA
Baoquan ZHONG Zhiqun CHENG Minshi JIA Bingxin LI Kun WANG Zhenghao YANG Zheming ZHU
Kazuya TADA
Suguru KURATOMI Satoshi USUI Yoko TATEWAKI Hiroaki USUI
Yoshihiro NAKA Masahiko NISHIMOTO Mitsuhiro YOKOTA
Tsuneki YAMASAKI
Kengo SUGAHARA
Cuong Manh BUI Hiroshi SHIRAI
Hiroyuki DEGUCHI Masataka OHIRA Mikio TSUJI
Yongzhe Wei Zhongyuan Zhou Zhicheng Xue Shunyu Yao Haichun Wang
Mio TANIGUCHI Akito IGUCHI Yasuhide TSUJI
Kouji SHIBATA Masaki KOBAYASHI
Zhi Earn TAN Kenjiro MATSUMOTO Masaya TAKAGI Hiromasa SAEKI Masaya TAMURA
Koya TANIKAWA Shun FUJII Soma KOGURE Shuya TANAKA Shun TASAKA Koshiro WADA Satoki KAWANISHI Takasumi TANABE
Masaru KITSUREGAWA Weikang YANG Satoshi HIRANO Masanobu HARADA Minoru NAKAMURA Kazuhiro SUZUKI TaKayuki TAMURA Mikio TAKAGI
This paper presents an overview of the SDC-I (Super Database Computer I) developed at the University of Tokyo, Japan. The purpose of the project was to build a high performance SQL server which emphasizes query processing over transaction processing. Recently relational database systems tend to be used for heavy decision support queries, which include many join, aggregation, and order-by operations. At present high-end mainframes are used for these applications requiring several hours in some cases. While the system architecture for high traffic transaction processing systems is well established, that for adhoc query processing has not yet adequately understood. SDC-I proved that a parallel machine could attain significant performance improvements over a coventional sequential machine through the exploitation of the high degree of parallelism present in relational query processing. A unique bucket spreading parallel hash join algorithm is employed in SDC, which makes the system very robust in the presense of data skew and allows SDC to attain almost linear performance scalability. SDC adopts a hybrid parallel architecture, where globally it is a shared nothing architecture, that is, modules are connected through the multistage network, but each module itself is a symmetric multiprocessor system. Although most of the hardware elements use commodity microprocessors for improved performance to cost, only the interconnection network incorporates the special function to support our parallel relational algorithm. Data movement over the memory and the network, rather than computation, is heavy for I/O intensive database processing. A dedicated software system was carefully designed for efficient data movement. The implemented prototype consists of two modules. Its hardware and software organization is described. The performance monitoring tool was developed to visualize the system activities, which showed that SDC-I works very efficiently.
It is demonstrated that the enhancement in the functional capability of an elemental transistor is quite essential in developing human-like intelligent electronic systems. For this purpose we have introduced the concept of four-terminal devices. Four-terminal devices have an additional dimension in the degree of freedom in controlling currents as compared to the three-terminal devices like bipolar and MOS transistors. The importance of the four-terminal device concept is demonstrated taking the neuron MOS transistor (abbreviated as neuMOS or νMOS) and its circuit applications as examples. We have found that any Boolean functin can be realized by a two-stage configuratin of νMOS inverters. In addition, the variable threshold nature of the device allows us to build real-time reconfigurable logic circuits (no floating gate charging effect is involved in varying the threshold). Based on the principle, we have developed Soft-Hardware Logic Circuits and Real-Time Rule-Variable Data Matching Circuits. A winner-take-all circuit which finds the largest signal by hardware parallel processing has been also developed. The circuit is applied to building an associative memory which is different from Hopfield network in both principle and operation. The hardware algorithm in which binary, multivalue, and analog operations are merged at a very device level is quite essential to establish intelligent information processing systems based on highly flexible, real-time programmable hardwares realized by four-terminal devices.
Takahiro HANYU Maho KUWAHARA Tatsuo HIGUCHI
This paper presents a low-power 8-valued cellular array VLSI for high-speed image processing based on logical neighborhood operations with 3
Shoji KAWAHITO Kazuyuki TAKEDA Takanori NISHIMURA Yoshiaki TADOKORO
This paper presents a discrete Fourier analyzer using analog VLSI technology. An analog current-mode technique is employed for implementing it by a regular array structure based on the straight-forward discrete Fourier transform (DFT) algorithm. The basic components are 1-dimensional (1-D) analog current-mode multiplier array for fixed coefficient multiplication, two-dimensional (2-D) analog switch array and wired summations. The proposed scheme can process speedily N-point DFT in a time proportional to N. Possibility of the realization of the analog DFT VLSI based on 1 µm technology is discussed from the viewpoints of precision, speed, area, and power dissipation. In the case of 1024-point DFT, the standard deviation of the total error is estimated to be about 2%, the latency, or processing time is about 110 µs, and the signal sample rate based on a pipeline manner is about 4.7 MHz. A prototype MOS integrated circuit of the 16-point multiplier array has been implemented and a typical operation using the multiplier array has been confirmed.
Masakatsu MARUYAMA Hiroyuki NAKAHIRA Shiro SAKIYAMA Toshiyuki KOHDA Susumu MARUNO Yasuharu SHIMEKI
This paper discusses a digital neuroprocessor named Quantizer Neuron Chip (QNC) employing the Quantizer Neuron model and two newly developed schemes; "concurrent processing of quantizer neuron" and "removal of ineffective calculations". QNC simulates neural networks named the Multi-Functional Layered Network (MFLN) with 64 output neurons, 4672 quantizer neurons and two million synaptic weights and can be used for character or image recognition and learning. The processing speed of the chip achieved 1.6 µseconds per output neuron for recognition and 20 million connections updated per second (MCUPS) for learning. In addition, QNC can execute multichip operation for increasing the size of networks. We applied QNC to handwritten numeral recognition and realized high speed recognition and learning. QNC is implemented in a 1.2 µm double metal CMOS with sea of gates' technology and contains 27,000 gates on a 10.99
Luigi RAFFO Silvio P. SABATINI Giacomo INDIVERI Giovanni NATERI Giacomo M. BISIO
The paper describes the architecture and the simulated performances of a memory-based chip that emulates human cortical processing in early visual tasks, such as texture segregation. The featural elements present in an image are extracted by a convolution block and subsequently processed by the cortical chip, whose neurons, organized into three layers, gain relational descriptions (intelligent processing) through recurrent inhibitory/excitatory interactions between both inter-and intra-layer parallel pathways. The digital implementation of this architecuture directly maps the set of equations determining the status of the cortical network to achieve an optimal exploitation of VLSI technology in neural computation. Neurons are mapped into a memory matrix whose elements are updated through a programmable computational unit that implements synaptic interconnections. By using 0.5 µm-CMOS technology, full cortical image processing can be attained on a single chip (20
A fuzzy microprocessor is developed using 1.2 µm CMOS process. The inference scheme for the if-then fuzzy rules consists of three main steps i. e. if-part process, then-part process and defuzzification. In order to realize very high-speed inference and moderate programmability, we introduce three-type different structures i.e. SIMD, logic-in-memory and Wallace tree structures which are suitable for the three main steps. The inference speed including defuzzification is 7.5 MFLIPS which is 12.9 times higher than the previous VLSI implementation, and it can carry out many rules (960 rules) and many input and output variables (16 variables).
Saed SAMADI Akinori NISHIHARA Nobuo FUJII
In this paper we propose a method for increasing the reliability in multiprocessor realization of lowpass and highpass FIR digital filters possessing a maximally flat magnitude response. This method is based on the use of array realization of the filter which has been proposed earlier by the authors. It is shown that if a processing module of the array functions erroneously, it is possible to exclude the module and still obtain a lowpass FIR filter. However, as a price we should tolerate a slight degradation in the magnitude response of the filter that is equivalent to a wider transition band. We also analyze the behavior of the filter when our proposed schemes are implemented on more than one module. The justification of our approach is based on that a slight degradation of the spectral characteristics of a filter may be well tolerated in most filtering applications and thus a graceful degradation in the frequency domain can sufficiently reduce the vulnerability to errors.
Masafumi TAKAHASHI Hiroshige FUJII Emi KANEKO Takeshi YOSHIDA Toshinori SATO Hiroyuki TAKANO Haruyuki TAGO Seigo SUZUKI Nobuyuki GOTO
A 250-MIPS, 125-MFLOPS peak performance processing element (PE), which is being developed for an on-chip multiprocessor, has been modeled and evaluated. The PE includes the following new architecture components: an FPU shared by several IUs in order to increase the efficiency of the FPU pipelines, an on-chip data cache with a prefetch mechanism to reduce clock cycles waiting for memory, and an interface to high speed DRAM, such as Rambus DRAM and Synchronous DRAM. As a result, a PE model with an FPU shared by four or eight IUs causes only 10% performance reduction compared to a model with an un-shared FPU model while saving the cost of three FPUs. Furthermore, a PE model with prefetch operates 1.2 to 1.8 times faster than a model without prefetch at 250-MHz clock rate when the Rambus DRAM is connected. It becomes clear that this PE architecture can bring a high effective performance at over 250-MHz, and is cost-effective for the on-chip multiprocessor.
Yasuaki SAWANO Bumchul KIM Michitaka KAMEYAMA
In intelligent integrated systems such as robotics for autonomous work, it is essential to respond to the change of the environment very quickly. Therefore, the development of special-purpose VLSI processors for intelligent integrated systems with small latency becomes an very important subject. In this paper, we present a scheduling algorithm for high-level synthesis. The input to the scheduler is a behavioral description which is viewed as a data flow graph (DFG). The scheduler minimizes the latency, which is the delay of the critical path in the DFG, and minimizes the number of functional units and buses by improving the utilization rates. By using an integer linear programming, the scheduler optimally assigns nodes and arcs in the DFG into steps.
Masanori HARIYAMA Michitaka KANEYAMA
Real-time collision detection is one of the most important intelligent processings in robotics. In collision detection, a large storage capasity is usually required to store the 3-dimensional information on the obstacles located in a workspace. Moreover, high-computational power is essential in not only coordinate transformation but also matching operation. In the proposed collision detection VLSI processor, the matching operation is drastically accelerated by using a content-addressable memory (CAM). A new obstacle representation based on a union of rectangular solids is also used to reduce the obstacle memory capacity, so that the collision detection can be performed by only magnitude comparison in parallel. Parallel architecture using several identical processor elements (PEs) is employed to perform the coordinate transformation at high speed, and each PE performs coordinate transformation at high speed based on the COordinate Rotation DIgital Computation (CORDIC) algorithms. When the 16 PEs and 144-kb CAM are used, the performance is evaluated to be 90 ms.
Yoshifumi SASAKI Michitaka KAMEYAMA
In robot vision system, enormously large computation power is required to perform three-dimensional (3-D) instrumentation and object recognition. However, many kinds of complex and irregular operations are required to make accurate 3-D instrumentation and object recognition in the conventional method for software implementation. In this paper, a VLSI-oriented Model-Based Robot Vision (MBRV) processor is proposed for high-speed and accurate 3-D instrumentation and object recognition. An input image is compared with two-dimensional (2-D) silhouette images which are generated from the 3-D object models by means of perspective projection. Because the MBRV algorithm always gives the candidates for the accurate 3-D instrumentation and object recognition result with simple and regular procedures, it is suitable for the implementation of the VLSI processor. Highly parallel architecture is employed in the VLSI processor to reduce the latency between the image acquisition and the output generation of the 3-D instrumentation and object recognition results. As a result, 3-D instrumentation and object recognition can be performed 10000 times faster than a 28.5 MIPS workstation.
Yoshichika FUJIOKA Michitaka KAMEYAMA Nobuhiro TOMABECHI
In digital control, it is essential to make the delay time for a large number of multiply-additions small because of sensor feedback. To meet the requirement, an architecture of the reconfigurable parallel processor using field-programmable gate arrays (FPGAs) is proposed. Although the performance is drastically increased in the full custom VLSI implementation, even the reconfigurable parallel processor using FPGAs becomes useful for many practical digital control applications. The performance evaluation shows that the delay time for the resolved acceleration cotrol computation of a twelve-degrees-of-freedom (DOF) redundant manipulator becomes about 70 µs which is about seventeen times faster than that of a parallel processor approach using conventional digital signal processors (DSPs).
The ultimate minimum energy of switching mechanism for MOS integrated circuits have been studied. This report elucidates the evaluation methods for minimum switching energy of instantaneous discharged mechanism after charging one, namely, recycled energy of the MOS device. Two approaches are implemented to capture this concept. One is a switching energy by the time-dependent gate capacitance (TDGC) model ; the other one by results developed by transient device simulation, which was implemented using Finite Element Method (FEM). It is understood that the non-recycled minimum swhiching energies by both approaches show a good agreement. The recycled energies are then calculated at various sub-micron gate MOS/SOI devices and can be ultra-low power of the MOS integrated circuits, which may be possible to build recycled power circuitry for super energy-saving in the future new MOS LSI. From those results, (1) the TDGC is simultaneously verified by consistent match of the non-recycled minimum switching energies; (2) the recycled switching energy is found to be the ultimate lower bound of power for MOS device; (3) the recycled switching energy can be saved up to around 80% of that of current MOS LSI.
Mitsuhiro WADA Yasumitsu MIYAZAKI
This paper proposes a waveguide type optical amplifier which is constructed in 1.3 at.% Nd doped yttrium gallium garnet thin films deposited on yttrium aluminum garnet substrates by using RF sputtering . The crystalline thin film with a satisfactory stoichiometric composition is obtained by annealing at 1000