Joon-Seo YIM In-Cheol PARK Chong-Min KYUNG
In microprocessors, reducing the cache access delay and the number of pipeline stall is critical to improve the system performance. In this paper, we propose a Separated Word-line Decoding (SEWD) cache to overcome the pipeline stall caused by the misaligned multi-words data or instruction prefetches which are placed over two cache lines. SEWD cache makes it possible to perform misaligned prefetch as well as aligned prefetch in one clock cycle. This feature is invaluable because the branch target addresses are very often misaligned (Percentage of misalignment in the cache is 8 to 13% for 16-byte caches). 8Kbyte SEWD cache chip was implemented in 0.8µm DLM CMOS process. It consists of 489,000 transistors on a die size of 0.8530.827cm2.
This paper focuses on recovering from processor transient faults in pipelined multiprocessor systems. A pipelined machine may employ out of order execution and branch prediction techniques to increase performance, thus a precise computation state would not be available. We propose an efficient scheme to maintain the precise computation state in a pipelined machine. The goal of this paper is to implement checkpointing and rollback recovery utilizing the technique of precise interrupt in a pipelined system. Detailed analysis is included to demonstrate the effectiveness of this method.
Hiroshi ISHII Hiroaki NISHIKAWA Yuji INOUE
This paper describes the effectiveness of stream-oriented data-driven scheme for achieving autonomous fault management of hyper-distributed systems such as networks based on the Telecommunications Information Networking Architecture (TINA). TINA, whose specifications are in the finalizing phase within TINA-Consortium, is aiming at achieving interoperability and reusability of telecom applications software and independent of underlying technologies. However, to actually implement TINA network, it is essential to consider the technology constraints. Especially autonomous fault management at run-time is crucial for distributed network environment because centralized control using global information is very difficult. So far many works have been done on so-called off-line management but runtime management of service failure seems immature. This paper proposes introduction of stream-oriented data-driven processors to the autonomous fault management at runtime in TINA based distributed network environment. It examines the features of distributed network applications and technology requirements to achieve fault management of those distributed applications such as effective multiprocessing of surveillance, testing, reconfiguration in addition to ordinary processing.
Sam APPLETON Shannon MORTON Michael LIEBELT
In this paper we describe the implementation of complex architectures using a general design approach for two-phase asynchronous systems. This fundamental approach, called Event Controlled Systems, can be used to widely extend the utility of two phase systems. We describe solutions that we have developed that dramatically improve the performance of static and dynamic-logic asynchronous pipelines, and briefly describe a complex microprocessor designed using ECS.
In this paper, we give a concrete example of a 10-bit video rate ADC and introduce the effect of top-down design methodology with analog-HDL from the viewpoint of utilization techniques. First, we explain that analog top-down design methodology can improve chip performance by optimizing the architecture. Next, we concretely discuss the importance of modeling and verification. Verification of the full system does not require extracting all the information for each block at the transistor level in detail. The flexible verification method that we propose can provide good and fast full chip verification. We think analog top-down disign methodology will become increasingly more important from now on because "system-on-chip" requires one chip mixed-signal system LSIs.
The response time of adders is mainly determined by the carry propagation delay. This letter deals with a scheme which combines the address addition and decoding together. Although addition is involved in the process, we show that it can be computed without carry propagation. Memory latency is one of the most performance limiting factors. The authors present a new decoder logic named fused add-decoder (FADEC), which performs address addition and decoding in a single process. FADEC can reduce memory latency by eliminating separate address addition cycle.
Tatsuji MATSUURA Eiki IMAIZUMI Takanobu ANBO
Very-low-voltage 1.2-V mixed-signal CMOS technology is a device/circuit solution aimed at ultra-low-power portable systems such as digital cellular terminals and PDAs. We have developed an experimental 1.2-V mixed analog and digital LSI circuit/device technology. This technology is based on a new transistor structure that has a 0.3-µm gate length and a low Vth of 0.4 V, and that suppresses the short-channel effect. In this paper, we will mainly discuss low-voltage analog circuit design that uses this technology. We show that low Vth is essential not only to digital circuits, but also to 1.2-V analog amplifier, A/D converter and analog switch designs. To achieve high-conversion rate A/D converters, a pipeline architecture is used for low-voltage operation. To increase the attainable gain-bandwidth of the operational amplifier of the converter, a feedforward phase-compensated three-stage amplifier is proposed. The addition of a feedforward capacitor allows a high frequency signal to pass directly to the second stage, which optimizes use of the second stage bandwidth. Pole-zero canceling is used to achieve a fast settling of the amplifier. Although gain precision is degraded by the positive feedback through the feedforward capacitor, this can be offset by increasing the equivalent second-stage gain with an inner feedforward compensated amplifier. The gain-bandwidth of the proposed double feedforward amplifier is two to three times wider than with the conventional Miller compensation. With these techniques, we used 1.2-V mixed-signal CMOS technology to create a basic logic gate with a 400-ps delay and 0.4-µW/MHz power, and a 9-bit 2-Msample/s pipeline A/D converter with power dissipation of only 4 mW.
This parer describes high-speed CMOS SRAM circuit technologies used in cache memories. In recent years, high-speed SRAM technology has led to higher cycle frequencies, but the rate of increase in the SRAM density has slowed. Operating modes of high-speed SRAMs are compared and the advantage of wave-pipelined SRAMs in terms of cycle frequency is shown. Three types of sense amplifiers used in SRAMs are also compared from the viewpoint of speed and power dissipation. Current sense amplifiers provide high-speed operation with low power dissipation, while latch-type sense amplifiers appear most suitable for ultra-low-power SRAMs. Low voltage operation and size reduction of full CMOS cells are now the most pressing issues in the development of SRAMs for cache memories.
This paper describes the results obtained of a prototype system, VeriProc/1, based on an algorithm we first presented in [13] which can prove the correctness of pipelined processors automatically without pipeline invariant, human interaction, or additional information. No timing relations such as an abstract function or β-relation is required. The only information required is to specify the location of the selectors in the design. The performance is independent of not only data width but also memory size. Detailed analysis of CPU time is presented. Further, don't-care forcing using additional data easily prepared by the user can improve performance.
Among three factors mainly affecting the cache access time, i. e., hit access time, miss rate and miss penalty, previous approaches were focused on reducing the hit access time and miss rate. In this paper, we propose a scheme called MPC (Miss-Predicting Cache) which achives additional reduction of the average instruction cache access time through reducing the miss penalty. The MPC scheme which predicts cache miss and starts cache miss operations in advance, therefore, is supplementary to previous cache schemes targeted for reducing the miss rate and/or hit access time. Performance of the MPC scheme was evaluated using dinero, a trace-driven cache simulator, with the estimation of silicon area using 0.8 µm CMOS standard cell library.
Yasuhiro SUGIMOTO Shunsaku TOKITO Hisao KAKITANI Eitaro SETA
This paper describes a study to determine if a current-mode circuit is useful as an analog circuit technique for realizing submicron mixed analog-and-digital MOS LSIs. To examine this, we designed and circuit simulated a new current-mode ADC bit-block for a 3 V, 10-bit level, 20 MHz ADC with a pipeline architecture and with full current-mode approach. A new precision current-mode sample-and-hold circuit which enables operation of a bit block at a clock speed of 20 MHz was developed. Current mismatches caused by the poor output impedance of a device were also decreased by adopting a cascode configuration throughout the design. Operation with a 3 V power supply and a 20 MHz clock speed in a 3-bit A/D configuration was verified through circuit simulation using standard CMOS 0.6 µm device parameters. Gain error, mismatch of current, and linearity of the bit block with changing threshold voltage of a device were carefully examined. The bit block has a gain error of 0.2% (10-bit level), a linearity error of less than 0.1% (more than 10-bit level), and a current mismatch of DAC current sources in a bit cell of 0.2 to 0.4% (more than 8-bit level) with a 3 V power supply and 20 MHz clock speed. An 8-to 9-bit video-speed pipeline ADC can be realized without calibration. This confirms that the current-mode approach is effective.
Shinsuke OHNO Masao SATO Tatsuo OHTSUKI
CAMs (Content Addressable Memories) are functional memories which have functions such as word-parallel equivalence search, bilateral 1-bit data shifting between consecutive words, and word-parallel writing. Since CAMs can be integrated because of their regular structure, massively parallel CAM functions can be executed. Taking advantage of CAMs, Ishiura and Yajima have proposed a parallel fault simulation algorithm using a CAM. This algorithm, however, requires a large amount of CAM storage to simulate large-scale circuits. In this paper, we propose a new massively parallel fault simulation algorithm requiring less CAM storage, and compare it with Ishiura and Yajima's algorithm. Experimental results of the algorithm on CHARGE --the CAM-based hardware engine developed in our laboratory--are also reported.
Nguyen Ngoc BINH Masaharu IMAI Akichika SHIOMI Nobuyuki HIKICHI
This paper proposes a new method to design an optimal pipelined instruction set processor using formal HW/SW codesign methodology. A HW/SW partitioning algorithm for selecting an optimal pipelined architecture is introduced. The codesign task addressed in this paper is to find a set of hardware implemented operations to achieve the highest performance of an ASIP with pipelined architecture under given gate count and power consumption constraints. The problem formalization as well as the proposed algorithm can be considered as an extension of our previous work toward a pipelined architecture. The experimental results show that the proposed method is quite effective and efficient.
The most creative tasks in synthesizing pipelined data paths executing software descriptions are determinations of latency and stage of pipeline, operation scheduling and hardware allocation. They are interrelated closely and depend on each other; thus finding its optimal solution has been a hard problem so far. By using simulated annealing methodology, these three tasks can be formulated as a three dimensional placement problem of operations in stage, time step and functional units space. This paper presents an efficient method based on simulated annealing to provide excellent solutions to the problem of not only the determinations of latency and stage of pipeline, operation scheduling and hardware allocation simultaneously, but also the pipelined data path synthesis under the constraints of performance or hardware cost. It is able to find a near optimal latency and stage of pipeline, an operation schedule and a hardware allocation in a reasonable time, while effectively exploring the existing tradeoffs in the design space.
We describe a formal verification algorithm for pipelined processors. This algorithm proves the equivalence between a processor's design and its specifications by using rewriting of recursive functions and a new type of mathematical induction: extended recursive induction. After the user indicates only selectors in the design, this algorithm can automatically prove processors having more than 10(1010) states. The algorithm is manuary applied to benchmark processors with pipelined control, and we discuss how data width, memory size, and the numbers of pipeline stages and instructions influence the computation cost of proving the correctness of the processors. Further, this algorithm can be used to generate a pipeline invariant.
Nguyen Ngoc BINH Masaharu IMAI Akichika SHIOMI Nobuyuki HIKICHI Yoshimichi HONMA Jun SATO
In this paper we describe the formal conditions to detect and resolve all kinds of pipeline data hazards and propose a scheduling algorithm for pipelined instruction set processor synthesis. The algorithm deals with multi cycle operations and tries to minimize the pipeline execution cycles under a given hardware configuration with/without hardware interlock. The main feature that makes the proposed algorithm different from existing ones is the algorithm is for estimating the performance in HW/SW partitioning, with capability of handling a module library of different FUs and dealing with multi cycle operations to be implemented in software. Experimental results of application to ASIP HW/SW codesign show that the proposed algorithm is effective and considerable pipeline execution cycle reduction rates can be achieved. The time complexity of the scheduing algorithm is of O(n2) in the worst case, where n is the number of instructions in a given basic block.
The newly developed high speed DRAMs are introduced and their innovative circuit techniques for achieving a high data bandwidth are described; the synchronous DRAM, the cache DRAM and the Rambus DRAM. They are all designed to fill the performance gap between MPUs and the main memory of computer systems, which will diverge in '90s. Although these high speed DRAMs have the same purpose to increase the data bandwidth, their approaches to accomplish it is different, which may in turn lead to some advantages or disadvantages as well as their fields of applications. The paper is intended not only to discuss them from technical overview, but also to be a guide to DRAM users when choosing the best fitting one for their systems.
Hironori YAMAUCHI Tetsuo MOROSAWA Takashi WATANABE Atsushi IWATA Tsutomu HOSAKA
Three custom LSIs for EB60, a direct wafer exposure electron beam system, have been developed using 0.8 µm BiCMOS and SST bipolar technologies. The three LSIs are i) a shot cycle control LSI for controlling each exposure cycle time, ii) a linear matrix computation LSI for coordinate modification of the exposure pattern data, and iii) a position calculation LSI for determining the precise position of the wafer. These LSIs allow the deflection corrector block of the revised EB60 to be realized on a single board. A new adaptive pipeline control technique which optimizes each shot period according to the exposure data is implemented in the shot-cycle control LSI. The position calculation LSI implements a new, highly effective 2-level pipeline exposure technique, the levels refer to major-field-deflection and minor-field-deflection. The linear-matrix computation LSI is designed not only for the EB60 but also for a wide variety of parallel digital processing applications.
Norihiko TANAKA Takakazu KUROKAWA Takashi MATSUBARA Yoshiaki KOGA
This paper proposes a new fault tolerant intercommunication scheme for real-time operations and three new interconnection networks to construct a fault tolerant multi-processor system for pipeline processings. The proposed intercommunication scheme using bank memory switching technique has an advantage to make a fault tolerant pipeline system so that it can detect any failure caused in a processing element of the system. In addition, it can overcome conventional problems caused in interconnection circuits to flow data with one way direction such as a pipeline processing.
Yoshio HIROSE Hideaki ANBUTSU Koichi YAMASHITA Gensuke GOTO
This paper describes a VLSI processor architecture designed for a back-propagation accelerator. Three techniques are used to accelerate the simulation. The first is a multi-processor approach where a neural network simulation is suitable for parallel processing. By constructing a ring network using several processors, the simulation speed is multiplied by the number of the processors. The second technique is internal parallel processing. Each processor contains 4 multipliers and 4 ALUs that all work in parallel. The third technique is pipelining. The connections of eight functional units change according to the current stage of the back-propagation algorithm. Intermediate data is sent from one functional unit to another without being stored in extra registers and data is processed in a pipeline manner. The data is in 24-bit floating point format (18-bit mantissa and 6-bit oxponent). The chip has about 88,000 gates, including microcode ROM for processor control, the processor is designed using 0.8-µm CMOS gate arrays, and the estimated performance at 40 MHz is 20 million connection updates per second (MCUPS). For a ring network with 4 processors, performance can be enhanced up to 90 MCUPS.