Takahiro KAKIMOTO Hiroyuki OCHI Takao TSUDA
As a design flow for low-power FPGA implementation, Datapath-Layout-Driven Design (DLDD) has been proposed. This letter reports the effect of DLDD for standard-cell-based ASIC implementation, and proposes necessary improvements. Experimental results shows that about 8.3% reduction of power dissipation is achieved in the best case.
Hiroyuki OCHI Shigeaki TAGASHIRA Satoshi FUJITA
In this paper, we propose a new localization scheme for wireless sensor networks consisting of a huge number of sensor nodes equipped with simple wireless communication devices such as wireless LAN and Bluetooth. The proposed scheme is based on the Point-In-Triangle (PIT) test proposed by He et al. The scheme is actually implemented by using Bluetooth devices of Class 2 standard, and the performance of the scheme is evaluated in an actual environment. The result of experiments indicates that the proposed scheme could realize a localization with an error of less than 2 m.
Tetsushi KATAYAMA Hiroyuki OCHI Takao TSUDA
Binary Decision Diagrams (BDDs) are graph representation of Boolean functions. In particular, Ordered BDDs (OBDDs) are useful in many situations, because they provide canonical representation and they are manipulated efficiently. BDD packages which automatically generate OBDDs have been developed, and they are now widely used in logic design area, including formal verification and logic synthesis. Synthesis of pass-transistor circuits is one of successful applications of such BDD packages. Pass-transistor circuits are generated from BDDs by mapping each node to a selector which consists of two or four pass transistors. If circuits are generated from smaller BDDs, generated circuits have smaller number of transistors and hence save chip area and power consumption. In this paper, more generic BDDs which have no restrictions in variable ordering and variable appearance count on its paths are called Generic BDDs (GBDDs), and an algorithm for generating GBDDs is proposed for the purpose of synthesis of pass-transistor circuits. The proposed algorithm consists of two steps. At the first step, parse trees (PTs) for given Boolean formulas are generated, where a PT is a directed tree representation of Boolean formula(s) and it consists of literal nodes and operation nodes. In this step, our algorithm attempts to reduce the number of literal nodes of PTs. At the second step, a GBDD is generated for the PTs using Concatenation Method, where Concatenation Method generates a GBDD by connecting GBDDs vertically. In this step, our algorithm attempts to share isomorphic subgraphs. In experiments on ISCAS'89 and MCNC benchmark circuits, our program successfully generated 32 GBDDs out of 680 single-output functions and 4 GBDDs out of 49 multi-output functions whose sizes are smaller than OBDDs. GBDD size is reduced by 23.1% in the best case compared with OBDD.
Kentaro NAKAHARA Shin'ichi KOUYAMA Tomonori IZUMI Hiroyuki OCHI Yukihiro NAKAMURA
Recently, reconfigurable devices are widely used in the fields of small amount production and trial production. They are also expected to be utilized in such mission-critical fields as space development, because system update and pseudo-repair can be achieved remotely by reconfiguring. However, in the case of conventional reconfigurable devices, configuration memory upsets caused by radiation and alpha particles reconfigure the device unpredictably, resulting in fatal system failures. Therefore, a reconfigurable device with high fault-tolerance against configuration upsets is required. In this paper, we propose an architecture of a fault-tolerant reconfigurable device that autonomously repairs configuration upsets by itself without interrupting system operations. The device consists of a 2D array of "Autonomous-Repair Cells" each of which repairs its upsets autonomously. The architecture has a scalability in fault tolerance; a finer-grained Autonomous-Repair Cell provides higher fault-tolerance. To determine the architecture, we analyze four autonomous repair techniques of the cell experimentally. Then, two autonomous repair techniques, simple multiplexing (S.M.) and memory multiplexing (M.M.), are applied; the former to programmable logics and the latter to cell-to-cell routing resources. Through evaluation, we show that proposed device achieves more than 10 years average lifetime against configuration upsets even in a severe situation such as a satellite orbit.
Tomonori IZUMI Shin'ichi KOUYAMA Hiroyuki OCHI Yukihiro NAKAMURA
This paper presents an approach of logic mapping into LUT-Array-Based PLD where Boolean functions in the form of the sum of generalized complex terms (SGCTs) can be mapped directly. While previous mapping approach requires predetermined variable ordering, our approach performs mapping and variable reordering simultaneously. For the purpose, we propose a directed acyclic graph based on the multiple-valued decision diagram (MDD) and an algorithm to construct the graph. Our algorithm generates candidates of SGCT expressions for each node in a bottom-up manner and selects the variables in the current level by evaluating the sizes of SGCT expressions directly. Experimental results show that our approach reduces the number of terms maximum to 71 percent for the MCNC benchmark circuits.
Takashi IMAGAWA Masayuki HIROMOTO Hiroyuki OCHI Takashi SATO
This paper proposes a reliability evaluation environment for coarse-grained reconfigurable architectures. This environment is designed so that it can be easily extended to different target architectures and applications by automating the generation of the simulation inputs such as HDL codes for fault injection and configuration information. This automation enables us to explore a huge design space in order to efficiently analyze area/reliability trade-offs and find the best solution. This paper also shows demonstrative examples of the design space exploration of coarse-grained reconfigurable architectures using the proposed environment. Through the demonstrations, we discuss relationship between coarse-grained architectures and reliability, which has not yet been addressed in existing literatures and show the feasibility of the proposed environment.
Recently, various efficient algorithms for solving combinatorial optimization problems using BDD-based set manipulation techniques have been developed. Minato proposed O-suppressed BDDs (ZBDDs) which is suitable for set manipulation, and it is utilized for various search problems. In terms of practical limits of space, however, there are still many search problems which are solved much better by using conventional branch-and-bound techniques than by using BDDs or ZBDDs, while the ability of conventional branch-and-bound approaches is limited by computation time. In this paper, an extension of APPLY operation, named APPRUNE (APply + PRUNE) operation, is proposed, which performs APPLY operation (ZBDD construction) and pruning simultaneously in order to reduce the required space for intermediate ZBDDs. As a prototype, a specific algorithm of APPRUNE operation is shown by assuming that the given condition for pruning is a threshold function, although it is expected that APPRUNE operation will be more effective if more sophisticated condition are considered. To reduce size of ZBDDs in intermediate steps, this paper also pay attention to the number of cared variables. As an application, an exact-minimization algorithm for generalized Reed-Muller expressions (GRMs) is implemented. From experimental results, it is shown that time and memory usage improved 8.8 and 3.4 times, respectively, in the best case using APPRUNE operation. Results on generating GRMs of exact-minimum number of not only product terms but also literals is also shown.
In order to apply formal design verification, it is necessary to describe formally and correctly the specification of the circuit under verification. Especially when we apply conventional OBDD-based logic comparison method for verifying combinational circuits, another correct" logic circuits or Boolean formulae must be given as the specification. It is desired to develop an efficient automatic design verification method which interprets specification that can be described easier. This paper provides a new verification method which is useful for combinational circuits such as arithmetic circuits. The proposed method efficiently verifies whether a designed circuit satisfies a specification given by recurrence equations. This enables us to describe easily an error-free specification for arithmetic circuits. To perform verification efficiently using an ordinary OBDD package, an efficient truth-value rotation algorithm is developed. The truthvalue rotation algorithm efficiently generates an OBDD representing f(x + 1 (mod 2n)) from a given OBDD representing f(x). By experiments on SPARC station 10 model 51, it takes 180 secs to generate an OBDD for designed circuit of 23-bit square function, and additional 60 secs is sufficient to finish verifying that it satisfies the specification given by recurrence equations.
Hiromitsu AWANO Hiroshi TSUTSUI Hiroyuki OCHI Takashi SATO
Random telegraph noise (RTN) is a phenomenon that is considered to limit the reliability and performance of circuits using advanced devices. The time constants of carrier capture and emission and the associated change in the threshold voltage are important parameters commonly included in various models, but their extraction from time-domain observations has been a difficult task. In this study, we propose a statistical method for simultaneously estimating interrelated parameters: the time constants and magnitude of the threshold voltage shift. Our method is based on a graphical network representation, and the parameters are estimated using the Markov chain Monte Carlo method. Experimental application of the proposed method to synthetic and measured time-domain RTN signals was successful. The proposed method can handle interrelated parameters of multiple traps and thereby contributes to the construction of more accurate RTN models.
Hiroshi YUASA Hiroshi TSUTSUI Hiroyuki OCHI Takashi SATO
We propose a novel acceleration scheme for Monte Carlo based statistical static timing analysis (MC-SSTA). MC-SSTA, which repeatedly executes ordinary STA using a set of randomly generated gate delay samples, is widely accepted as an accuracy reference. A large number of random samples, however, should be processed to obtain accurate delay distributions, and software implementation of MC-SSTA, therefore, takes an impractically long processing time. In our approach, a generalized hardware module, the STA processing element (STA-PE), is used for the delay evaluation of a logic gate, and netlist-specific information is delivered in the form of instructions from an SRAM. Multiple STA-PEs can be implemented for parallel processing, while a larger netlist can be handled if only a larger SRAM area is available. The proposed scheme is successfully implemented on Altera's Arria II GX EP2AGX125EF35C4 device in which 26 STA-PEs and a 624-port Mersenne Twister-based random number generator run in parallel at a 116 MHz clock rate. A speedup of far more than10 is achieved compared to conventional methods including GPU implementation.
Takashi IMAGAWA Masayuki HIROMOTO Hiroyuki OCHI Takashi SATO
Time redundancy is sometimes an only option for enhancing circuit reliability when the circuit area is severely restricted. In this paper, a time-redundant error-correction scheme, which is particularly suitable for coarse-grained reconfigurable arrays (CGRAs), is proposed. It judges the correctness of the executions by comparing the results of two identical runs. Once a mismatch is found, the second run is terminated immediately to start the third run, under the assumption that the errors tend to persist in many applications, for selecting the correct result in the three runs. The circuit area and reliability of the proposed method is compared with a straightforward implementation of time-redundancy and a selective triple modular redundancy (TMR). A case study on a CGRA revealed that the area of the proposed method is 1% larger than that of the implementation for the selective TMR. The study also shows the proposed scheme is up to 2.6x more reliable than the full-TMR when the persistent error is predominant.
This paper proposes 0-1-A-Ā LUT, a new programmable logic using atom switches, and a delay-optimal mapping algorithm for it. Atom switch is a non-volatile memory device of very small geometry which is fabricated between metal layers of a VLSI, and it can be used as a switch device of very small on-resistance and parasitic capacitance. While considerable area reduction of Look Up Tables (LUTs) used in conventional Field Programmable Gate Arrays (FPGAs) has been achieved by simply replacing each SRAM element with a memory element using a pair of atom switches, our 0-1-A-Ā LUT achieves further area and delay reduction. Unlike the conventional atom-switch-based LUT in which all k input signals are fed to a MUX, one of input signals is fed to the switch array, resulting area reduction due to the reduced number of inputs of the MUX from 2k to 2k-1, as well as delay reduction due to reduced fanout load of the input buffers. Since the fanout of this input buffers depends on the mapped logic function, this paper also proposes technology mapping algorithms to select logic function of fewer number of fanouts of input buffers to achieve further delay reduction. From our experiments, the circuit delay using our k-LUT is 0.94% smaller in the best case compared with using the conventional atom-switch-based k-LUT.
Kentaro NAKAHARA Shin'ichi KOUYAMA Tomonori IZUMI Hiroyuki OCHI Yukihiro NAKAMURA
Reconfigurable devices are expected to be utilized in such mission-critical fields as space development and undersea cables, because system updates and pseudo-repair can be achieved remotely by reconfiguring. However, conventional reconfigurable devices suffer from memory-bit upset caused by charged particles in space which results in fatal system problems. In this paper, we propose an architecture of a fault-tolerant reconfigurable device. The proposed device is divided into "autonomous-repair cells" with embedded control circuits. The autonomous-repair cell proposed in this paper is based on error detection and correction (EDAC) and uses hardware and time redundancy. From evaluation, it is shown that the proposed architecture achieves sufficient reliability against configuration memory upset. Trade-offs between performance and cost are also analyzed.
Hiroyuki OCHI Tatsuya SUZUKI Sayaka MATSUNAGA Yoichi KAWANO Takao TSUDA
Floating-point units (FPUs) are indispensable in processors, 3D-graphic engines, etc. To improve design productivity of these LSIs, FPU IPs are strongly desired. However, it is impossible to cover wide range of needs by an FPU IP, because there are various kind of options in specifications (e.g., operating frequency, latency, and ability of pipeline operation) and implementations (e.g., hardware algorithms). Thus, multiple IPs are needed even for the same functionality. In this paper, we propose to build an IP Library which consists of large number of FPU IPs with various kind of specifications and implementations, and which has catalogue data that shows not only specifications but also post-layout area and power dissipation of each IP. As the first step of the project, we have developed an IP Library targeted to Rohm 0.35 µm triple-metal process, which consists of 20 IPs for IEEE-754-standard single-precision floating-point division with 5 operating frequencies (50 MHz, 75 MHz, 100 MHz, 125 MHz, and 150 MHz), with two options whether pipelined or not, and with two hardware algorithms (the restoring method and the SRT method). We have also developed a catalogue for the IP Library, which shows post-layout area and power dissipation as well as specification of each IP. We have introduced two metrics "performance-area ratio (MFLOPS/mm2)" and "performance-power ratio (MFLOPS/W)" to afford a good insight into efficiency of implementations. From the catalogue data, the restoring method is, on the average, 1.4 times and 2.3 times better than the SRT method in terms of performance-area ratio and performance-power ratio, respectively. The developed catalogue is usable not only for selection of the optimal IP for a specific application, but also for quantitative analysis at the early stage of architecture design. It is also expected that the catalogue data based on an actual process technology is valuable for education.
Hiroyuki OCHI Yoko KAMIDOI Hideyuki KAWABATA
This paper proposes a new approach that makes it possible for every undergraduate student to perform experiments of developing a Ipipelined RISC processor within limited time available for the course. The approach consists of 4 steps. At the first step, every student implements by himself/herself a pipelined RISC processor which is based on a given, very simple model; it has separate buses for instruction and data memory ("Harvard architecture") to avoid structural hazard, while it completely ignores data control hazards to make implementation easy. Although it is such a "defective" processor, we can test its functionality by giving object code containing sufficient amount of NOP instructions to avoid hazards. At the second step, NOP instructions are deleted and behavior of the developed processor is observed carefully to understand data and control hazards. At the third step, benchmark problems are provided, and every student challenges to improve its performance. Finally every student is requested to present how he/she improved the processor. This paper also describes a new educational FPGA board ASAver.1 which is useful for experiments from introductory class to computer architecture/system class. As a feasibility study, a 16-bit pipelined RISC processor "ASAP-O" has been developed which has eight 16-bit general purpose registers, a 16-bit program counter, and a zero flag, with 10 essential instructions.
Junya KAWASHIMA Hiroshi TSUTSUI Hiroyuki OCHI Takashi SATO
We investigate a design strategy for subthreshold circuits focusing on energy-consumption minimization and yield maximization under process variations. The design strategy is based on the following findings related to the operation of low-power CMOS circuits: (1) The minimum operation voltage (VDDmin) of a circuit is dominated by flip-flops (FFs), and VDDmin of an FF can be improved by upsizing a few key transistors, (2) VDDmin of an FF is stochastically modeled by a log-normal distribution, (3) VDDmin of a large circuit can be efficiently estimated by using the above model, which eliminates extensive Monte Carlo simulations, and (4) improving VDDmin may substantially contribute to decreasing energy consumption. The effectiveness of the proposed design strategy has been verified through circuit simulations on various circuits, which clearly show the design tradeoff between voltage scaling and transistor sizing.
Yuki ABE Kazutoshi KOBAYASHI Jun SHIOMI Hiroyuki OCHI
Energy harvesting has been widely investigated as a potential solution to supply power for Internet of Things (IoT) devices. Computing devices must operate intermittently rather than continuously, because harvested energy is unstable and some of IoT applications can be periodic. Therefore, processors for IoT devices with intermittent operation must feature a hibernation mode with zero-standby-power in addition to energy-efficient normal mode. In this paper, we describe the layout design and measurement results of a nonvolatile standard cell memory (NV-SCM) and nonvolatile flip-flops (NV-FF) with a nonvolatile memory using Fishbone-in-Cage Capacitor (FiCC) suitable for IoT processors with intermittent operations. They can be fabricated in any conventional CMOS process without any additional mask. NV-SCM and NV-FF are fabricated in a 180nm CMOS process technology. The area overhead by nonvolatility of a bit cell are 74% in NV-SCM and 29% in NV-FF, respectively. We confirmed full functionality of the NV-SCM and NV-FF. The nonvolatile system using proposed NV-SCM and NV-FF can reduce the energy consumption by 24.3% compared to the volatile system when hibernation/normal operation time ratio is 500 as shown in the simulation.
Hiroki SUGANO Hiroyuki OCHI Yukihiro NAKAMURA Ryusuke MIYAMOTO
Recently, many researchers tackle accurate object recognition algorithms and many algorithms are proposed. However, these algorithms have some problems caused by variety of real environments such as a direction change of the object or its shading change. The new tracking algorithm, Cascade Particle Filter, is proposed to fill such demands in real environments by constructing the object model while tracking the objects. We have been investigating to implement accurate object recognition on embedded systems in real-time. In order to apply the Cascade Particle Filter to embedded applications such as surveillance, automotives, and robotics, a hardware accelerator is indispensable because of limitations in power consumption. In this paper we propose a hardware implementation of the Discrete AdaBoost algorithm that is the most computationally intensive part of the Cascade Particle Filter. To implement the proposed hardware, we use PICO Express, a high level synthesis tool provided by Synfora, for rapid prototyping. Implementation result shows that the synthesized hardware has 1,132,038 transistors and the die area is 2,195 µm 1,985 µm under a 0.180 µm library. The simulation result shows that total processing time is about 8.2 milliseconds at 65 MHz operation frequency.
In this paper, an exact-minimization method for an AND-EXOR expression (ESOP) using O-suppressed binary decision diagrams (ZBDDs) is considered. The proposed method is an improvement of Sasao's MRCF-based method. From experimental results, it is shown that required ZBDD size is reduced to 1/3 in the best case compared with the MRCF-based method.
Takashi KISHIMOTO Wataru TAKAHASHI Kazutoshi WAKABAYASHI Hiroyuki OCHI
In this paper, we propose a novel placement algorithm for mixed-grained reconfigurable architectures (MGRAs). MGRA consists of coarse-grained and fine-grained clusters, in order to implement a combined digital systems of high-speed data paths with multi-bit operands and random logic circuits for state machines and bit-wise operations. For accelerating simulated annealing based FPGA placement algorithm, range limiter has been proposed to control the distance of two blocks to be interchanged. However, it is not applicable to MGRAs due to the heterogeneous structure of MGRAs. Proposed range limiter using connection bounding box effectively keeps the size of range limiter to encourage moves across fine-grain blocks in non-adjacent clusters. From experimental results, the proposed method achieved 47.8% reduction of cost in the best case compared with conventional methods.