Keyword Search Result

[Keyword] microprocessor (51 hits)

Showing results 1-20 of 51 hits

  • Non-Stop Microprocessor for Fault-Tolerant Real-Time Systems Open Access

    Shota NAKABEPPU  Nobuyuki YAMASAKI  

     
    PAPER

      Publicized:
    2023/01/25
      Vol:
    E106-C No:7
      Page(s):
    365-381

    It is very important to design an embedded real-time system as a fault-tolerant system to ensure dependability. In particular, when a power failure occurs, a real-time system using a conventional processor requires restart processing after power is restored. Even if power is restored quickly, the restart process takes a long time and causes deadline misses. To design a fault-tolerant real-time system, a processor is needed that can resume operation shortly after power is restored, no matter when the power failure occurs. Since current embedded real-time systems are required to execute many tasks, high schedulability for high throughput is also important. This paper proposes a non-stop microprocessor architecture for fault-tolerant real-time systems. The non-stop microprocessor is designed to resume normal operation even if a power failure occurs at any time, to suffer little performance degradation (and thus maintain high schedulability) even if checkpoints are created and restored many times, to flexibly control non-volatile devices through software configuration, and to ensure data consistency no matter when a checkpoint is restored. The evaluation shows that the non-stop microprocessor can restore a checkpoint within 5 µs and almost completely hide the overhead of checkpoint creation. A non-stop microprocessor with such capabilities will be an essential component of a fault-tolerant real-time system with high schedulability.
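
    The checkpointing discipline described above can be pictured with a minimal software sketch (a conceptual model only, not the paper's hardware design; the structure and function names below are hypothetical): volatile processor state is copied to a non-volatile region when a checkpoint is created and restored from it when power returns.

        /* Conceptual sketch of checkpoint creation and restoration.
         * All names (cpu_state_t, nv_area, nv_valid) are hypothetical;
         * a real design commits state to non-volatile devices in hardware. */
        #include <stdint.h>

        typedef struct {
            uint32_t regs[32];   /* architectural registers */
            uint32_t pc;         /* program counter         */
        } cpu_state_t;

        static cpu_state_t nv_area;    /* stand-in for a non-volatile region    */
        static int         nv_valid;   /* "a complete checkpoint is committed"  */

        void checkpoint_create(const cpu_state_t *live) {
            nv_area  = *live;   /* copy volatile state to non-volatile storage */
            nv_valid = 1;       /* only a complete copy is ever restored       */
        }

        int checkpoint_restore(cpu_state_t *live) {
            if (!nv_valid)      /* no checkpoint yet: cannot resume */
                return -1;
            *live = nv_area;    /* resume from the last committed checkpoint */
            return 0;
        }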

  • Reducing Energy Consumption of Wakeup Logic through Double-Stage Tag Comparison

    Yasutaka MATSUDA  Ryota SHIOYA  Hideki ANDO  

     
    PAPER-Computer System

      Publicized:
    2021/11/02
      Vol:
    E105-D No:2
      Page(s):
    320-332

    The high energy consumption of current processors causes several problems, including a limited clock frequency, short battery lifetime, and reduced device reliability. It is therefore important to reduce the energy consumption of the processor. Among the resources in a processor, the issue queue (IQ) is a large consumer of energy, much of which is consumed by the wakeup logic. Within the wakeup logic, the tag comparison that checks source operand readiness consumes a significant amount of energy. This paper proposes an energy reduction scheme for tag comparison, called double-stage tag comparison. This scheme first compares the lower bits of the tag and then, only if these match, compares the higher bits. Because the energy consumption of tag comparison is roughly proportional to the total number of bits compared, energy is saved by reducing this number. However, this sequential comparison increases the delay of the IQ, thereby increasing the clock cycle time. Although this can be avoided by allocating an extra cycle to the issue operation, doing so in turn degrades the IPC. To avoid IPC degradation, we reconfigure a small number of IQ entries, in which the oldest instructions (those most likely to hurt performance if delayed) reside, to use single-stage tag comparison. Our evaluation results for the SPEC2017 benchmark programs show that double-stage tag comparison achieves on average a 21% reduction in the energy consumed by the wakeup logic (15% when including the overhead) with only 3.0% performance degradation.
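
    A minimal sketch of the comparison order described above (the two-stage split and the 3-bit low field are illustrative assumptions, not the paper's exact parameters):

        /* Double-stage tag comparison: stage 1 compares only the low-order
         * bits; stage 2 compares the high-order bits only when stage 1
         * matches, so most mismatches are resolved after comparing few bits. */
        #include <stdbool.h>
        #include <stdint.h>

        #define LOW_BITS  3u                      /* illustrative split point */
        #define LOW_MASK  ((1u << LOW_BITS) - 1u)

        bool tag_match_double_stage(uint32_t broadcast_tag, uint32_t src_tag) {
            /* Stage 1: cheap low-bit comparison. */
            if ((broadcast_tag & LOW_MASK) != (src_tag & LOW_MASK))
                return false;                     /* most comparisons end here */
            /* Stage 2: high-bit comparison, performed only on a stage-1 hit. */
            return (broadcast_tag >> LOW_BITS) == (src_tag >> LOW_BITS);
        }

    Under the assumption that non-matching tags usually differ already in their low bits, the average number of bits compared, and hence the comparison energy, drops.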

  • Towards Ultra-High-Speed Cryogenic Single-Flux-Quantum Computing Open Access

    Koki ISHIDA  Masamitsu TANAKA  Takatsugu ONO  Koji INOUE  

     
    INVITED PAPER

      Vol:
    E101-C No:5
      Page(s):
    359-369

    CMOS microprocessors are limited in their capacity for clock speed improvement because of increasing power consumption, i.e., they face the power-wall problem. Single-flux-quantum (SFQ) circuits offer a solution with their ultra-high-speed and ultra-low-power nature. This paper introduces our contributions towards ultra-high-speed cryogenic SFQ computing. The first step is to design SFQ microprocessors. From qualitative and quantitative evaluations of previously designed SFQ microprocessors, we have found that revisiting the architecture of SFQ microprocessors and on-chip caches is the first critical challenge. On the basis of cross-layer discussion and analysis, we concluded that a bit-parallel gate-level pipeline architecture is the best solution for SFQ designs. This paper summarizes our current research results targeting SFQ microprocessors and on-chip cache architectures.

  • An Operating System Guided Fine-Grained Power Gating Control Based on Runtime Characteristics of Applications

    Atsushi KOSHIBA  Mikiko SATO  Kimiyoshi USAMI  Hideharu AMANO  Ryuichi SAKAMOTO  Masaaki KONDO  Hiroshi NAKAMURA  Mitaro NAMIKI  

     
    PAPER

      Vol:
    E99-C No:8
      Page(s):
    926-935

    Fine-grained power gating (FGPG) is a power-saving technique that switches off circuit blocks while they are idle. Although FGPG can reduce power consumption without compromising computational performance, switching the power supply on and off incurs an energy overhead. To prevent a power increase caused by this overhead, in our prior research we proposed an FGPG control method in which the operating system (OS) controls power gating based on a pre-analysis of each application's power usage. However, modern computing systems have a wide variety of use cases and run many types of applications; this makes it difficult to analyze the behavior of all these applications in advance. This paper therefore proposes a new FGPG control method that does not require profiling application programs in advance. In the proposed method, the OS periodically monitors a circuit's idle interval while application programs are running and enables FGPG only if the idle interval is long enough for power gating to reduce power consumption. The experimental results in this paper show that the proposed method reduces power consumption by 9.8% on average and by up to 17.2% at 25°C. The results also show that the proposed method achieves almost the same power-saving efficiency as the previous profile-based method.
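
    The runtime policy can be pictured with a short sketch (the names and the break-even value below are hypothetical; the abstract does not give these details):

        /* OS-guided FGPG policy: periodically sample each block's idle interval
         * and enable power gating only when the measured idleness is long
         * enough that the energy saved exceeds the on/off switching overhead. */
        #include <stdbool.h>
        #include <stdint.h>

        #define BREAK_EVEN_NS 5000u   /* illustrative break-even idle time */

        struct pg_block {
            uint64_t idle_ns;         /* idle interval sampled by the OS       */
            bool     gate_enabled;    /* power-gating enable decided by the OS */
        };

        void fgpg_policy_tick(struct pg_block *blk, int nblocks) {
            for (int i = 0; i < nblocks; i++) {
                /* Gate only blocks whose idleness amortizes the overhead. */
                blk[i].gate_enabled = (blk[i].idle_ns >= BREAK_EVEN_NS);
            }
        }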

  • RSFQ 4-bit Bit-Slice Integer Multiplier

    Guang-Ming TANG  Kazuyoshi TAKAGI  Naofumi TAKAGI  

     
    PAPER

      Vol:
    E99-C No:6
      Page(s):
    697-702

    A rapid single-flux-quantum (RSFQ) 4-bit bit-slice multiplier is proposed. A new systolic-like multiplication algorithm suitable for RSFQ implementation is developed. The multiplier is designed using the cell library for the AIST 10-kA/cm² 1.0-µm fabrication technology (ADP2). Concurrent-flow clocking is used to obtain a fully pipelined RSFQ logic design. A 4n×4n-bit multiplier consists of 2n+17 stages. To verify the algorithm and the logic design, a physical layout of the 8×8-bit multiplier was designed with a target operating frequency of 50 GHz and simulated. It consists of 21 stages and 11,488 Josephson junctions. The simulation results show correct operation up to 62.5 GHz.
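
    As a quick consistency check on the stage count quoted above: the $8 \times 8$-bit layout corresponds to $n = 2$ in the $4n \times 4n$-bit formulation, so the expected number of stages is $2n + 17 = 2 \cdot 2 + 17 = 21$, which matches the 21 stages reported for the simulated design.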

  • Performance of Dynamic Instruction Window Resizing for a Given Power Budget under DVFS Control

    Hideki ANDO  Ryota SHIOYA  

     
    PAPER-Computer System

      Publicized:
    2015/11/12
      Vol:
    E99-D No:2
      Page(s):
    341-350

    Dynamic instruction window resizing (DIWR) is a scheme that effectively exploits both memory-level parallelism and instruction-level parallelism by configuring the instruction window size appropriately for each form of parallelism. Although a previous study showed that the DIWR processor achieves a significant speedup, its power consumption has not been explored. DIWR increases power consumption because the instruction window resources are enlarged in memory-intensive phases. If the power consumption exceeds the power budget determined by certain requirements, the DIWR processor must save power, and thus the previously reported performance cannot be achieved. In this paper, we explore to what extent the DIWR processor can improve performance for a given power budget, assuming that dynamic voltage and frequency scaling (DVFS) is introduced as a power-saving technique. Evaluation results using the SPEC2006 benchmark programs show that the DIWR processor, even with a constrained power budget, achieves a speedup over the conventional processor across a wide range of power budgets. At the most important power budget point, i.e., when the power that a conventional processor consumes without any power constraint is supplied, DIWR achieves a 16% speedup.

  • A Perpetuum Mobile 32bit CPU on 65nm SOTB CMOS Technology with Reverse-Body-Bias Assisted Sleep Mode

    Koichiro ISHIBASHI  Nobuyuki SUGII  Shiro KAMOHARA  Kimiyoshi USAMI  Hideharu AMANO  Kazutoshi KOBAYASHI  Cong-Kha PHAM  

     
    PAPER

      Vol:
    E98-C No:7
      Page(s):
    536-543

    A 32-bit CPU is presented that can operate for more than 15 years on a 220 mAh Li battery, or indefinitely with an energy harvester powered by indoor light. The CPU was fabricated using 65 nm SOTB (Silicon on Thin Buried oxide) CMOS technology, with a gate length of 60 nm and a BOX layer thickness of 10 nm. The threshold voltage was designed to be as low as 0.19 V so that the CPU operates in the over-threshold region even at supply voltages as low as 0.22 V. A large reverse body bias of up to -2.5 V can be applied to the bodies of the SOTB devices to reduce the sleep current of the CPU without increasing gate-induced drain leakage current. The CPU operated at 14 MHz and 0.35 V with a minimum energy of 13.4 pJ/cycle, and a sleep current of 0.14 µA was obtained at 0.35 V with a body bias of -2.5 V. These characteristics are suitable for new applications such as energy-harvesting sensor network systems and long-lasting wearable computers.
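
    A rough back-of-envelope check, assuming only the figures quoted in the abstract and ignoring battery self-discharge, conversion losses, and peripheral current: 15 years is about $1.3 \times 10^{5}$ hours, so a 220 mAh battery allows an average current of roughly $220\ \text{mAh} / (1.3 \times 10^{5}\ \text{h}) \approx 1.7\ \mu\text{A}$. With the reported sleep current of 0.14 µA and an active draw of about $13.4\ \text{pJ/cycle} \times 14\ \text{MHz} / 0.35\ \text{V} \approx 0.54\ \text{mA}$, the 15-year figure implies an active duty cycle on the order of 0.3%, i.e., the CPU spends nearly all of its time in the reverse-body-bias sleep mode.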

  • MLP-Aware Dynamic Instruction Window Resizing in Superscalar Processors for Adaptively Exploiting Available Parallelism

    Yuya KORA  Kyohei YAMAGUCHI  Hideki ANDO  

     
    PAPER-Computer System

      Publicized:
    2014/09/22
      Vol:
    E97-D No:12
      Page(s):
    3110-3123

    Single-thread performance has not improved much over the past few years, despite an ever-increasing transistor budget. One of the reasons for this is the speed gap between the processor and main memory, known as the memory wall. A promising method to overcome this memory wall is aggressive out-of-order execution that extensively enlarges the instruction window resources to exploit memory-level parallelism (MLP). However, simply enlarging the window resources lengthens the clock cycle time. Although pipelining the resources solves this problem, it in turn prevents instruction-level parallelism (ILP) from being exploited because issuing instructions requires multiple clock cycles. This paper proposes a dynamic scheme that adaptively resizes the instruction window based on the predicted available parallelism, either ILP or MLP. Specifically, if the scheme predicts that MLP is available during execution, the instruction window is enlarged and the window resources are pipelined, thereby exploiting MLP. Conversely, if the scheme predicts that less MLP is available, that is, that ILP is exploitable for improved performance, the instruction window is shrunk and the window resources are de-pipelined, thereby exploiting ILP. Our evaluation results using the SPEC2006 benchmark programs show that the proposed scheme achieves nearly the best performance possible with fixed-size resources. On average, our scheme realizes a performance improvement of 21% over that of a conventional processor, at an additional cost of only 6% of the area of the conventional processor core, or 3% of that of the entire processor chip. The evaluation results also show 8% better energy efficiency in terms of 1/EDP (energy-delay product).
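
    The resizing decision described above can be sketched as follows (the predictor interface, the entry counts, and the issue latencies are illustrative assumptions, not the paper's parameters):

        /* MLP-aware window resizing: when MLP is predicted, enlarge and
         * pipeline the window resources; otherwise keep them small and
         * un-pipelined so that single-cycle issue preserves ILP. */
        #include <stdbool.h>

        enum window_mode { SMALL_UNPIPELINED, LARGE_PIPELINED };

        struct window_cfg {
            enum window_mode mode;
            int entries;                 /* instruction window size           */
            int issue_latency_cycles;    /* 1 when un-pipelined, >1 otherwise */
        };

        struct window_cfg resize_window(bool mlp_predicted) {
            struct window_cfg cfg;
            if (mlp_predicted) {         /* memory-intensive phase: chase MLP */
                cfg.mode = LARGE_PIPELINED;
                cfg.entries = 128;
                cfg.issue_latency_cycles = 2;
            } else {                     /* compute phase: keep issue single-cycle */
                cfg.mode = SMALL_UNPIPELINED;
                cfg.entries = 32;
                cfg.issue_latency_cycles = 1;
            }
            return cfg;
        }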

  • Area-Efficient Microarchitecture for Reinforcement of Turbo Mode

    Shinobu MIWA  Takara INOUE  Hiroshi NAKAMURA  

     
    PAPER-Computer System

      Vol:
    E97-D No:5
      Page(s):
    1196-1210

    Turbo mode, which accelerates many applications without major changes to existing systems, is widely used in commercial processors. Since the duration or strength of turbo mode depends on the peak temperature of the processor chip, reducing the peak temperature can reinforce turbo mode. This paper shows that adding a small amount of hardware allows microprocessors to reduce the peak temperature drastically and thus reinforce turbo mode. Our approach is to identify the few small units that become heat sources in a processor and to duplicate them appropriately so as to reduce their power density. By duplicating only these units and using the copies evenly, the processor achieves a significant performance improvement while remaining area-efficient. The experimental results show that the proposed method achieves up to a 14.5% performance improvement in exchange for a 2.8% area increase.

  • Delay Evaluation of Issue Queue in Superscalar Processors with Banking Tag RAM and Correct Critical Path Identification

    Kyohei YAMAGUCHI  Yuya KORA  Hideki ANDO  

     
    PAPER-Computer System

      Vol:
    E95-D No:9
      Page(s):
    2235-2246

    This paper evaluates the delay of the issue queue in a superscalar processor to aid microarchitectural design, where quick quantification of issue queue complexity is needed to consider the tradeoff between clock cycle time and instructions per cycle. Our study covers two aspects. First, we introduce banking of the tag RAM that comprises the issue queue to reduce the delay. Unlike for normal RAM, this is not straightforward because of the unique organization of the issue queue. Second, we explore and identify the correct critical path in the issue queue. In a previous study, the critical path delays of the components in the issue queue were summed to obtain the issue queue delay, but this does not give the correct delay, because the critical paths of the components are not connected logically. In an evaluation assuming 32-nm LSI technology, we obtained the delays of issue queues with 8 to 128 entries. Banking the tag RAM and identifying the correct critical path reduce the evaluated delay by up to 20% and 23% for 4- and 8-issue widths, respectively, compared with an unbanked tag RAM and a simple sum of the component critical path delays.

  • A Dynamic Continuous Signature Monitoring Technique for Reliable Microprocessors

    Makoto SUGIHARA  

     
    PAPER

      Vol:
    E94-C No:4
      Page(s):
    477-486

    Reliability issues such as soft errors and NBTI (negative bias temperature instability) have become a matter of concern as integrated circuits continue to shrink. It is becoming more and more important to take reliability requirements into account, even for consumer products. This paper presents a dynamic continuous signature monitoring (DCSM) technique for highly reliable computer systems. The DCSM technique dynamically generates reference signatures as well as runtime ones while executing a program. Unlike conventional static continuous signature monitoring techniques, the DCSM technique stores the generated signatures in a signature table, a small storage circuit in the microprocessor, and thereby saves the program or data memory space that would otherwise hold the signatures. Our experiments showed that our DCSM technique protected 1.4-100.0% of executed instructions, depending on the size of the signature table.
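
    The basic checking step can be pictured with a small sketch (the signature function and the table handling below are illustrative only and are not the paper's DCSM hardware):

        /* Continuous signature monitoring, conceptually: a signature is
         * accumulated over the instruction words of a code block and compared
         * against a reference kept in a small signature table; a mismatch
         * signals a control-flow or instruction-corruption error. */
        #include <stdbool.h>
        #include <stdint.h>

        #define SIG_TABLE_ENTRIES 16

        static uint32_t sig_table[SIG_TABLE_ENTRIES];   /* reference signatures */

        uint32_t sig_update(uint32_t sig, uint32_t insn_word) {
            /* Simple rotate-and-XOR accumulation (illustrative hash). */
            return ((sig << 1) | (sig >> 31)) ^ insn_word;
        }

        bool sig_check(unsigned block_id, uint32_t runtime_sig) {
            return sig_table[block_id % SIG_TABLE_ENTRIES] == runtime_sig;
        }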

  • A Multi-Performance Processor for Reducing the Energy Consumption of Real-Time Embedded Systems

    Tohru ISHIHARA  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E93-A No:12
      Page(s):
    2533-2541

    This paper proposes an energy-efficient processor that can be used as a design alternative to dynamic voltage scaling (DVS) processors in embedded system design. The processor consists of multiple PE (processing element) cores and a selective set-associative cache memory. The PE cores have the same instruction set architecture but differ in their clock speeds and energy consumption. Only a single PE core is activated at a time, and the other PE cores are deactivated using clock gating and signal gating techniques. The major advantage over DVS processors is the small overhead for changing performance. Gate-level simulation demonstrates that our processor can change its performance within 1.5 microseconds and dissipates about 10 nanojoules per transition, while conventional DVS processors need hundreds of microseconds and dissipate a few microjoules for the performance transition. This makes it possible to apply our multi-performance processor to many real-time systems and to perform finer-grained and more sophisticated dynamic voltage control.

  • Register File Size Reduction through Instruction Pre-Execution Incorporating Value Prediction

    Yusuke TANAKA  Hideki ANDO  

     
    PAPER-Computer System

      Vol:
    E93-D No:12
      Page(s):
    3294-3305

    Two-step physical register deallocation (TSD) is an architectural scheme that enhances memory-level parallelism (MLP) by pre-executing instructions. Ideally, TSD allows exploitation of MLP under an unlimited number of physical registers, and consequently only a small register file is needed for MLP. In practice, however, the amount of MLP exploitable is limited, because there are cases where either 1) pre-execution is not performed, or 2) the timing of pre-execution is delayed. Both are due to data dependencies among the pre-executed instructions. This paper proposes the use of value prediction to solve these problems. Evaluation results using the SPECfp2000 benchmark confirm that the proposed scheme, using value prediction to predict addresses, achieves IPC equivalent to the previous TSD scheme with a smaller register file. The register file size is reduced by 21%.

  • Energy-Efficient Pre-Execution Techniques in Two-Step Physical Register Deallocation

    Kazunaga HYODO  Kengo IWAMOTO  Hideki ANDO  

     
    PAPER-Computer Systems

      Vol:
    E92-D No:11
      Page(s):
    2186-2195

    Instruction pre-execution is an effective way to prefetch data. We previously proposed an instruction pre-execution scheme, which we call two-step physical register deallocation (TSD). TSD realizes pre-execution by exploiting the difference between the amount of instruction-level parallelism available with an unlimited number of physical registers and that available with the actual number of physical registers. Although the previous TSD study successfully improved performance, its energy consumption is still inefficient. This is because instructions are pre-executed as much as possible, regardless of whether they contribute significantly to load latency reduction, in order to maximize performance improvement. This paper presents a scheme that improves the energy efficiency of TSD by pre-executing only those instructions that offer a large benefit. Our evaluation results using the SPECfp2000 benchmark show that our scheme reduces the dynamic pre-executed instruction count by 76% compared with the original scheme. This reduction saves 7% of the energy consumption of the execution core, with 2% overhead. Performance degrades by 2% compared with the original scheme, but is still 15% higher than that of a normal processor without TSD.

  • Ultra Dependable Processor

    Shuichi SAKAI  Masahiro GOSHIMA  Hidetsugu IRIE  

     
    INVITED PAPER

      Vol:
    E91-C No:9
      Page(s):
    1386-1393

    This paper presents a processor architecture that provides a much higher level of dependability than current processors. Its features are: (1) fault tolerance and secure processing are integrated into a modern superscalar VLSI processor; (2) lightweight, effective soft-error-tolerant mechanisms are proposed and evaluated; (3) timing errors on random logic and registers are prevented by low-overhead mechanisms; (4) program behavior is hidden from the outside world by the proposed address translation methods; (5) information leakage can be avoided by attaching policy tags to all data and monitoring them at each instruction execution; (6) injection attacks are avoided with much higher accuracy than in current systems by tag tracking; (7) the overall structure of the dependable processor is proposed, with a dependability manager that controls the detection of illegal conditions and recovery to the normal mode; and (8) an FPGA-based testbed system is developed in which the system clock and the voltage are intentionally varied for experiments. The paper presents the fundamental scheme for dependability, the elemental technologies, and the whole architecture of the ultra dependable processor, and concludes with future work.

  • A Low-Power Instruction Issue Queue for Microprocessors

    Shingo WATANABE  Akihiro CHIYONOBU  Toshinori SATO  

     
    PAPER

      Vol:
    E91-C No:4
      Page(s):
    400-409

    The instruction issue queue is a key component that extracts instruction-level parallelism (ILP) in modern out-of-order microprocessors. To exploit more ILP and improve processor performance, the instruction queue size should be increased. However, increasing the size is difficult, since the instruction queue is implemented with a content-addressable memory (CAM) whose power consumption and delay are large. This paper introduces a low-power, scalable instruction queue that replaces the CAM with a RAM. In this queue, instructions are explicitly woken up. Evaluation results show that the proposed instruction queue decreases processor performance by only 1.9% on average, while the total energy consumption is reduced by 54% on average.
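
    A minimal sketch of RAM-based explicit wakeup (the table sizes and the successor-list format are illustrative assumptions; the paper's queue organization may differ):

        /* Explicit wakeup with a RAM instead of a CAM: a table indexed by the
         * producer's destination tag records which issue-queue entries wait on
         * it, so completing a producer sets only those ready bits instead of
         * broadcasting the tag to every entry for associative comparison. */
        #include <stdbool.h>

        #define NUM_TAGS      64
        #define IQ_ENTRIES    64
        #define MAX_CONSUMERS  4

        struct wakeup_ram_entry {
            int consumers[MAX_CONSUMERS];  /* issue-queue slots waiting on this tag */
            int count;
        };

        static struct wakeup_ram_entry wakeup_ram[NUM_TAGS];
        static bool ready[IQ_ENTRIES];     /* per-entry source-ready bits (simplified) */

        void wakeup(int producer_tag) {
            struct wakeup_ram_entry *e = &wakeup_ram[producer_tag];
            for (int i = 0; i < e->count; i++)
                ready[e->consumers[i]] = true;   /* wake only recorded consumers */
            e->count = 0;
        }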

  • Bit-Serial Single Flux Quantum Microprocessor CORE

    Akira FUJIMAKI  Masamitsu TANAKA  Takahiro YAMADA  Yuki YAMANASHI  Heejoung PARK  Nobuyuki YOSHIKAWA  

     
    INVITED PAPER

      Vol:
    E91-C No:3
      Page(s):
    342-349

    We describe the development of single-flux-quantum (SFQ) microprocessors and related technologies such as design methodology, circuit architecture, and microarchitecture. Since the microprocessors studied here are aimed at a general-purpose computing system, we employ the complexity-reduced (CORE) architecture, in which the high-speed nature of SFQ circuits is used not for increasing processor performance but for reducing circuit complexity. Bit-serial processing is the most suitable way to realize the CORE architecture. We assembled the best available technologies for SFQ integrated circuits and designed the SFQ microprocessors CORE1α, CORE1β, and CORE1γ. The CORE1β was made up of about 11,000 Josephson junctions and was successfully demonstrated. The peak performance reached 1400 million operations per second with a power consumption of 3.4 mW. We showed that the SFQ microprocessors have an advantage in performance density over semiconductor ones, which points to the potential of constructing a high-performance SFQ-circuit-based computing system.

  • Dynamic Reconfiguration of Cache Indexing in Embedded Processors

    Junhee KIM  Sung-Soo LIM  Jihong KIM  

     
    PAPER-VLSI Systems

      Vol:
    E90-D No:3
      Page(s):
    637-647

    Cache performance optimization is an important design consideration in building high-performance embedded processors. Unlike general-purpose microprocessors, embedded processors can take advantage of application-specific information in optimizing cache performance. One such example is to use modified cache index bits (instead of the conventional index bits), chosen from memory access traces of key target embedded applications, so that the number of conflict misses is reduced. In this paper, we present a novel fine-grained cache reconfiguration technique that allows intra-program reconfiguration of the cache index bits, thus better reflecting the changing characteristics of a program execution. The proposed technique, called dynamic reconfiguration of index bits (DRIB), dynamically changes the cache index bits at the function level. This compiler-directed, fine-grained approach allows each function to be executed with its own optimal index bits and requires no additional hardware support. To avoid potential performance degradation caused by the frequent cache invalidations that reconfiguring the index bits entails, we describe an efficient algorithm for selecting the target functions whose cache index bits are reconfigured. Our algorithm ensures that the number of cache misses eliminated by DRIB outnumbers the number of additional misses caused by cache invalidations. We also propose a new cache architecture, the Two-Level Indexing (TLI) cache, which further reduces the number of conflict misses by intelligently dividing the indexing into two stages. Our experimental results show that the DRIB approach combined with the TLI cache reduces the number of cache misses by 35% compared with the conventional cache indexing technique.
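
    A small sketch of per-function index-bit selection in the spirit described above (the bit-gathering helper and the per-function mask are illustrative; the paper's mechanism is compiler-directed and needs no extra hardware):

        /* Configurable cache indexing: instead of a fixed contiguous index
         * field, the set index is formed from whichever address bits the
         * current function's configuration selects, chosen offline from
         * access traces to spread its hot addresses across cache sets. */
        #include <stdint.h>

        /* Gather the address bits selected by 'mask' (LSB-first) into a dense index. */
        static unsigned gather_index(uint32_t addr, uint32_t mask) {
            unsigned idx = 0, pos = 0;
            for (int b = 0; b < 32; b++) {
                if (mask & (1u << b)) {
                    idx |= ((addr >> b) & 1u) << pos;
                    pos++;
                }
            }
            return idx;
        }

        unsigned cache_set_for(uint32_t addr, uint32_t per_function_index_mask) {
            return gather_index(addr, per_function_index_mask);  /* set number */
        }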

  • A Bootstrapped Switch for nMOS Reversible Energy Recovery Logic for Low-Voltage Applications

    Seokkee KIM  Soo-Ik CHAE  

     
    LETTER-Electronic Circuits

      Vol:
    E89-C No:5
      Page(s):
    649-652

    In this paper, we describe a bootstrapped nMOS switch that is modified to reduce leakage current for nMOS reversible energy recovery logic (nRERL) [1]. Conventional bootstrapped switches are not suitable for nRERL because they suffer non-adiabatic loss due to the leakage current that flows while the switch is boosted. We therefore lowered the gate voltage of the isolation transistor in each bootstrapped switch to reduce this leakage current. Through detailed analysis and simulation, we determined the range of the bias voltage within which the switches can transfer full-swing input signals. We implemented a simple 8-bit nRERL microprocessor in silicon and measured its energy consumption to confirm our analysis. At a supply voltage of 1.8 V and an operating frequency of 880 kHz, the microprocessor consumed about 8.5 pJ/cycle for 1.3 V < Vbias < 1.6 V, which was about half of its energy consumption at Vbias = 1.7 V.

  • Design Development of SPARC64 V Microprocessor

    Mariko SAKAMOTO  Akira KATSUNO  Aiichiro INOUE  Takeo ASAKAWA  Kuniki MORITA  Tsuyoshi MOTOKURUMADA  Yasunori KIMURA  

     
    INVITED PAPER

      Vol:
    E86-D No:10
      Page(s):
    1955-1965

    We developed a SPARC-V9 processor, the SPARC64 V. It has an operating frequency of 1.35 GHz and contains 191 million transistors, fabricated using 0.13-µm CMOS technology with eight-layer copper metallization. Its SPECjbb2000 result (CPU# 32) is 492,683, the highest on the market and 42% higher than that of the next-highest system. SPEC CPU2000 performance is 858 for SPECint and 1228 for SPECfp. The processor is designed to provide the high system performance and high reliability required of enterprise server systems, and also to address the performance requirements of high-performance computing. During our development of several generations of mainframe processors, we conducted many related experiments and acquired enterprise server system (EPS) development skills, an understanding of EPS workload characteristics, and technology that provides high reliability, availability, and serviceability. We used these as the basis for the new processor development; this approach effectively bridges the differences between mainframe and SPARC systems. At the beginning of development, before the start of hardware design, we developed a software performance simulator so that we could understand the performance impact of proposed specifications and make appropriate hardware design decisions. We took this approach to solve performance problems before tape-out and to avoid spending additional time on design updates and physical machine reconstruction. We were successful, completing the high-performance processor development on schedule and in a short time. This paper describes the SPARC64 V microprocessor and the performance analyses conducted for its design.

