Ryota SHIOYA Naruki KURATA Takashi TOYOSHIMA Masahiro GOSHIMA Shuichi SAKAI
Object-oriented languages have recently become common, making register indirect jumps more important than ever. In object-oriented languages, virtual functions are heavily used because they greatly improve programming productivity. Virtual function calls usually consist of register indirect jumps, and consequently, programs written in object-oriented languages contain many register indirect jumps. Predicting the targets of register indirect jumps is more difficult than predicting the direction of conditional branches. Many predictors have been proposed for register indirect jumps, but they either cannot predict the jump targets with high accuracy or require very complex hardware. We propose a method that resolves jump targets by forwarding execution results. Our proposal dynamically finds the producers of register indirect jumps in virtual function calls. After the producers are executed, their execution results are forwarded to the processor's front-end. The jump targets can then be resolved from the forwarded execution results without requiring prediction. Our proposal improves the performance of programs that include unpredictable register indirect jumps, because it does not rely on prediction but instead uses actual execution results. Our evaluation shows that our proposal improves IPC by 5.4% on average and by up to 9.8%.
Junji YAMADA Ushio JIMBO Ryota SHIOYA Masahiro GOSHIMA Shuichi SAKAI
The region that includes the register file is a hot spot in high-performance cores that limits the clock frequency. Although multibanking drastically reduces the area and energy consumption of the register files of superscalar processor cores, it suffers from low IPC due to bank conflicts. Our skewed multistaging uses a second stage to drastically reduce not the bank conflict probability but the pipeline disturbance probability. The evaluation results show that, compared with NORCS, which is the latest research on a register file for area and energy efficiency, the proposed register file with 18 banks achieves reductions of 39.9% in circuit area and 66.4% in energy consumption, while maintaining a relative IPC of 97.5%.
Ryota SHIOYA Daewung KIM Kazuo HORIO Masahiro GOSHIMA Shuichi SAKAI
A security-tagged architecture is one that attaches tags to data and tracks data flow in order to detect attacks or information leakage. Previous studies on security-tagged architectures mostly focused on how to utilize tags, not on how the tags are implemented. A naive implementation of tags simply adds a tag field to every byte of the cache and the memory. Such a technique, however, results in a huge hardware overhead. This paper proposes a low-overhead tagged architecture. We achieve our goal by exploiting two properties of tags: non-uniformity and locality of reference. Our design includes a uniquely designed multi-level table and various cache-like structures, all of which contribute to exploiting these properties. Under simulation, our method limited the memory overhead to 0.685%, whereas a naive implementation suffered a 12.5% overhead.
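The 12.5% figure for the naive implementation is what one would expect if one tag bit is attached to every 8-bit byte. The abstract does not state the tag width, so the one-bit-per-byte assumption below is an inference, and the function names are hypothetical. A minimal Python sketch of the arithmetic:

```python
def naive_tag_overhead(tag_bits_per_byte: int = 1) -> float:
    """Memory overhead of storing a tag field alongside every byte.

    Assumption (not stated in the abstract): the tag is 1 bit wide,
    which gives 1/8 = 12.5% extra storage per data byte.
    """
    return tag_bits_per_byte / 8


def overhead_reduction_factor(naive: float, proposed: float) -> float:
    """How many times smaller the proposed overhead is than the naive one."""
    return naive / proposed


if __name__ == "__main__":
    naive = naive_tag_overhead(1)   # 0.125, i.e. 12.5%
    proposed = 0.00685              # 0.685%, as reported in the abstract
    print(f"naive overhead: {naive:.3%}")
    print(f"reduction factor: {overhead_reduction_factor(naive, proposed):.1f}x")
```

Under that assumption, the reported 0.685% corresponds to roughly an 18-fold reduction in tag storage.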
Michihiro AOKI Miki HIRANO Nobuaki MATSUURA Takashi KURIMOTO Takashi MIYAMURA Masahiro GOSHIMA Keisuke KABASHIMA Shigeo URUSHIDANI
The growth in the volume of Internet traffic and the increasing variety of Internet applications require Internet backbone networks to be scalable and to provide sophisticated quality of service (QoS) capabilities. Internet backbone routers have evolved to achieve sub-Tbps switching capacity in a single unit, but their switch architectures have limited scalability, causing QoS to degrade as the switches get bigger. Hence, we propose a large-scale IP and lambda integrated router architecture with scalable switches. We first describe the system architecture of our proposed backbone router and clarify the requirements for its switching capabilities to meet near-future demands. The new switch architecture, using crossbar-based switching fabrics and optical interconnection devices, meets the requirements for a backbone router to scale up to 82 Tbps and enables light path switching as well as packet switching. The routing tag and its usage algorithm in the switch, and packaging issues, including the quantity of hardware required for expansion, are also discussed.
MinSeong CHOI Takashi FUKUDA Masahiro GOSHIMA Shuichi SAKAI
The time taken for processor simulation can be drastically reduced by selecting simulation points, which are dynamic sections obtained from the simulation result of processors. The overall behavior of the program can be estimated by simulating only these sections. Existing methods for selecting simulation points, such as SimPoint, are deductive and based on the idea that dynamic sections executing the same static section of the program are of the same phase. However, there are counterexamples to this idea. This paper proposes an inductive method, which selects simulation points from the results obtained by pre-simulating several processors with distinctive microarchitectures, based on the assumption that sections in which all the distinctive processors have similar instructions per cycle (IPC) values are of the same phase. We evaluated the first 100G instructions of SPEC 2006 programs. Our method achieved an IPC estimation error of approximately 0.1% by simulating approximately 0.05% of the 100G instructions.
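The inductive idea can be illustrated with a small Python sketch. It groups sections whose IPC values are similar on every pre-simulated processor, then estimates the whole-program IPC on a target processor from one representative per group. The greedy grouping rule, the tolerance `tol`, the equal-length-section assumption, and all names are illustrative assumptions; the abstract does not specify the actual grouping algorithm.

```python
from typing import Dict, List, Sequence


def group_by_ipc(ipc_vectors: Sequence[Sequence[float]],
                 tol: float = 0.05) -> List[List[int]]:
    """Greedily group sections into phases.

    ipc_vectors[s][p] is the IPC of section s on pre-simulated processor p.
    A section joins the first group whose representative (the group's first
    member) has an IPC within relative tolerance `tol` on every processor.
    """
    groups: List[List[int]] = []
    for idx, vec in enumerate(ipc_vectors):
        for group in groups:
            rep = ipc_vectors[group[0]]
            if all(abs(a - b) <= tol * b for a, b in zip(vec, rep)):
                group.append(idx)
                break
        else:
            groups.append([idx])  # no similar group found: start a new phase
    return groups


def estimate_ipc(groups: List[List[int]],
                 rep_ipc: Dict[int, float]) -> float:
    """Estimate overall IPC on a target processor.

    rep_ipc maps each representative section index to its IPC on the target.
    Sections are assumed to contain equal numbers of instructions, so total
    cycles are proportional to sum(group_size / representative_ipc).
    """
    total_sections = sum(len(g) for g in groups)
    total_cycles = sum(len(g) / rep_ipc[g[0]] for g in groups)
    return total_sections / total_cycles


if __name__ == "__main__":
    # Four sections measured on two hypothetical pre-simulated processors.
    vectors = [(1.0, 2.0), (1.01, 2.02), (0.5, 0.9), (0.5, 0.91)]
    phases = group_by_ipc(vectors)
    print(phases)
    print(estimate_ipc(phases, {0: 2.0, 2: 1.0}))
```

Only the representatives (sections 0 and 2 in the example) would need detailed simulation on the target processor, which is where the reduction in simulation time comes from.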
Shin-ichiro MORI Tomoaki TSUMURA Masahiro GOSHIMA Yasuhiko NAKASHIMA Hiroshi NAKASHIMA Shinji TOMITA
This paper describes the architecture of ReVolver/C40, a scalable parallel machine for volume rendering, and its prototype implementation. The most important feature of ReVolver/C40 is view-independent real-time rendering of translucent 3D objects using perspective projection. To realize this feature, the authors propose a parallel volume memory architecture based on the principal axis oriented sampling method and parallel treble volume memory. This paper also discusses the implementation issues of ReVolver/C40, explaining the various kinds of parallelism extracted to achieve high-performance rendering. Prototype systems have been developed, and their performance evaluation results are presented. Based on the evaluation of the prototype systems, ReVolver/C40 with 32 parallel volume memories is estimated to achieve more than 10 frames per second for 256^3 volume data on a 256^2 screen using perspective projection. The authors also review the development of ReVolver/C40 from several viewpoints.
Shuichi SAKAI Masahiro GOSHIMA Hidetsugu IRIE
This paper presents a processor architecture that provides a much higher level of dependability than current ones. Its features are: (1) fault tolerance and secure processing are integrated into a modern superscalar VLSI processor; (2) light-weight, effective soft-error tolerant mechanisms are proposed and evaluated; (3) timing errors in random logic and registers are prevented by low-overhead mechanisms; (4) program behavior is hidden from the outside world by the proposed address translation methods; (5) information leakage can be avoided by attaching policy tags to all data and monitoring them at each instruction execution; (6) injection attacks are avoided with much higher accuracy than in current systems by providing tag tracking; (7) the overall structure of the dependable processor is proposed with a dependability manager that controls the detection of illegal conditions and the recovery to normal mode; and (8) an FPGA-based testbed system is developed in which the system clock and the voltage are intentionally varied for experiments. The paper presents the fundamental scheme for dependability, the elemental technologies for dependability, and the whole architecture of the ultra-dependable processor. It concludes with future work.
Ryota SHIOYA Ryo TAKAMI Masahiro GOSHIMA Hideki ANDO
Out-of-order superscalar processors have high performance but consume a large amount of energy for dynamic instruction scheduling. We propose a front-end execution architecture (FXA) for improving the energy efficiency of out-of-order superscalar processors. FXA has two execution units: an out-of-order execution unit (OXU) and an in-order execution unit (IXU). The OXU is the execution core of a common out-of-order superscalar processor. In contrast, the IXU consists only of functional units and a bypass network. The IXU is placed at the processor front end and executes instructions in order. The IXU functions as a filter for the OXU. Fetched instructions are first fed to the IXU, and the instructions are executed in order if they are ready to execute. The instructions executed in the IXU are removed from the instruction pipeline and are not executed in the OXU. The IXU does not include dynamic scheduling logic, and thus its energy consumption is low. Evaluation results show that FXA can execute more than 50% of the instructions in the IXU, thereby making it possible to shrink the energy-consuming OXU without incurring performance degradation. As a result, FXA achieves both high performance and low energy consumption. We evaluated FXA and compared it with conventional out-of-order and in-order superscalar processors modeled after the ARM big.LITTLE architecture. The results show that FXA achieves a performance improvement of 7.4% on geometric mean in the SPECCPU INT 2006 benchmark suite relative to a conventional superscalar processor (big), while reducing the energy consumption of the entire processor by 17%. The performance/energy ratio (the inverse of the energy-delay product) of FXA is 25% higher than that of a conventional superscalar processor (big) and 27% higher than that of a conventional in-order superscalar processor (LITTLE).
Ushio JIMBO Junji YAMADA Ryota SHIOYA Masahiro GOSHIMA
Timing fault detection techniques address the problems caused by increased variations on a chip, especially with dynamic voltage and frequency scaling (DVFS). The Razor flip-flop (FF) is a timing fault detection technique that employs double sampling by the main and shadow FFs. For the Razor FF to correctly detect a timing fault, the shadow FF, rather than the main FF, must sample the correct value. The application of Razor FFs to static logic relaxes the timing constraints; however, the naive application of Razor FFs to dynamic precharged logic such as SRAM read circuits is not effective. This is because the SRAM precharge cannot start before the shadow FF samples the value; otherwise, the transition of the bitline of the SRAM stops and the value sampled by the shadow FF will be incorrect. Therefore, the detect period cannot overlap the precharge period. This paper proposes a novel application of Razor FFs to SRAM read circuits. Our proposal employs a conditional precharge according to the value of a bitline sampled by the main FF. This enables the detect period to overlap the precharge period, thereby relaxing the timing constraints. The additional circuit required by this method is simple and only needed around the sense amplifier, and there is no need for a clock delayed from the system clock. Consequently, the area overhead of the proposed circuit is negligible. This paper presents SPICE simulations of the proposed circuit. Our proposal reduces the minimum cycle time by 51.5% at a supply voltage of 1.1 V and the minimum voltage by 31.8% at a cycle time of 412.5 ps.
Naruki KURATA Ryota SHIOYA Masahiro GOSHIMA Shuichi SAKAI
To eliminate CAMs from the load/store queues, several techniques that detect memory access order violations with hash filters composed of RAMs have been proposed. This paper proposes a technique with parallel counting Bloom filters (PCBF). A Bloom filter has an extremely low false positive rate owing to its multiple hash functions. Although some existing studies claim to use Bloom filters, none of them mentions multiple hash functions. This paper also addresses the problem arising from the variety of access sizes of load/store instructions. The evaluation results show that our technique, with only 2720-bit Bloom filters, achieves a relative IPC of 99.0%, while the area and power consumption are greatly reduced to 14.3% and 22.0%, respectively, of those of a conventional model with CAMs. The filter is much smaller than typical branch predictors.
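To illustrate why multiple hash functions matter, here is a minimal Python sketch of a generic counting Bloom filter. It is not the paper's PCBF: the bank organization (one counter array per hash function), the bank size, the use of salted BLAKE2b as the hash family, and all names are assumptions made for illustration only.

```python
import hashlib
from typing import Iterator, List


class CountingBloomFilter:
    """Generic counting Bloom filter with k independent hash functions.

    Each hash function indexes its own bank of counters (an assumed
    organization). Counters, rather than single bits, allow removal.
    """

    def __init__(self, num_hashes: int = 4, counters_per_bank: int = 64) -> None:
        self.k = num_hashes
        self.m = counters_per_bank
        self.banks: List[List[int]] = [[0] * self.m for _ in range(self.k)]

    def _indices(self, address: int) -> Iterator[int]:
        """Yield one counter index per bank, using a differently salted hash each."""
        data = address.to_bytes(8, "little")
        for i in range(self.k):
            digest = hashlib.blake2b(data, digest_size=4, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "little") % self.m

    def insert(self, address: int) -> None:
        for bank, idx in zip(self.banks, self._indices(address)):
            bank[idx] += 1

    def remove(self, address: int) -> None:
        """Remove a previously inserted address."""
        for bank, idx in zip(self.banks, self._indices(address)):
            bank[idx] -= 1

    def may_contain(self, address: int) -> bool:
        """False means definitely absent; True may be a false positive."""
        return all(bank[idx] > 0
                   for bank, idx in zip(self.banks, self._indices(address)))


if __name__ == "__main__":
    f = CountingBloomFilter()
    f.insert(0x1000)              # e.g. the address of an in-flight load
    print(f.may_contain(0x1000))  # True: inserted entries are never missed
    f.remove(0x1000)              # the load leaves the window
    print(f.may_contain(0x1000))  # False: all counters are back to zero
```

A false positive requires a collision in every bank at once, which is why the false positive rate falls quickly as hash functions are added.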
Junji YAMADA Ushio JIMBO Ryota SHIOYA Masahiro GOSHIMA Shuichi SAKAI
An 8-issue superscalar core generally requires a 24-port RAM for the register file. The area and energy consumption of a multiported RAM increase in proportion to the square of the number of ports. A register cache can reduce the area and energy consumption of the register file. However, earlier register cache systems suffer from lower IPC caused by register cache misses. Thus, we proposed the Non-Latency-Oriented Register Cache System (NORCS) to solve the IPC problem with a modified pipeline. In the original article, we evaluated NORCS mainly from the viewpoint of microarchitecture, and showed that NORCS maintains almost the same IPC as conventional register files. Researchers at NVIDIA adopted the same idea for their GPUs. However, the evaluation was not sufficient from the viewpoint of LSI design. In the original article, we used CACTI to evaluate the area and energy consumption. CACTI is a design space exploration tool for cache design, and adopts some rough approximations. Therefore, this paper presents a design of NORCS with FreePDK45, an open source process design kit for 45 nm technology. We performed manual layout of the memory cells and arrays of NORCS, and executed SPICE simulation with RC parasitics extracted from the layout. The results show that, compared with a full-port register file, an 8-entry NORCS achieves reductions of 75.2% in area and 48.2% in energy consumption. The results also include the latency, which we did not present in our original article. The critical-path latencies are 307 ps and 318 ps for an 8-entry NORCS and a conventional multiported register file, respectively, when the same two cycles are allocated to register file read.
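The port count and the quadratic scaling can be checked with a short Python sketch. It assumes each issued instruction reads two source registers and writes one destination register, which is a common convention but is not stated in the abstract. The function names are hypothetical.

```python
def register_file_ports(issue_width: int,
                        reads_per_inst: int = 2,
                        writes_per_inst: int = 1) -> int:
    """Ports needed so every issued instruction can read and write at once.

    Assumption: two source reads and one destination write per instruction.
    """
    return issue_width * (reads_per_inst + writes_per_inst)


def relative_cost(ports: int, baseline_ports: int) -> float:
    """Area (or energy) relative to a baseline, under the quadratic model
    in which cost grows with the square of the port count."""
    return (ports / baseline_ports) ** 2


if __name__ == "__main__":
    ports = register_file_ports(8)   # 8 * (2 + 1) = 24 ports
    print(ports)
    print(relative_cost(ports, 12))  # doubling the ports quadruples the cost
```

Under this model, halving the port count of the main register file cuts its area to a quarter, which is the motivation for placing a small register cache in front of it.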