IEICE globals.ieice.org Site

Keyword Search Result

[Keyword] computer architecture(11hit)

1-11hit

A Lightweight Method to Evaluate Effect of Approximate Memory with Hardware Performance Monitors
Soramichi AKIYAMA

PAPER-Computer System

Pubricized:
2019/09/02
Vol:
E102-D No:12
Page(s):
2354-2365
The latency and the energy consumption of DRAM are serious concerns because (1) the latency has not improved much for decades and (2) recent machines have huge capacity of main memory. Device-level studies reduce them by shortening the wait time of DRAM internal operations so that they finish fast and consume less energy. Applying these techniques aggressively to achieve approximate memory is a promising direction to further reduce the overhead, given that many data-center applications today are to some extent robust to bit-flips. To advance research on approximate memory, it is required to evaluate its effect to applications so that both researchers and potential users of approximate memory can investigate how it affects realistic applications. However, hardware simulators are too slow to run workloads repeatedly with different parameters. To this end, we propose a lightweight method to evaluate effect of approximate memory. The idea is to count the number of DRAM internal operations that occur to approximate data of applications and calculate the probability of bit-flips based on it, instead of using heavy-weight simulators. The evaluation shows that our system is 3 orders of magnitude faster than cycle accurate simulators, and we also give case studies of evaluating effect of approximate memory to some realistic applications.
Evaluation of Register Number Abstraction for Enhanced Instruction Register Files
Naoki FUJIEDA Kiyohiro SATO Ryodai IWAMOTO Shuichi ICHIKAWA

PAPER-Computer System

Pubricized:
2018/03/14
Vol:
E101-D No:6
Page(s):
1521-1531
Instruction set randomization (ISR) is a cost-effective obfuscation technique that modifies or enhances the relationship between instructions and machine languages. An Instruction Register File (IRF), a list of frequently used instructions, can be used for ISR by providing the way of indirect access to them. This study examines the IRF that integrates a positional register, which was proposed as a supplementary unit of the IRF, for the sake of tamper resistance. According to our evaluation, with a new design for the contents of the positional register, the measure of tamper resistance was increased by 8.2% at a maximum, which corresponds to a 32.2% increase in the size of the IRF. The number of logic elements increased by the addition of the positional register was 3.5% of its baseline processor.
An Inductive Method to Select Simulation Points
MinSeong CHOI Takashi FUKUDA Masahiro GOSHIMA Shuichi SAKAI

PAPER-Architecture

Pubricized:
2016/08/24
Vol:
E99-D No:12
Page(s):
2891-2900
The time taken for processor simulation can be drastically reduced by selecting simulation points, which are dynamic sections obtained from the simulation result of processors. The overall behavior of the program can be estimated by simulating only these sections. The existing methods to select simulation points, such as SimPoint, used for selecting simulation points are deductive and based on the idea that dynamic sections executing the same static section of the program are of the same phase. However, there are counterexamples for this idea. This paper proposes an inductive method, which selects simulation points from the results obtained by pre-simulating several processors with distinctive microarchitectures, based on assumption that sections in which all the distinctive processors have similar istructions per cycle (IPC) values are of the same phase. We evaluated the first 100G instructions of SPEC 2006 programs. Our method achieved an IPC estimation error of approximately 0.1% by simulating approximately 0.05% of the 100G instructions.
An IP Synthesizer for Limited-Resource DWT Processor
Lan-Rong DUNG

PAPER-System Level Design

Vol:
E87-A No:12
Page(s):
3047-3056
This paper presents a VLSI design methodology for the MAC-level DWT/IDWT processor based on a novel limited-resource scheduling algorithm. The r-split Fully-specified Signal Flow Graph (FSFG) of limited-resource FIR filtering has been developed for the scheduling of the MAC-level DWT/IDWT signal processing. Given a set of architecture constraints and DWT parameters, the scheduling algorithm can generate four scheduling matrices that drive the data path to perform the DWT computation. Because the memory for the inter-octave is considered with the register of FIR filter, the memory size is less than the traditional architecture. Besides, based on the limited-resource scheduling algorithm, an automated DWT processor synthesizer has been developed and generates constrained DWT processors in the form of silicon intelligent property (SIP). The DWT SIP can be embedded into a SOC or mapped to program codes for commercial off-the-shelf (COTS) DSP processors with programmable devices. As a result, it has been successfully proven that a variety of DWT SIPs can be efficiently realized by tuning the parameters and applied for signal processing applications.
Issue Queue Energy Reduction through Dynamic Voltage Scaling
Vasily G. MOSHNYAGA

PAPER-Low-Power Technologies

Vol:
E85-C No:2
Page(s):
272-278
With increased size and issue-width, instruction issue queue becomes one of the most energy consuming units in today's superscalar microprocessors. This paper presents a novel architectural technique to reduce energy dissipation of adaptive issue queue, whose functionality is dynamically adjusted at runtime to match the changing computational demands of instruction stream. In contrast to existing schemes, the technique exploits a new freedom in queue design, namely the voltage per access. Since loading capacitance operated in the adaptive queue varies in time, the clock cycle budget becomes inefficiently exploited. We propose to trade-off the unused cycle time with supply voltage, lowering the voltage level when the queue functionality is reduced and increasing it with the activation of resources in the queue. Experiments show that the approach can save up to 39% of the issue queue energy without large performance and area overhead.
Fast Precise Interrupt Handling without Associative Searching in Multiple Out-Of-Order Issue Processors
Sang-Joon NAM In-Cheol PARK Chong-Min KYUNG

PAPER-Computer Hardware and Design

Vol:
E82-D No:3
Page(s):
645-653
This paper presents a new approach to the precise interrupt handling problem in modern processors with multiple out-of-order issues. It is difficult to implement a precise interrupt scheme in the processors because later instructions may change the process states before their preceding instructions have completed. We propose a fast precise interrupt handling scheme which can recover the precise state in one cycle if an interrupt occurs. In addition, the scheme removes all the associative searching operations which are inevitable in the previous approaches. To deal with the renaming of destination registers, we present a new bank-based register file which is indexed by bank index tables containing the bank identifiers of renamed register entries. Simulation results based on the superscalar MIPS architecture show that the register file with 3 banks is a good trade-off between high performance and low complexity.
REMARC: Reconfigurable Multimedia Array Coprocessor
Takashi MIYAMORI Kunle OLUKOTUN

PAPER-Computer Hardware and Design

Vol:
E82-D No:2
Page(s):
389-397
This paper describes a new reconfigurable processor architecture called REMARC (Reconfigurable Multimedia Array Coprocessor). REMARC is a small array processor that is tightly coupled to a main RISC processor. It consists of a global control unit and 64 16-bit processors called nano processors. REMARC is designed to accelerate multimedia applications, such as video compression, decompression, and image processing. These applications typically use 8-bit or 16-bit data therefore, each nano processor has a 16-bit datapath that is much wider than those of other reconfigurable coprocessors. We have developed a programming environment for REMARC and several realistic application programs, DES encryption, MPEG-2 decoding, and MPEG-2 encoding. REMARC can implement various parallel algorithms which appear in these multimedia applications. For instance, REMARC can implement SIMD type instructions similar to multimedia instruction extensions for motion compensation of the MPEG-2 decoding. Furthermore, the highly pipelined algorithms, like systolic algorithms, which appear in motion estimation of the MPEG-2 encoding can also be implemented efficiently. REMARC achieves speedups ranging from a factor of 2.3 to 21.2 over the base processor which is a single issue processor or 2-issue superscalar processor. We also compare its performance with multimedia instruction extensions. Using more processing resources, REMARC can achieve higher performance than multimedia instruction extensions.
ASAver.1: An FPGA-Based Education Board for Computer Architecture/System Design
Hiroyuki OCHI Yoko KAMIDOI Hideyuki KAWABATA

PAPER

Vol:
E80-A No:10
Page(s):
1826-1833
This paper proposes a new approach that makes it possible for every undergraduate student to perform experiments of developing a Ipipelined RISC processor within limited time available for the course. The approach consists of 4 steps. At the first step, every student implements by himself/herself a pipelined RISC processor which is based on a given, very simple model; it has separate buses for instruction and data memory ("Harvard architecture") to avoid structural hazard, while it completely ignores data control hazards to make implementation easy. Although it is such a "defective" processor, we can test its functionality by giving object code containing sufficient amount of NOP instructions to avoid hazards. At the second step, NOP instructions are deleted and behavior of the developed processor is observed carefully to understand data and control hazards. At the third step, benchmark problems are provided, and every student challenges to improve its performance. Finally every student is requested to present how he/she improved the processor. This paper also describes a new educational FPGA board ASAver.1 which is useful for experiments from introductory class to computer architecture/system class. As a feasibility study, a 16-bit pipelined RISC processor "ASAP-O" has been developed which has eight 16-bit general purpose registers, a 16-bit program counter, and a zero flag, with 10 essential instructions.
A Supplementary Scheme for Reducing Cache Access Time
Jong-Hong BAE Chong-Min KYUNG

LETTER-Computer Hardware and Design

Vol:
E79-D No:4
Page(s):
385-387
Among three factors mainly affecting the cache access time, i. e., hit access time, miss rate and miss penalty, previous approaches were focused on reducing the hit access time and miss rate. In this paper, we propose a scheme called MPC (Miss-Predicting Cache) which achives additional reduction of the average instruction cache access time through reducing the miss penalty. The MPC scheme which predicts cache miss and starts cache miss operations in advance, therefore, is supplementary to previous cache schemes targeted for reducing the miss rate and/or hit access time. Performance of the MPC scheme was evaluated using dinero, a trace-driven cache simulator, with the estimation of silicon area using 0.8 µm CMOS standard cell library.
Neural Network Multiprocessors Applied with Dynamically Reconfigurable Pipeline Architecture
Takayuki MORISHITA Iwao TERAMOTO

PAPER-Processors

Vol:
E77-C No:12
Page(s):
1937-1943
Processing elements (PEs) with a dynamically reconfigurable pipeline architecture allow the high-speed calculation of widely used neural model which is multi-layer perceptrons with the backpropagation (BP) learning rule. Its architecture that was proposed for a single chip is extended to multiprocessors' structure. Each PE holds an element of the synaptic weight matrix and the input vector. Multi-local buses, a swapping mechanism of the weight matrix and the input vector, and transfer commands between processor elements allow the implementation of neural networks larger than the physical PE array. Estimated peak performance by the measurement of single processor element is 21.2 MCPS in the evaluation phase and 8.0 MCUPS during the learning phase at a clock frequency of 50 MHz. In the model, multi-layer perceptrons with 768 neurons and 131072 synapses are trained by a BP learning rule. It corresponds to 1357 MCPS and 512 MCUPS with 64 processor elements and 32 neurons in each PE.
COACH:A Computer Aided Design Tool for Computer Architects
Hiroki AKABOSHI Hiroto YASUURA

PAPER

Vol:
E76-A No:10
Page(s):
1760-1769
A modern architect can not design high performance computer architecture without thinking all factors of performance from hardware level (logic/layout design) to system level (application programs, operating systems, and compilers). For computer architecture design, there are few practical CAD tools, which support design activities of the architect. In this paper, we propose a CAD tool, called COACH, for computer architecture design. COACH supports architecture design from hardware level to system level. To make a high-performance general purpose computer system, the architect evaluates system performance as well as hardware level performance. To evaluate hardware level performance accurately, logic/layout synthesis tools and simulator are used for evaluation. Logic/layout synthesis tools translate the architecture design into logic circuits and layout pattern and simulator is used to get accurate information on hardware level performance which consists of clock frequency, the number of transistors, power consumption, and so on. To evaluate system level performance, a compiler generator is introducd. The compiler generator generates a compiler of a programming language from the desripition of architecture design. The designed architecture is simulated in the behavior level with programs compiled by the compiler, and the architect can get information on system level performance which consists of program execution steps, etc. From both hardware level performance and system level performance, the architect can evaluate and revise his/her architecture, considering the architecture from hardware level to system level. In this paper, we propose a new design methodology which uses () logic/layout synthesis tools and simulators as tools for architecture design and () a compiler generator for system level evaluation. COACH, a CAD system based on the methodology, is discussed and a prototype of COACH is implemented. Using the design methodology, two processors are designed. The result of the designs shows that the proposed design methodology are effective in architecture design.