Keyword Search Result

[Keyword] compiler (58 hits)

Results 21-40 of 58

  • Development and Implementation of an Interactive Parallelization Assistance Tool for OpenMP: iPat/OMP

    Makoto ISHIHARA  Hiroki HONDA  Mitsuhisa SATO  

     
    PAPER-Parallel/Distributed Programming Models, Paradigms and Tools

    Vol: E89-D No:2  Page(s): 399-407

    iPat/OMP is an interactive parallelization assistance tool for OpenMP. In this paper, we describe the design concept of iPat/OMP, the parallelization sequence achieved by the tool, and its current implementation status. In addition, we present an evaluation of the performance of the implemented functionalities. The experimental results show that iPat/OMP can detect parallelism and create appropriate OpenMP directives for several for-loops.
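
    As an illustration only (not an example from the paper), the kind of directive such a tool proposes for a loop with independent iterations looks like the following minimal C sketch; the array names and the loop are made up.

        #include <stdio.h>

        #define N 1024

        int main(void) {
            static double a[N], b[N], c[N];
            /* Independent iterations: an assistance tool such as iPat/OMP
               could annotate a loop like this with an OpenMP directive. */
            #pragma omp parallel for
            for (int i = 0; i < N; i++) {
                c[i] = a[i] + b[i];
            }
            printf("c[0] = %f\n", c[0]);
            return 0;
        }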

  • Message Scheduling for Irregular Data Redistribution in Parallelizing Compilers

    Hui WANG  Minyi GUO  Daming WEI  

     
    PAPER-Parallel/Distributed Programming Models, Paradigms and Tools

    Vol: E89-D No:2  Page(s): 418-424

    In parallelizing compilers on distributed memory systems, irregular-sized array block distributions are provided for load balancing and for irregular problems. Irregular data redistribution differs from regular block-cyclic redistribution. This paper is devoted to message scheduling for irregular data redistribution, attempting to obtain suboptimal solutions while satisfying the minimal communication cost condition and the minimal step condition. Based on list scheduling, an efficient algorithm is developed, and its experimental results are compared with previous algorithms. The improved list algorithm gives conflicting messages more chances in its relocation phase, since it allocates them using methods from a previously proposed divide-and-conquer algorithm and relocation algorithm. Selecting the smallest relocation cost guarantees that the improved list algorithm is more efficient than the other two on average.
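
    As a rough, hypothetical sketch of the step-scheduling idea (the message set, conflict test, and cost model below are invented and much simpler than the paper's algorithm), messages can be placed greedily into the first step in which neither endpoint is already busy:

        #include <stdio.h>

        /* Toy list scheduling of redistribution messages into communication
           steps: within a step, each processor may take part in at most one
           message (the minimal-step idea).  The paper's improved algorithm
           additionally relocates conflicting messages to shrink the cost of
           the most expensive step. */
        #define MSGS 5

        int main(void) {
            int src[MSGS] = { 0, 0, 1, 2, 3 };
            int dst[MSGS] = { 1, 2, 3, 3, 0 };
            int step[MSGS];
            for (int m = 0; m < MSGS; m++) {
                for (int s = 0; ; s++) {   /* first step with no processor conflict */
                    int busy = 0;
                    for (int k = 0; k < m; k++)
                        if (step[k] == s && (src[k] == src[m] || src[k] == dst[m] ||
                                             dst[k] == src[m] || dst[k] == dst[m]))
                            busy = 1;
                    if (!busy) { step[m] = s; break; }
                }
                printf("message %d -> %d scheduled in step %d\n", src[m], dst[m], step[m]);
            }
            return 0;
        }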

  • Unified Phase Compiler by Use of 3-D Representation Space

    Takefumi MIYOSHI  Nobuhiko SUGINO  

     
    PAPER

    Vol: E88-A No:4  Page(s): 838-845

    A novel unified-phase compiler framework for embedded VLIWs and DSPs is presented. In this compiler, a given program is represented in a 3-D representation space, which enables quantitative estimation of the required resources and elapsed time. Transformations of the 3-D representation graph that correspond to code optimization methods for a specific processor architecture are also proposed. The proposed compiler and code optimization methods are compared with an ordinary compiler in terms of their generated code. The results demonstrate their effectiveness.

  • Impacts of Compiler Optimizations on Address Bus Energy: An Empirical Study

    Hiroyuki TOMIYAMA  

     
    LETTER-VLSI Design Technology and CAD

    Vol: E87-A No:10  Page(s): 2815-2820

    Energy consumption is one of the most critical constraints in the design of portable embedded systems. This paper describes an empirical study of the impacts of compiler optimizations on the energy consumption of the address bus between the processor and instruction memory. Experiments using a number of real-world applications are presented, and the results show that transitions on the instruction address bus can be significantly reduced (by 85% on average) by compiler optimizations together with bus encoding.
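
    The quantity being reduced is the switching activity of the address bus, i.e. the number of bit toggles between successive addresses. A minimal sketch, with a made-up address trace, of how that metric is counted:

        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Switching activity on an address bus: the number of bit toggles
           between successive addresses.  The short trace below is invented
           purely for illustration. */
        static unsigned popcount32(uint32_t x) {
            unsigned n = 0;
            for (; x; x >>= 1)
                n += x & 1u;
            return n;
        }

        static unsigned bus_transitions(const uint32_t *addr, size_t n) {
            unsigned toggles = 0;
            for (size_t i = 1; i < n; i++)
                toggles += popcount32(addr[i] ^ addr[i - 1]);
            return toggles;
        }

        int main(void) {
            uint32_t trace[] = { 0x1000, 0x1004, 0x1008, 0x100C, 0x2000 };
            printf("%u bus transitions\n",
                   bus_transitions(trace, sizeof trace / sizeof trace[0]));
            return 0;
        }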

  • Dynamic Code Repositioning for Java

    Shinji TANAKA  Tetsuyasu YAMADA  Satoshi SHIRAISHI  

     
    PAPER-Software Support and Optimization Techniques

    Vol: E87-D No:7  Page(s): 1737-1742

    The sizes of recent Java-based server-side applications, like J2EE containers, have been increasing continuously. Past techniques for improving the performance of Java applications have targeted relatively small applications, and when the methods of such small applications are invoked, they are usually not distributed over the entire memory space. As a result, these techniques cannot be applied efficiently to improve the performance of current large applications. We propose a dynamic code repositioning approach to improve the hit rates of instruction caches and translation look-aside buffers. Profiles of method invocations are collected while the application runs under its heaviest processor load, and the code is repositioned based on these profiles. We also discuss a method-splitting technique that significantly reduces the sizes of methods. Our evaluation of a prototype implementing these techniques indicated a 5% improvement in the throughput of the application.
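
    A minimal sketch of the underlying idea, with invented method names and counts: order code by profiled invocation frequency so that hot methods become neighbors in memory. The paper's dynamic, JIT-level mechanism and its method splitting are not reproduced here.

        #include <stdio.h>
        #include <stdlib.h>

        /* Toy profile-guided repositioning: sort methods by observed
           invocation count so the hottest ones end up adjacent in memory,
           improving i-cache and TLB locality. */
        typedef struct { const char *name; long calls; } MethodProfile;

        static int hotter_first(const void *a, const void *b) {
            long d = ((const MethodProfile *)b)->calls - ((const MethodProfile *)a)->calls;
            return (d > 0) - (d < 0);
        }

        int main(void) {
            MethodProfile prof[] = {
                { "Container.service", 120000 }, { "Logger.flush",       300 },
                { "Request.parse",      95000 }, { "Session.serialize", 4200 },
            };
            size_t n = sizeof prof / sizeof prof[0];
            qsort(prof, n, sizeof prof[0], hotter_first);
            for (size_t i = 0; i < n; i++)           /* new layout order */
                printf("%zu: %s (%ld calls)\n", i, prof[i].name, prof[i].calls);
            return 0;
        }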

  • A Low-Power Tournament Branch Predictor

    Sung Woo CHUNG  Gi Ho PARK  Sung Bae PARK  

     
    LETTER-Computer Systems

    Vol: E87-D No:7  Page(s): 1962-1964

    This letter proposes a low-power tournament branch predictor in which the number of accesses to the sub-predictors (local predictor or global predictor) is reduced. Analysis results with the Samsung Memory Compiler show that the proposed branch predictor reduces power consumption by 24-45% compared to the conventional tournament branch predictor, without requiring any additional storage arrays, incurring any additional delay, or harming accuracy.
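
    A rough sketch of the general idea, assuming a simple table organization that may differ from the letter's design: the chooser is read first, and only the selected sub-predictor table is accessed for the prediction.

        #include <stdio.h>

        /* Low-power flavor of a tournament predictor: consult the chooser
           first and read only the selected sub-predictor instead of both.
           The actual gating scheme and table organization may differ. */
        #define ENTRIES 1024u

        static unsigned char chooser[ENTRIES];   /* 2-bit counters: >= 2 means "use global" */
        static unsigned char local_t[ENTRIES];   /* 2-bit saturating counters               */
        static unsigned char global_t[ENTRIES];

        static int predict_taken(unsigned pc, unsigned ghist) {
            unsigned i = pc % ENTRIES;
            if (chooser[i] >= 2)
                return global_t[(pc ^ ghist) % ENTRIES] >= 2;  /* only the global table is read */
            return local_t[i] >= 2;                            /* only the local table is read  */
        }

        int main(void) {
            chooser[0x404u % ENTRIES] = 3;
            global_t[(0x404u ^ 0x55u) % ENTRIES] = 3;
            printf("prediction: %s\n", predict_taken(0x404u, 0x55u) ? "taken" : "not taken");
            return 0;
        }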

  • Memory Data Organization for Low-Energy Address Buses

    Hiroyuki TOMIYAMA  Hiroaki TAKADA  Nikil D. DUTT  

     
    PAPER

    Vol: E87-C No:4  Page(s): 606-612

    Energy consumption has become one of the most critical constraints in the design of portable multimedia systems. For media applications, the address buses between the processor and data memory consume a considerable amount of energy due to their large capacitance and frequent accesses. This paper studies the impact of memory data organization on address bus energy. Our experiments show that address bus activity is significantly reduced, by 50%, through exploring memory data organization and encoding the address buses.

  • Bit Length Optimization of Fractional Part on Floating to Fixed Point Conversion for High-Level Synthesis

    Nobuhiro DOI  Takashi HORIYAMA  Masaki NAKANISHI  Shinji KIMURA  Katsumasa WATANABE  

     
    PAPER-Logic and High Level Synthesis

    Vol: E86-A No:12  Page(s): 3184-3191

    In hardware synthesis from a high-level language such as C, the bit length of variables is one of the key issues for area and speed optimization. Usually, designers are required to optimize the bit length of each variable manually using time-consuming simulation over huge data sets. In this paper, we propose an optimization method for the fractional bit length in the conversion from floating-point variables to fixed-point variables. The method is based on error propagation and on backward propagation of the accuracy limitation. It is fully analytical and fast compared to simulation-based methods.
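
    A toy sketch of forward error propagation and of the resulting fractional bit length, with made-up operand ranges and error bounds; the paper's analysis is more complete and also propagates the accuracy limitation backward:

        #include <math.h>
        #include <stdio.h>

        /* Toy forward error propagation for a fixed-point multiply, then the
           fractional bit length needed to keep the total error below a bound. */
        int main(void) {
            double ax = 1.5, ay = 2.25;   /* magnitudes of the operands            */
            double ex = 1e-4, ey = 1e-4;  /* absolute error bounds of the operands */
            double e_mul = fabs(ax) * ey + fabs(ay) * ex + ex * ey;  /* |e(x*y)| bound */
            double target = 1e-3;         /* required accuracy of the result       */
            /* Quantizing the result to f fractional bits adds at most 2^-f error,
               so choose the smallest f with 2^-f <= target - e_mul. */
            int f = (int)ceil(-log2(target - e_mul));
            printf("propagated error %.3e, need %d fractional bits\n", e_mul, f);
            return 0;
        }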

  • Efficient Loop Partitioning for Parallel Codes of Irregular Scientific Computations

    Minyi GUO  

     
    PAPER-Software Systems

    Vol: E86-D No:9  Page(s): 1825-1834

    In most cases of distributed-memory computation, node programs are executed on processors according to the owner-computes rule. However, the owner-computes rule is not well suited to irregular application codes. In irregular codes, the use of indirection in accessing the left-hand-side array makes it difficult to partition the loop iterations, and because of indirection in accessing right-hand-side elements, total communication may be reduced by using heuristics other than the owner-computes rule. In this paper, we propose a communication-cost-reducing computes rule for irregular loop partitioning, called the least-communication computes rule: each loop iteration is partitioned to the processor on which executing that iteration incurs the minimal communication cost. Then, after all iterations are partitioned among the processors, we give a global-to-local data transformation rule, indirection-array remapping, and communication optimization methods. The experimental results show that, in most cases, our approaches achieve better performance than other loop partitioning rules.
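
    A minimal sketch of the least-communication idea, with an invented block ownership function, indirection arrays, and cost model: each iteration goes to the processor that owns most of the data it references.

        #include <stdio.h>

        /* Toy "least communication" loop partitioning: each iteration is
           assigned to the processor with the smallest number of non-local
           references.  Ownership and the cost model are simplified. */
        #define ITERS 6
        #define PROCS 2

        static int owner(int elem) { return elem < 8 ? 0 : 1; }  /* block distribution */

        int main(void) {
            /* iteration i reads x[idx1[i]] and x[idx2[i]] via indirection arrays */
            int idx1[ITERS] = { 0,  9,  3, 12,  5, 14 };
            int idx2[ITERS] = { 1, 10, 11, 13,  6,  2 };
            for (int i = 0; i < ITERS; i++) {
                int best = 0, best_cost = 3;
                for (int p = 0; p < PROCS; p++) {
                    int cost = (owner(idx1[i]) != p) + (owner(idx2[i]) != p);
                    if (cost < best_cost) { best_cost = cost; best = p; }
                }
                printf("iteration %d -> processor %d (cost %d)\n", i, best, best_cost);
            }
            return 0;
        }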

  • Parallel Molecular Dynamics in a Parallelizing SML Compiler

    Norman SCAIFE  Ryoko HAYASHI  Susumu HORIGUCHI  

     
    PAPER-Software Systems and Technologies

    Vol: E86-D No:9  Page(s): 1569-1576

    We have constructed a parallelizing compiler for Standard ML (SML) based upon algorithmic skeletons. We present an implementation of a Parallel Molecular Dynamics (PMD) simulation in order to compare our functional approach with a traditional imperative approach. Although we present performance data, the principal benefits of our approach are the modularity of the code and the ease of programming. Existing FORTRAN90 code for an O(N^2) algorithm is translated, first into imperative SML and then into purely functional SML, which is then parallelized. The ease of programming and the performance of the FORTRAN90 and SML code are compared. Modest parallel performance is obtained from the parallel SML, but with a much slower sequential execution time than the FORTRAN90. We then improve the implementation with a ring-topology version whose performance is much closer to that of the FORTRAN90 implementation.

  • Multigrain Parallel Processing on Compiler Cooperative OSCAR Chip Multiprocessor Architecture

    Keiji KIMURA  Takeshi KODAKA  Motoki OBATA  Hironori KASAHARA  

     
    PAPER-Architecture and Algorithms

    Vol: E86-C No:4  Page(s): 570-579

    This paper describes multigrain parallel processing on the OSCAR (Optimally SCheduled Advanced multiprocessoR) chip multiprocessor architecture. The OSCAR compiler-cooperative chip multiprocessor architecture aims at a scalable, cost-effective chip multiprocessor with high effective performance and ease of use through compiler support. The architecture integrates simple single-issue processors having distributed shared data memory, for optimal use of data locality over different loops and for fine-grain data transfer and synchronization; local data memory, for private data recognized by the compiler; and a compiler-controllable data transfer unit, for overlapping data transfers to hide their overhead. The OSCAR chip multiprocessor and the OSCAR multigrain parallelizing compiler have been developed simultaneously. The performance of multigrain parallel processing on the OSCAR architecture is evaluated using the SPEC fp 2000/95 benchmark suites. When a microSPARC-like single-issue core is used, the OSCAR chip multiprocessor architecture gives speedups of 2.36 in fpppp, 2.64 in su2cor, 2.88 in turb3d, 2.98 in hydro2d, 3.84 in tomcatv, 3.84 in mgrid, and 3.97 in swim for four processors against a single processor.

  • A Compiler Generation Method for HW/SW Codesign Based on Configurable Processors

    Shinsuke KOBAYASHI  Kentaro MITA  Yoshinori TAKEUCHI  Masaharu IMAI  

     
    PAPER-Hardware/Software Codesign

    Vol: E85-A No:12  Page(s): 2586-2595

    This paper proposes a compiler generation method for PEAS-III (Practical Environment for ASIP development), a configurable processor development environment for application-domain-specific embedded systems. Using the PEAS-III system, not only the HDL description of a target processor but also its target compiler can be generated, so execution cycles and dynamic power consumption can be evaluated rapidly. In the experiments, two processors and their derivatives were designed using the PEAS-III system. The results show that the trade-offs among area, performance, and power consumption of the processors were analyzed in about twelve hours, and the optimal processor was selected under the design constraints by using the generated compilers and processors.

  • Loop and Address Code Optimization for Digital Signal Processors

    Jong-Yeol LEE  In-Cheol PARK  

     
    LETTER-Digital Signal Processing

    Vol: E85-A No:6  Page(s): 1408-1415

    This paper presents a new DSP-oriented code optimization method that enhances performance by exploiting the specific architectural features of digital signal processors. In the proposed method, source code is translated into static single assignment form while preserving the high-level information related to loops and to the address computation of array accesses. This information is used to generate the hardware loop instructions and parallel instructions provided by most digital signal processors. In addition to the conventional control-data flow graph, a new graph is employed to find auto-modification addressing modes efficiently. Experimental results on benchmark programs show that the proposed method is effective in improving performance.

  • Code Optimization Technique for Indirect Addressing DSPs with Consideration in Local Computational Order and Memory Allocation

    Nobuhiko SUGINO  Akinori NISHIHARA  

     
    PAPER-Implementations of Signal Processing Systems

    Vol: E84-A No:8  Page(s): 1960-1968

    Digital signal processors (DSPs) usually employ indirect addressing through address registers (ARs) to indicate memory addresses, which often introduces overhead code for updating ARs before the next memory accesses. Reducing such overhead code is one of the important issues in the automatic generation of highly efficient DSP code. In this paper, a new automatic address allocation method incorporating computational order rearrangement at locally commutative parts is proposed. The method formulates a given memory access sequence as a graph, and several strategies for handling the freedom in memory access order at the commutative parts are introduced and examined. The compiler scheme is also extended so that the computational order at the commutative parts is rearranged according to the derived memory allocation. The proposed methods are applied to an existing DSP compiler for the µPD77230 (NEC), and the code generated for several examples is compared with that obtained from memory allocations by conventional methods.
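
    A toy sketch of the cost being minimized, with an invented access sequence and allocation: an explicit AR update is charged whenever the next address cannot be reached by post-increment or post-decrement.

        #include <stdio.h>

        /* Toy cost model for indirect addressing with auto-modification: an
           access needs an extra AR-update instruction whenever the next
           address is not the current one or +/-1 from it.  The paper tunes
           both the memory layout and the local computational order against
           this kind of model. */
        static int ar_update_cost(const int *addr, int n) {
            int cost = 0;
            for (int i = 1; i < n; i++) {
                int d = addr[i] - addr[i - 1];
                if (d < -1 || d > 1)
                    cost++;               /* auto-modification cannot reach it */
            }
            return cost;
        }

        int main(void) {
            int scattered[] = { 0, 3, 1, 4, 2 };  /* one allocation of five variables     */
            int matched[]   = { 0, 1, 2, 3, 4 };  /* allocation matching the access order */
            printf("AR-update overhead: %d vs %d\n",
                   ar_update_cost(scattered, 5), ar_update_cost(matched, 5));
            return 0;
        }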

  • Hardware Synthesis from C Programs with Estimation of Bit Length of Variables

    Osamu OGAWA  Kazuyoshi TAKAGI  Yasufumi ITOH  Shinji KIMURA  Katsumasa WATANABE  

     
    PAPER

    Vol: E82-A No:11  Page(s): 2338-2346

    In hardware synthesis methods based on high-level languages such as C, the optimization quality of the compiler has a great influence on the area and speed of the synthesized circuits. Among the hardware-oriented optimization methods required in such compilers, minimization of the bit length of the data-paths is one of the most important issues. In this paper, we propose an algorithm that estimates the necessary bit length of variables for this purpose. The algorithm analyzes the control/data-flow graph translated from the C program and decides the bit length of each variable. In several experiments, the bit length of variables was reduced by half with respect to the declared length. This method is effective not only for reducing circuit area but also for reducing the delay of operation units such as adders.
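
    A minimal sketch of range-based bit-length estimation with made-up value ranges; the paper obtains such ranges from the control/data-flow graph rather than from declarations.

        #include <stdio.h>

        /* Toy range-based bit-length estimation: the bits needed to hold
           every value in [lo, hi] in two's complement. */
        static int bits_for_range(long lo, long hi) {
            int n = 1;                                  /* at least the sign bit */
            while (lo < -(1L << (n - 1)) || hi > (1L << (n - 1)) - 1)
                n++;
            return n;
        }

        int main(void) {
            printf("loop counter 0..1023 : %d bits\n", bits_for_range(0, 1023));   /* 11 */
            printf("sample      -128..127: %d bits\n", bits_for_range(-128, 127)); /*  8 */
            return 0;
        }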

  • Fast Compiler Re-Targeting to Different Platforms by Translating at Intermediate Code Level

    Norio SATO  

     
    PAPER-Communication Software

    Vol: E82-B No:6  Page(s): 923-935

    An intermediate language (IL) modularizes a compiler into target-processor-independent and -dependent parts, called the front-end and the back-end. By adding a new back-end, it is possible to port existing software from one processor to another. This paper presents a new, efficient approach to targeting quite different architectures, and different processors as well, by translating one IL into other existing ILs. This approach makes it possible to reuse existing back-ends. It has been applied successfully to a commercial-scale project for porting public switching system software. Since the target ILs were not predictable in advance, we provided an abstract syntax tree (AST) with attributes, accessible through an abstract data type (ADT) interface, to convey the source-language information from our front-end to the back-ends. The AST was translated into several ILs that had been developed independently. These translations made the compiler available in a very short time for the different cross-target platforms and the several workstations we needed. The structure of this AST and the mapping to these ILs are presented, and the retargeting cost is evaluated.

  • An Analysis for Fast Construction of States in the Bottom-Up Tree Pattern Matching Scheme

    Kyung-Woo KANG  Kwang-Moo CHOE  Min-Soo JUNG  

     
    PAPER-Software System

    Vol: E82-D No:5  Page(s): 973-976

    In this paper, we propose an efficient method of constructing states in bottom-up tree pattern matching with the dynamic programming technique for optimal code generation. The method is derived by precomputing the analysis needed for constructing states. The proposed scheme is more efficient than other schemes because unfruitful tests are avoided when constructing states at compile time. Furthermore, the relevant analyses needed for this proposal are largely performed at compile-compile time, which secures actual efficiency at compile time.
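
    For background, a tiny sketch of the dynamic-programming labeling that bottom-up matchers perform, with an invented two-pattern instruction set; the paper's contribution, precomputing the state-construction analysis, is not shown here.

        #include <limits.h>
        #include <stdio.h>

        /* Bottom-up tree pattern matching with dynamic programming: each node
           records the cheapest cost of reducing it to a register. */
        typedef enum { CONST, ADD } Op;
        typedef struct Node { Op op; struct Node *l, *r; int cost; } Node;

        static void label(Node *n) {
            if (n->op == CONST) { n->cost = 1; return; }        /* li   rd, imm     */
            label(n->l); label(n->r);
            int plain = n->l->cost + n->r->cost + 1;            /* add  rd, ra, rb  */
            int imm   = (n->r->op == CONST) ? n->l->cost + 1    /* addi rd, ra, imm */
                                            : INT_MAX;
            n->cost = imm < plain ? imm : plain;
        }

        int main(void) {
            Node x = { CONST, NULL, NULL, 0 }, c = { CONST, NULL, NULL, 0 };
            Node add = { ADD, &x, &c, 0 };
            label(&add);
            printf("cheapest cover: %d instructions\n", add.cost);  /* 2: li + addi */
            return 0;
        }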

  • A Binding Algorithm for Retargetable Compilation to Non-orthogonal DSP Architectures

    Masayuki YAMAGUCHI  Nagisa ISHIURA  Takashi KAMBE  

     
    PAPER-Compiler

    Vol: E81-A No:12  Page(s): 2630-2639

    This paper presents a new binding algorithm for a retargetable compiler that can deal with diverse architectures of application-specific embedded processors. This architectural diversity includes "non-orthogonal" datapath configurations, in which not all registers are equally accessible by all functional units. Under this assumption, binding becomes a hard task because an inadvertent assignment of an operation to a functional unit may rule out possible assignments of other operations due to unreachability among datapath resources. We propose a new BDD-based algorithm to solve this problem. While most conventional methods are based on covering expression trees obtained by decomposing DFGs, our algorithm works directly on the DFGs so as to avoid infeasible bindings. In the experiments, a feasible binding that satisfies reachability is found, or the deficiency of the datapath is detected, within a few seconds.

  • Language and Compiler for Optimizing Datapath Widths of Embedded Systems

    Akihiko INOUE  Hiroyuki TOMIYAMA  Takanori OKUMA  Hiroyuki KANBARA  Hiroto YASUURA  

     
    PAPER-Co-design

    Vol: E81-A No:12  Page(s): 2595-2604

    The datapath width of a core processor has a strong effect on the cost, power consumption, and performance of an embedded system integrated with memories on a single chip. However, it is difficult for designers to determine the appropriate datapath width for each application because of the limited reusability of software and the lack of compilation techniques. The purpose of this paper is to clarify the support required from software for optimal datapath width determination. As a solution, an embedded programming language called Valen-C and a retargetable Valen-C compiler are proposed. This paper describes the syntax and semantics of Valen-C, the mechanism of the retargetable Valen-C compiler, and how the accuracy of computation in programs is preserved across various datapath widths. Experiments with practical applications show that the total cost of the system, including the core processor, ROM, and RAM, is drastically reduced with little performance loss by reducing the datapath width.

  • Instruction Scheduling to Reduce Switching Activity of Off-Chip Buses for Low-Power Systems with Caches

    Hiroyuki TOMIYAMA  Tohru ISHIHARA  Akihiko INOUE  Hiroto YASUURA  

     
    PAPER-Compiler

    Vol: E81-A No:12  Page(s): 2621-2629

    In many embedded systems, a significant amount of power is consumed in off-chip driving because off-chip capacitances are much larger than on-chip capacitances. This paper proposes instruction scheduling techniques to reduce the power consumed by off-chip driving. The techniques minimize the switching activity of the data bus between an on-chip cache and main memory when instruction cache misses occur. The scheduling problem is formulated, and two scheduling algorithms are presented. Experimental results demonstrate the effectiveness and efficiency of the proposed algorithms.
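
    A toy sketch of the objective, with made-up instruction words and a greedy, dependence-free reordering: minimize the Hamming distance between words driven consecutively over the bus.

        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Greedy reordering of (assumed independent) instruction words so that
           consecutive words differ in few bits, reducing switching on the bus
           that refills the instruction cache.  The paper handles the real
           problem, including dependence constraints, with two algorithms. */
        static int hamming(uint32_t a, uint32_t b) {
            int d = 0;
            for (uint32_t x = a ^ b; x; x >>= 1)
                d += (int)(x & 1u);
            return d;
        }

        int main(void) {
            uint32_t word[4] = { 0x8C010004u, 0x00221820u, 0x8C220008u, 0x00642020u };
            int used[4] = { 1, 0, 0, 0 }, cur = 0, total = 0;
            printf("0x%08X\n", (unsigned)word[0]);
            for (int k = 1; k < 4; k++) {            /* pick the closest unused word */
                int best = -1, bestd = 33;
                for (int i = 0; i < 4; i++)
                    if (!used[i] && hamming(word[cur], word[i]) < bestd) {
                        bestd = hamming(word[cur], word[i]);
                        best = i;
                    }
                used[best] = 1; total += bestd; cur = best;
                printf("0x%08X  (+%d toggles)\n", (unsigned)word[cur], bestd);
            }
            printf("total toggles between consecutive words: %d\n", total);
            return 0;
        }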

