Author Search Result

[Author] Hideharu AMANO (67 hits)

Showing 1-20 of 67 hits

  • A Link Removal Methodology for Application-Specific Networks-on-Chip on FPGAs

    Daihan WANG  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-VLSI Systems

      Vol:
    E92-D No:4
      Page(s):
    575-583

    The regular 2-D mesh topology has been used for most Networks-on-Chip (NoCs) on FPGAs. Since the spatially biased traffic generated by some applications leaves certain links with low utilization, a customization method that removes links can be effective. In this paper, a link-removal strategy that customizes the NoC routers is proposed for reconfigurable systems in order to minimize the required hardware. Based on pre-analyzed traffic information, links carrying little communication are removed to reduce the hardware cost while maintaining adequate performance. Two policies are proposed to avoid deadlocks, and they outperform up*/down* routing, a representative deadlock-free routing algorithm for irregular topologies. For the image recognition application susan, the proposed method saves 30% of the hardware without performance degradation.
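
    A minimal sketch of the kind of traffic-driven link pruning described above (not the authors' exact algorithm, and without their deadlock-avoidance policies): links whose profiled traffic falls below a threshold are removed as long as the network stays connected. The mesh size, traffic values, and threshold are assumptions for illustration.

    ```python
    # Hypothetical sketch: prune low-utilization links from a profiled NoC topology.
    # networkx is used only for brevity; the threshold and connectivity test stand in
    # for the paper's hardware-cost/performance trade-off and deadlock handling.
    import networkx as nx

    def prune_links(mesh: nx.Graph, traffic: dict, threshold: float) -> nx.Graph:
        """Remove links whose profiled traffic is below `threshold`, skipping any
        removal that would disconnect the network."""
        g = mesh.copy()
        for u, v in sorted(mesh.edges, key=lambda e: traffic.get(e, 0.0)):
            if traffic.get((u, v), 0.0) >= threshold:
                break                       # remaining links carry enough traffic
            g.remove_edge(u, v)
            if not nx.is_connected(g):      # never sacrifice reachability
                g.add_edge(u, v)
        return g

    # Example: 4x4 mesh with a spatially biased traffic profile.
    mesh = nx.grid_2d_graph(4, 4)
    traffic = {e: (1.0 if e[0][0] == 0 else 0.05) for e in mesh.edges}
    print(prune_links(mesh, traffic, threshold=0.1).number_of_edges())
    ```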

  • FOREWORD Open Access

    Hideharu AMANO  

     
    FOREWORD

      Vol:
    E95-D No:12
      Page(s):
    2749-2749
  • A Retargetable Compiler Based on Graph Representation for Dynamically Reconfigurable Processor Arrays

    Vasutan TUNBUNHENG  Hideharu AMANO  

     
    PAPER-VLSI Systems

      Vol:
    E91-D No:11
      Page(s):
    2655-2665

    To develop a design environment for various Dynamically Reconfigurable Processor Arrays (DRPAs), the Graph with Configuration Information (GCI) is proposed to represent the configurable resources of the target dynamically reconfigurable architecture. Functional units, constant units, registers, and routing resources can be represented in the graph along with their configuration information. Hardware restrictions are also expressed in the graph by limiting the possible configurations at a node controlled by another node. A prototype compiler called Black-Diamond based on the GCI is now available for three different DRPAs. It translates a data-flow graph from a C-like front-end description, performs placement and routing using the GCI, and generates configuration data for each element of the DRPA. Evaluation results for simple applications show that Black-Diamond can generate reasonable designs for all three architectures. Other target architectures can easily be handled by expressing their architectural properties in a GCI.
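
    A much-simplified, hypothetical sketch of what a GCI-like resource graph could look like; the class and field names, and the way restrictions are encoded, are assumptions for illustration and do not follow the Black-Diamond implementation.

    ```python
    # Hypothetical sketch of a GCI-like resource graph: nodes carry their selectable
    # configurations, and a restriction table limits a node's settings depending on
    # the setting of a controlling node.
    from dataclasses import dataclass, field

    @dataclass
    class Resource:
        name: str                                   # e.g. "ALU0", "CONST3", "SW_1_2"
        kind: str                                   # "func", "const", "reg", or "route"
        configs: set = field(default_factory=set)   # selectable configuration values
        # restriction: controlling node -> {controller setting: allowed settings here}
        restricted_by: dict = field(default_factory=dict)

    @dataclass
    class GCI:
        nodes: dict = field(default_factory=dict)   # name -> Resource
        wires: list = field(default_factory=list)   # (src, dst) routing candidates

        def add(self, res: Resource):
            self.nodes[res.name] = res

        def legal_configs(self, name: str, current: dict) -> set:
            """Settings of `name` still selectable given the other nodes' settings."""
            res = self.nodes[name]
            allowed = set(res.configs)
            for ctrl, table in res.restricted_by.items():
                if ctrl in current:
                    allowed &= table.get(current[ctrl], set())
            return allowed
    ```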

  • A Survey on Dynamically Reconfigurable Processors Open Access

    Hideharu AMANO  

     
    INVITED PAPER

      Vol:
    E89-B No:12
      Page(s):
    3179-3187

    Dynamically reconfigurable processors consist of an array of processing elements whose functions and interconnections can be changed dynamically. Nine commercial systems are selected, and their array structures, processing elements, and interconnection architectures are classified.

  • A Generalized Theory Based on the Turn Model for Deadlock-Free Irregular Networks

    Ryuta KAWANO  Ryota YASUDO  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-Computer System

      Publicized:
    2019/10/08
      Vol:
    E103-D No:1
      Page(s):
    101-110

    Recently proposed irregular networks can reduce latency for both on-chip and off-chip systems with a large number of computing nodes and thus can improve the performance of parallel applications. However, these networks usually suffer from deadlock when packets are routed with a naive minimal-path routing algorithm. To solve this problem, we focus on a recently proposed theory that generalizes the turn model so as to maintain network performance while guaranteeing deadlock freedom. Applying these theorems to arbitrary topologies, including fully irregular networks, remains a challenge. In this paper, we advance the theorems to completely general ones. Moreover, we provide a feasible implementation of a deadlock-free routing method based on our advanced theorem. Experimental results show that the routing method based on our proposed theorem improves network throughput by up to 138% compared with a conventional deterministic minimal routing method. Moreover, when used as the escape path in Duato's protocol, it improves throughput by up to 26.3% compared with conventional up*/down* routing.
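
    The generalized theorems themselves are not reproduced here; the sketch below only illustrates the classical check that turn-model reasoning builds on (Dally and Seitz): a set of permitted turns is deadlock-free if the channel dependency graph it induces is acyclic. The data representation is an assumption for illustration.

    ```python
    # Sketch of the classical check behind turn-model reasoning: a routing
    # restriction is deadlock-free if the channel dependency graph it induces is
    # acyclic. The paper's generalized theorems are not reproduced here.
    import networkx as nx

    def channel_dependency_graph(channels, permitted_turns):
        """channels: directed links (u, v); permitted_turns: pairs ((u, v), (v, w))
        that the routing function allows a packet to take."""
        cdg = nx.DiGraph()
        cdg.add_nodes_from(channels)
        for held, requested in permitted_turns:
            cdg.add_edge(held, requested)  # a packet holding `held` may wait for `requested`
        return cdg

    def is_deadlock_free(channels, permitted_turns) -> bool:
        return nx.is_directed_acyclic_graph(
            channel_dependency_graph(channels, permitted_turns))
    ```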

  • Multi-Voltage Variable Pipeline Routers with the Same Clock Frequency for Low-Power Network-on-Chips Systems

    Akram BEN AHMED  Hiroki MATSUTANI  Michihiro KOIBUCHI  Kimiyoshi USAMI  Hideharu AMANO  

     
    PAPER

      Vol:
    E99-C No:8
      Page(s):
    909-917

    In this paper, the multi-voltage (multi-Vdd) variable pipeline router is proposed to reduce the power consumption of Networks-on-Chip (NoCs) designed for Chip Multi-Processors (CMPs). The multi-Vdd variable pipeline router adjusts its pipeline depth (i.e., communication latency) and supply voltage level in response to the applied workload. Unlike Dynamic Voltage and Frequency Scaling (DVFS) routers, the operating frequency remains the same for all routers throughout the CMP, which removes the need to synchronize neighboring routers working at different frequencies. Two router architectures are presented: a Coarse-Grained Variable Pipeline (CG-VP) router that changes the voltage supplied to the entire router, and a Fine-Grained Variable Pipeline (FG-VP) router that uses a finer power partitioning. The evaluation results show that the CG-VP and FG-VP routers achieve 22.9% and 35.3% power reductions on average, with 14% and 23% area overheads respectively, compared with a baseline router without variable pipelines. Thanks to a look-ahead mechanism for switching the supply voltage, the performance overhead is only 4.4%.
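
    A hypothetical controller sketch of the idea described above: with the clock frequency fixed, a router picks a (supply voltage, pipeline depth) pair from the observed load, and a look-ahead variant switches early on predicted traffic. The thresholds, voltage levels, and stage counts are invented for illustration and are not the CG-VP/FG-VP design.

    ```python
    # Illustrative mode table: lower Vdd means slower logic, so more pipeline
    # stages are needed to keep the same clock frequency. Values are assumptions.
    MODES = [            # (min load, Vdd [V], pipeline stages)
        (0.50, 1.0, 2),  # busy: full voltage, shallow pipeline, low latency
        (0.10, 0.8, 3),  # moderate load
        (0.00, 0.6, 5),  # near-idle: low voltage, deep pipeline, low power
    ]

    def select_mode(load: float):
        """load: recent buffer/link utilization in [0, 1]; frequency stays fixed."""
        for min_load, vdd, stages in MODES:
            if load >= min_load:
                return vdd, stages
        return MODES[-1][1], MODES[-1][2]

    def select_mode_lookahead(current_load: float, predicted_load: float):
        """Look-ahead flavour: raise the voltage early when traffic is about to rise."""
        return select_mode(max(current_load, predicted_load))
    ```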

  • An Operating System Guided Fine-Grained Power Gating Control Based on Runtime Characteristics of Applications

    Atsushi KOSHIBA  Mikiko SATO  Kimiyoshi USAMI  Hideharu AMANO  Ryuichi SAKAMOTO  Masaaki KONDO  Hiroshi NAKAMURA  Mitaro NAMIKI  

     
    PAPER

      Vol:
    E99-C No:8
      Page(s):
    926-935

    Fine-grained power gating (FGPG) is a power-saving technique that switches off circuit blocks while they are idle. Although FGPG can reduce power consumption without compromising computational performance, switching the power supply on and off incurs an energy overhead. To prevent a power increase caused by this overhead, our prior research proposed an FGPG control method in which the operating system (OS) pre-analyzes the power usage of applications. However, modern computing systems have a wide variety of use cases and run many types of applications, which makes it difficult to analyze the behavior of all of them in advance. This paper therefore proposes a new FGPG control method that requires no advance profiling of application programs. In the proposed method, the OS periodically monitors a circuit's idle intervals while application programs are running and enables FGPG only if the idle interval is long enough to reduce power consumption. The experimental results in this paper show that the proposed method reduces power consumption by 9.8% on average and by up to 17.2% at 25°C. The results also show that the proposed method achieves almost the same power-saving efficiency as the previous profile-based method.
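
    A minimal sketch of the run-time policy described above, assuming a break-even time model: the OS samples how long a circuit block stays idle and enables FGPG only when the typical idle interval is long enough to amortize the sleep/wake energy overhead. The sampling window and the constant are assumptions for illustration.

    ```python
    # Hypothetical OS-level governor: enable fine-grained power gating only while
    # the observed idle intervals exceed an assumed break-even time.
    from collections import deque

    BREAK_EVEN_NS = 500    # assumed idle time needed to amortize sleep/wake overhead

    class FgpgGovernor:
        def __init__(self, window: int = 32):
            self.recent_idle_ns = deque(maxlen=window)  # sampled idle intervals
            self.enabled = False

        def record_idle_interval(self, idle_ns: int):
            self.recent_idle_ns.append(idle_ns)

        def on_timer_tick(self) -> bool:
            """Periodic OS check: keep FGPG on only while idling long enough pays off."""
            if self.recent_idle_ns:
                avg_idle = sum(self.recent_idle_ns) / len(self.recent_idle_ns)
                self.enabled = avg_idle > BREAK_EVEN_NS
            return self.enabled
    ```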

  • MMLRU Selection Function: A Simple and Efficient Output Selection Function in Adaptive Routing

    Michihiro KOIBUCHI  Akiya JOURAKU  Hideharu AMANO  

     
    PAPER-Computer Systems

      Vol:
    E88-D No:1
      Page(s):
    109-118

    Adaptive routing algorithms, which dynamically select the route of a packet, have been widely studied for interconnection networks in massively parallel computers. An output selection function (OSF), which decides the output channel when several legal channels are free, is essential for adaptive routing. In this paper, we propose a simple and efficient OSF called minimal multiplexed and least-recently-used (MMLRU). The MMLRU selection function distributes traffic with two simple strategies: 1) each router locally estimates congestion from the utilization ratio of its own physical channels; 2) selection is divided into two steps, the choice among available physical channels and then the choice among available virtual channels. The MMLRU selection function can be used with any network topology and adaptive routing algorithm. Simulation results show that MMLRU improves throughput and latency, especially as the number of dimensions or the number of nodes per dimension grows.
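
    A sketch of the two-step selection described above; the bookkeeping (utilization counters and LRU timestamps) is one plausible software rendering, not the paper's hardware implementation.

    ```python
    # Two-step output selection: least-utilized physical channel first, then the
    # least-recently-used free virtual channel on it.
    import time

    class MMLRUSelector:
        def __init__(self):
            self.busy_cycles = {}     # physical channel -> cycles observed busy
            self.total_cycles = {}    # physical channel -> cycles observed in total
            self.last_used = {}       # (physical, virtual) channel -> last-use stamp

        def utilization(self, pc) -> float:
            # Counters would be updated every cycle by the router datapath.
            return self.busy_cycles.get(pc, 0) / max(self.total_cycles.get(pc, 1), 1)

        def select(self, candidates):
            """candidates: {physical channel: [free virtual channels]} restricted to
            the outputs the routing algorithm permits."""
            pc = min(candidates, key=self.utilization)
            vc = min(candidates[pc], key=lambda v: self.last_used.get((pc, v), 0.0))
            self.last_used[(pc, vc)] = time.monotonic()
            return pc, vc
    ```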

  • Performance Evaluation of Instruction Set Architecture of MBP-Light in JUMP-1

    Noriaki SUZUKI  Hideharu AMANO  

     
    PAPER

      Vol:
    E86-D No:10
      Page(s):
    1996-2005

    The instruction set architecture of MBP-light, a dedicated processor for the DSM (Distributed Shared Memory) management of JUMP-1, is analyzed with a real prototype. The Buffer-Register Architecture proposed for the MBP-core improves performance by 5.64% in the home cluster and 6.27% in a remote cluster. Only the special instruction for hashing cluster addresses is effective, improving performance by 2.80%; the other special instructions are almost useless. The dominant operations in the DSM management program turn out to be handling the packet queues assigned to the local cluster, so common RISC instructions, especially load/store instructions, are used frequently. Separating instruction and data memory improves performance by 33%. The results suggest that an alternative design providing separate on-chip caches and instructions dedicated to packet-queue management would be advantageous.

  • The MDX (Multi-Dimensional X'bar): A Class of Networks for Large Scale Multiprocessors

    Atsushi MURATA  Taisuke BOKU  Hideharu AMANO  

     
    PAPER-Interconnection Networks

      Vol:
    E79-D No:8
      Page(s):
    1116-1123

    Recent advances in semiconductor technology make it possible to produce medium-sized crossbars at reasonable cost. To make the best use of the high bandwidth of such crossbars, indirect networks such as the base-m n-cube and HyperCross have been proposed and studied. In these networks, a node is connected to other nodes through crossbars in multiple dimensions. Although these networks are used in commercial machines, there has been almost no discussion of the class of networks that includes them. In this paper, a network class called the Multi-Dimensional X'bar (MDX), which includes the above two networks, is defined. Several new networks in this class are proposed, and their relationship to direct networks and multistage interconnection networks is discussed. Finally, routing methods for these new networks are proposed and the average distance is evaluated. The discussion and evaluation show that the MDX provides higher bandwidth than the corresponding multistage interconnection network with less hardware than the corresponding direct network.

  • A Leakage Efficient Data TLB Design for Embedded Processors

    Zhao LEI  Hui XU  Daisuke IKEBUCHI  Tetsuya SUNATA  Mitaro NAMIKI  Hideharu AMANO  

     
    PAPER-Computer System

      Vol:
    E94-D No:1
      Page(s):
    51-59

    This paper presents a leakage-efficient data TLB (Translation Look-aside Buffer) design for embedded processors. Because of the data locality in programs, data TLB references tend to hit only a small number of pages during short execution intervals. After dividing the overall execution time into small time slices, a leakage reduction mechanism is proposed that detects the TLB entries which actually serve virtual-to-physical address translations within each time slice. With the integration of a dual-voltage-supply technique, TLB entries that are not used for address translation can then be put into a low-leakage mode (with a lower supply voltage) to save power. Evaluation results with eight MiBench programs show that the proposed design reduces the leakage power of the data TLB by 37% on average, with performance degradation of less than 0.01%.
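
    A minimal sketch of the per-time-slice policy described above: entries that served a translation during the last slice stay at full voltage, and the rest are switched to the low-leakage supply. The entry count and the wake-on-access handling are assumptions for illustration.

    ```python
    # Hypothetical per-time-slice leakage control for a data TLB.
    NUM_ENTRIES = 32    # assumed data-TLB size

    class LeakageAwareDTLB:
        def __init__(self):
            self.used_this_slice = [False] * NUM_ENTRIES
            self.low_leakage = [False] * NUM_ENTRIES   # True -> lower supply voltage

        def on_translation(self, entry: int):
            """Called when `entry` serves a virtual-to-physical translation."""
            self.used_this_slice[entry] = True
            self.low_leakage[entry] = False            # wake the entry on access

        def on_slice_boundary(self):
            """Called once per time slice: power down entries that stayed unused."""
            for i in range(NUM_ENTRIES):
                self.low_leakage[i] = not self.used_this_slice[i]
                self.used_this_slice[i] = False
    ```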

  • Improving the Performance of Circuit-Switched Interconnection Network for a Multi-FPGA System

    Kohei ITO  Kensuke IIZUKA  Kazuei HIRONAKA  Yao HU  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER

      Publicized:
    2021/08/05
      Vol:
    E104-D No:12
      Page(s):
    2029-2039

    Multi-FPGA systems have gained attention because of their high performance and power efficiency. A multi-FPGA system called Flow-in-Cloud (FiC) is currently being developed as an accelerator for multi-access edge computing (MEC). FiC consists of multiple mid-range FPGAs tightly connected by high-speed serial links. Since time-critical jobs are assumed in MEC, a circuit-switched network with static time-division multiplexing (STDM) switches has been implemented on FiC. This paper investigates techniques for enhancing the interconnection performance of FiC. Unlike switching fabrics for Networks-on-Chip or parallel machines, economical multi-FPGA systems such as FiC use Xilinx Aurora IP and FireFly cables with multiple lanes. We adopted link aggregation and slot distribution to exploit the multiple lanes. To mitigate the bottleneck between an STDM switch and the user logic, we also propose a multi-ejection STDM switch. We evaluated various combinations of these techniques with three practical applications on an FiC prototype with 24 boards. When the number of slots is large and the transferred data size is small, slot distribution was sometimes more effective, while link aggregation was superior in most other cases. Our multi-ejection STDM switch mitigated the bottleneck at the ejection ports and successfully reduced the number of time slots. As a result, combining link aggregation and the multi-ejection STDM switch improved communication performance by up to 7.50 times with few additional resources. Although the performance of the fast Fourier transform, which has the highest communication ratio, could not be improved by using multiple boards with a single lane, a 1.99-fold improvement over a single board was achieved with eight boards, four lanes, and our multi-ejection switch.
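
    The two lane-usage schemes compared above can be pictured as follows (an illustrative sketch, not FiC's actual RTL or slot format): link aggregation stripes each slot's payload across all lanes, while slot distribution deals whole slots to different lanes.

    ```python
    # Illustrative contrast between the two multi-lane schemes; payloads, lane
    # counts, and the striping rule are assumptions for illustration.
    def link_aggregation(slots, num_lanes):
        """Stripe each slot's payload across all lanes (one wide logical link)."""
        return [{lane: payload[lane::num_lanes] for lane in range(num_lanes)}
                for payload in slots]

    def slot_distribution(slots, num_lanes):
        """Deal whole slots round-robin to different lanes (more slots in flight)."""
        lanes = {lane: [] for lane in range(num_lanes)}
        for i, payload in enumerate(slots):
            lanes[i % num_lanes].append(payload)
        return lanes

    # Example with 4 lanes and byte-string payloads.
    slots = [b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"]
    print(link_aggregation(slots, 4)[0])   # slot 0 split across the 4 lanes
    print(slot_distribution(slots, 4))     # slots 0-2 sent on lanes 0-2
    ```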

  • Applying Run-Length Compression to the Configuration Data of SLM Fine-Grained Reconfigurable Logic Open Access

    Souhei TAKAGI  Takuya KOJIMA  Hideharu AMANO  Morihiro KUGA  Masahiro IIDA  

     
    PAPER-Computer System

      Publicized:
    2024/08/07
      Vol:
    E107-D No:12
      Page(s):
    1476-1483

    SLM (Scalable Logic Module) is a fine-grained reconfigurable logic architecture developed at Kumamoto University. It is characterized by its small configuration data size, which results in a smaller area for logic cells. We have been developing an SoC-type FPGA called SLMLET to take advantage of SLM. It keeps multiple sets of configuration data in an on-chip memory module in compressed form and exchanges them quickly. This paper proposes a simple run-length compression technique called TLC (Tag-Less Compression). TLC achieves a compression ratio of 1.01-3.06 and is embedded in the SLMLET prototype, which is available now. We then propose DMC (Duplication Module Compression), which exploits patterns that appear repeatedly in SLM configuration data. DMC achieves a better compression ratio for complicated designs that are hard to compress with TLC.
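
    A generic run-length encoder/decoder as a rough illustration of the idea behind TLC; the actual tag-less bitstream format and the DMC pattern matching are not reproduced here.

    ```python
    # Plain run-length coding: configuration data with long constant runs (e.g. zeros)
    # shrinks to a short list of (value, run length) pairs.
    def rle_encode(data: bytes):
        """Encode `data` as (value, run length) pairs, runs capped at 255."""
        runs, i = [], 0
        while i < len(data):
            j = i
            while j < len(data) and data[j] == data[i] and j - i < 255:
                j += 1
            runs.append((data[i], j - i))
            i = j
        return runs

    def rle_decode(runs) -> bytes:
        return bytes(b for value, count in runs for b in [value] * count)

    # Configuration-like data with long zero runs compresses well.
    config = bytes([0x00] * 40 + [0xA5] * 3 + [0x00] * 21)
    runs = rle_encode(config)
    assert rle_decode(runs) == config
    print(len(config), "bytes ->", 2 * len(runs), "bytes (1-byte value + 1-byte length per run)")
    ```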

  • FOREWORD Open Access

    Hideharu AMANO  

     
    FOREWORD

      Vol:
    E96-D No:8
      Page(s):
    1581-1581
  • FOREWORD Open Access

    Hideharu AMANO  

     
    FOREWORD

      Vol:
    E95-D No:2
      Page(s):
    293-293
  • Design and Implementation of RHiNET-2/NI0: A Reconfigurable Network Interface for Cluster Computing

    Tomonori YOKOYAMA  Naoyuki IZU  Jun-ichiro TSUCHIYA  Konosuke WATANABE  Hideharu AMANO  Tomohiro KUDOH  

     
    PAPER

      Vol:
    E86-D No:5
      Page(s):
    789-795

    A reconfigurable network interface called RHiNET-2/NI0 is developed for parallel processing on PCs distributed within one or more floors of a building. Two configurations can be selected according to the network requirements: the HS (High Speed) configuration, which provides only a high-speed primitive, and the DSM (Distributed Shared Memory) configuration, which supports sophisticated primitives. The empirical evaluation shows that the HS configuration markedly improves data-transfer latency compared with traditional network interfaces, while the DSM configuration executes sophisticated primitives for distributed shared memory more than twice as fast as a software implementation.

  • A Fine-Grained Multicasting of Configuration Data for Coarse-Grained Reconfigurable Architectures

    Takuya KOJIMA  Hideharu AMANO  

     
    PAPER-Computer System

      Publicized:
    2019/04/05
      Vol:
    E102-D No:7
      Page(s):
    1247-1256

    A novel configuration data compression technique for coarse-grained reconfigurable architectures (CGRAs) is proposed. Reducing the size of the configuration data of a CGRA shortens the reconfiguration time, especially when the communication bandwidth between the CGRA and a host CPU is limited, and also saves energy in the configuration cache and controller. The proposed technique is based on a multicast configuration technique called RoMultiC, which reduces the configuration time by multicasting the same data to multiple PEs (Processing Elements) selected with two bit-maps. Scheduling algorithms for optimizing the order of multicasting have been proposed; however, multicasting is possible only if each PE has exactly the same configuration. In general, configuration data for CGRAs can be divided into several fields, like the machine-code formats of general-purpose CPUs. The proposed scheme restricts multicasting to a subset of these fields so that more PEs can be multicast at once. This paper analyzes algorithms for finding a configuration pattern that maximizes the number of multicast PEs. We implemented the proposed scheme on CMA (Cool Mega Array), a straightforward CGRA, as a case study. Experimental results show that the proposed method achieves configuration data up to 40.0% smaller than a previous method for an image processing application. Exploring the multicast grain size reveals the effective grain size for each algorithm. Furthermore, since both the dynamic power consumption of the configuration controller and the configuration time are improved, the method achieves a 50.1% reduction in configuration energy with negligible area overhead.
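
    A sketch in the spirit of the bit-map multicasting and field-wise grouping described above; the array addressing, field names, and grouping step are assumptions for illustration, not the CMA or RoMultiC implementation.

    ```python
    # Bit-map multicast: a PE at (row, col) accepts the broadcast value for one
    # configuration field when both its row bit and column bit are set.
    from collections import defaultdict

    def multicast(row_bits, col_bits, field, value, pe_config):
        """Write `value` into one configuration `field` of every PE whose row and
        column bits are both set (the two bit-maps of the multicast)."""
        for (r, c) in pe_config:
            if row_bits[r] and col_bits[c]:
                pe_config[(r, c)][field] = value

    def group_by_field(target_config, field):
        """Group PEs that need the same value in one field; each group can then be
        covered by one or more row/column multicasts."""
        groups = defaultdict(list)
        for pe, cfg in target_config.items():
            groups[cfg[field]].append(pe)
        return groups
    ```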

  • Pot: A General Purpose Monitor for Parallel Computers

    Yuso KANAMORI  Oki MINABE  Masaki WAKABAYASHI  Hideharu AMANO  

     
    PAPER

      Vol:
    E86-D No:10
      Page(s):
    2025-2033

    At the initial stage of developing a parallel machine, a software monitor that manages communication with host computers, program loading, and debugging is necessary. However, developing such a monitoring system is often a cumbersome job, especially when the target has a parallel architecture. To solve this problem, we developed an integrated monitor system called "Pot". "Pot" consists of a system that runs on the host computer and simple code on the target machine. To reduce development costs, the program on the target machine is kept as simple as possible, while "Pot" on the host computer provides various functions for system development.

  • An FPGA-Based Acceleration Method for Metabolic Simulation

    Yasunori OSANA  Tomonori FUKUSHIMA  Masato YOSHIMI  Hideharu AMANO  

     
    PAPER-Reconfigurable Systems

      Vol:
    E87-D No:8
      Page(s):
    2029-2037

    Computer simulation of cellular processes is one of the most important applications in bioinformatics. Since such simulators need huge computational resources, many biologists must rely on expensive PC/WS clusters. ReCSiP is an FPGA-based, reconfigurable accelerator that aims to provide an economical high-performance simulation environment on desktop computers. It exploits fine-grained parallelism in the target applications with small hardware modules in the FPGA that work in parallel. As the first step toward implementing a cellular-process simulator on ReCSiP, a solver performing a basic simulation of metabolism was implemented. The throughput of the solver was about 29 times that of software running on Intel's Pentium III operating at 1.13 GHz.

  • A Batcher-Double-Omega Network with Combining

    Kalidou GAYE  Hideharu AMANO  

     
    PAPER-Computer Networks

      Vol:
    E75-D No:3
      Page(s):
    307-314

    The Batcher banyan network is well known as a non-blocking switching fabric. However, it is conflict-free only when no packets share the same destination. To cope with arbitrary combinations of packets, an additional network or a special control sequence is required, which increases hardware or degrades performance. The Batcher Double Omega network with Combining (BDOC) is an elegant solution to this problem. It consists of a Batcher sorter and two double-sized Omega networks. As in the Batcher banyan network, packets are sorted by destination label in the Batcher sorter. In the first Omega network, called the distributor, a packet is routed by a tag corresponding to the sum of its label at the output of the Batcher sorter and its destination label. In the second (inverse) Omega network, called the concentrator, the original destination label is used as the routing tag, and packets are routed without any conflict. The BDOC is useful as an interconnection network connecting processors and memory modules in a multiprocessor. Unlike conventional multistage interconnection networks for multiprocessors, packets are transferred in a serial and synchronized manner. The simple structure of the switching element enables high-speed operation, which reduces the latency caused by the serial communication. With pipelined circuit switching, the address and data packets share the same control signal, and the structure of the switching element is much simplified. Moreover, packet combining, which avoids hot-spot contention, is easily realized in the concentrator.
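
    A literal restatement of the tag computation described above as a small sketch; the packet format, network sizes, and the modulo wrap-around are assumptions for illustration.

    ```python
    # Routing tags for the two Omega networks of the BDOC, as described in the
    # abstract; the wrap-around over the distributor's output count is assumed.
    def distributor_tag(sorter_output_pos: int, dest: int, outputs: int) -> int:
        """First (double-sized) Omega network: route by the sum of the packet's
        position at the Batcher sorter output and its destination label."""
        return (sorter_output_pos + dest) % outputs

    def concentrator_tag(dest: int) -> int:
        """Second (inverse) Omega network: route by the original destination label."""
        return dest
    ```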

