Daihan WANG Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO
The regular 2-D mesh topology has been utilized for most of Network-on-Chips (NoCs) on FPGAs. Spatially biased traffic generated in some applications makes a customization method for removing links more efficient, since some links become low utilization. In this paper, a link removal strategy that customizes the router in NoC is proposed for reconfigurable systems in order to minimize the required hardware amount. Based on the pre-analyzed traffic information, links on which the communication amount is small are removed to reduce the hardware cost while maintaining adequate performance. Two policies are proposed to avoid deadlocks and they outperform up*/down* routing, which is a representative deadlock-free routing on irregular topology. In the case of the image recognition application susan, the proposed method can save 30% of the hardware amount without performance degradation.
Vasutan TUNBUNHENG Hideharu AMANO
For developing design environment of various Dynamically Reconfigurable Processor Arrays (DRPAs), the Graph with Configuration Information (GCI) is proposed to represent configurable resource in the target dynamically reconfigurable architecture. The functional unit, constant unit, register, and routing resource can be represented in the graph as well as the configuration information. The restriction in the hardware is also added in the graph by limiting the possible configuration at a node controlled by the other node. A prototype compiler called Black-Diamond with GCI is now available for three different DRPAs. It translates data-flow graph from C-like front-end description, applies placement and routing by using the GCI, and generates configuration data for each element of the DRPA. Evaluation results of simple applications show that Black-Diamond can generate reasonable designs for all three different architectures. Other target architectures can be easily treated by representing many aspects of architectural property into a GCI.
Dynamically reconfigurable processors are consisting of an array of processing elements whose functions and interconnections can be dynamically changed. 9 commercial systems are picked up, and their array structures, processing elements and interconnection architectures are classified.
Ryuta KAWANO Ryota YASUDO Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO
Recently proposed irregular networks can reduce the latency for both on-chip and off-chip systems with a large number of computing nodes and thus can improve the performance of parallel applications. However, these networks usually suffer from deadlocks in routing packets when using a naive minimal path routing algorithm. To solve this problem, we focus attention on a lately proposed theory that generalizes the turn model to maintain the network performance with deadlock-freedom. The theorems remain a challenge of applying themselves to arbitrary topologies including fully irregular networks. In this paper, we advance the theorems to completely general ones. Moreover, we provide a feasible implementation of a deadlock-free routing method based on our advanced theorem. Experimental results show that the routing method based on our proposed theorem can improve the network throughput by up to 138 % compared to a conventional deterministic minimal routing method. Moreover, when utilized as the escape path in Duato's protocol, it can improve the throughput by up to 26.3 % compared with the conventional up*/down* routing.
Akram BEN AHMED Hiroki MATSUTANI Michihiro KOIBUCHI Kimiyoshi USAMI Hideharu AMANO
In this paper, the Multi-voltage (multi-Vdd) variable pipeline router is proposed to reduce the power consumption of Network-on-Chips (NoCs) designed for Chip Multi-processors (CMPs). The multi-Vdd variable pipeline router adjusts its pipeline depth (i.e., communication latency) and supply voltage level in response to the applied workload. Unlike Dynamic Voltage and Frequency Scaling (DVFS) routers, the operating frequency remains the same for all routers throughout the CMP; thus, omitting the need to synchronize neighboring routers working at different frequencies. Two types of router architectures are presented: a Coarse-Grained Variable Pipeline (CG-VP) router that changes the voltage supplied to the entire router, and a Fine-Grained Variable Pipeline (FG-VP) router that uses a finer power partition. The evaluation results showed that the CG-VP and FG-VP routers achieve a 22.9% and 35.3% power reduction on average with 14% and 23% area overhead in comparison with a baseline router without variable pipelines, respectively. Thanks to the adopted look-ahead mechanism to switch the supply voltage, the performance overhead is only 4.4%.
Atsushi KOSHIBA Mikiko SATO Kimiyoshi USAMI Hideharu AMANO Ryuichi SAKAMOTO Masaaki KONDO Hiroshi NAKAMURA Mitaro NAMIKI
Fine-grained power gating (FGPG) is a power-saving technique by switching off circuit blocks while the blocks are idle. Although FGPG can reduce power consumption without compromising computational performance, switching the power supply on and off causes energy overhead. To prevent power increase caused by the energy overhead, in our prior research we proposed an FGPG control method of the operating system(OS) based on pre-analyzing applications' power usage. However, modern computing systems have a wide variety of use cases and run many types of application; this makes it difficult to analyze the behavior of all these applications in advance. This paper therefore proposes a new FGPG control method without profiling application programs in advance. In the new proposed method, the OS monitors a circuit's idle interval periodically while application programs are running. The OS enables FGPG only if the interval time is long enough to reduce the power consumption. The experimental results in this paper show that the proposed method reduces power consumption by 9.8% on average and up to 17.2% at 25°C. The results also show that the proposed method achieves almost the same power-saving efficiency as the previous profile-based method.
Michihiro KOIBUCHI Akiya JOURAKU Hideharu AMANO
Adaptive routing algorithms, which dynamically select the route of a packet, have been widely studied for interconnection networks in massively parallel computers. An output selection function (OSF), which decides the output channel when some legal channels are free, is essential for an adaptive routing. In this paper, we propose a simple and efficient OSF called minimal multiplexed and least-recently-used (MMLRU). The MMLRU selection function has the following simple strategies for distributing the traffic: 1) each router locally grasps the congestion information by the utilization ratio of its own physical channels; 2) it is divided into the two selection steps, the choice from available physical channels and the choice from available virtual channels. The MMLRU selection function can be used on any type of network topology and adaptive routing algorithm. Simulation results show that the MMLRU selection function improves throughput and latency especially when the number of dimension becomes larger or the number of nodes per dimension become larger.
The instruction set architecture of MBP-light, a dedicated processor for the DSM (Distributed Shared Memory) management of JUMP-1 is analyzed with a real prototype. The Buffer-Register Architecture proposed for MBP-core improves performance with 5.64% in the home cluster and 6.27% in a remote cluster. Only a special instruction for hashing cluster address is efficient and improves the performance with 2.80%, but other special instructions are almost useless. It appears that the dominant operations in the DSM management program were handling packet queues assigned into the local cluster. Thus, common RISC instructions, especially load/store instructions, are frequently used. Separating instruction and data memory improves performance with 33%. The results suggest that another alternative which provides separate on-chip cache and instructions dedicated for packet queue management is advantageous.
Atsushi MURATA Taisuke BOKU Hideharu AMANO
The recent advance of semiconductor technologies enable to produce a medium size of crossbar with reasonable cost. By making the best use of the high bandwidth of such crossbars, indirect networks including the base-m n-cube and HyperCross have been proposed and researched. In these networks, a node is connected other nodes through crossbars in multiple dimensions. Although these networks are practically used in commercial machines, almost no discussion on a class of networks including them has been done. In this paper, a network class called Multi-Dimensional X'bar (MDX) which includes the above two networks is defined. Several new networks in this class are proposed, and relationship between these networks and direct networks/multistage interconnection networks is discussed. Finally, routing methods for these new networks are proposed and the average distance is evaluated. Through the discussion and evaluation, the MDX supports higher bandwidth than the corresponding multistage interconnection network with smaller hardware than the corresponding direct network.
Zhao LEI Hui XU Daisuke IKEBUCHI Tetsuya SUNATA Mitaro NAMIKI Hideharu AMANO
This paper presents a leakage efficient data TLB (Translation Look-aside Buffer) design for embedded processors. Due to the data locality in programs, data TLB references tend to hit only a small number of pages during short execution intervals. After dividing the overall execution time into smaller time slices, a leakage reduction mechanism is proposed to detect TLB entries which actually serve for virtual-to-physical address translations within each time slice. Thus, with the integration of the dual voltage supply technique, those TLB entries which are not used for address translations can be put into low leakage mode (with lower voltage supply) to save power. Evaluation results with eight MiBench programs show that the proposed design can reduce the leakage power of a data TLB by 37% on average, with performance degradation less than 0.01%.
Kohei ITO Kensuke IIZUKA Kazuei HIRONAKA Yao HU Michihiro KOIBUCHI Hideharu AMANO
Multi-FPGA systems have gained attention because of their high performance and power efficiency. A multi-FPGA system called Flow-in-Cloud (FiC) is currently being developed as an accelerator of multi-access edge computing (MEC). FiC consists of multiple mid-range FPGAs tightly connected by high-speed serial links. Since time-critical jobs are assumed in MEC, a circuit-switched network with static time-division multiplexing (STDM) switches has been implemented on FiC. This paper investigates techniques of enhancing the interconnection performance of FiC. Unlike switching fabrics for Network on Chips or parallel machines, economical multi-FPGA systems, such as FiC, use Xilinx Aurora IP and FireFly cables with multiple lanes. We adopted the link aggregation and the slot distribution for using multiple lanes. To mitigate the bottleneck between an STDM switch and user logic, we also propose a multi-ejection STDM switch. We evaluated various combinations of our techniques by using three practical applications on an FiC prototype with 24 boards. When the number of slots is large and transferred data size is small, the slot distribution was sometimes more effective, while the link aggregation was superior for other most cases. Our multi-ejection STDM switch mitigated the bottleneck in ejection ports and successfully reduced the number of time slots. As a result, by combining the link aggregation and multi-ejection STDM switch, communication performance improved up to 7.50 times with few additional resources. Although the performance of the fast Fourier transform with the highest communication ratio could not be enhanced by using multiple boards when a lane was used, 1.99 times performance improvement was achieved by using 8 boards with four lanes and our multi-ejection switch compared with a board.
Souhei TAKAGI Takuya KOJIMA Hideharu AMANO Morihiro KUGA Masahiro IIDA
SLM (Scalable Logic Module) is a fine-grained reconfigurable logic developed by Kumamoto University. Its small configuration information size characterizes it, resulting in a smaller area for logic cells. We have been developing an SoC-type FPGA called SLMLET to take advantage of SLM. It keeps multiple sets of configuration data in the memory module inside the chip in a compressed form and exchanges them quickly. This paper proposes a simple run-length compression technique called TLC (Tag Less Compression). It achieved a 1.01-3.06 compression ratio, is embedded in the prototype of the SLMLET, and is available now. Then, we propose DMC (Duplication Module Compression), which uses repeatedly appearing patterns in the SLM configuration data. The DMC achieves a better compression ratio for complicated designs that are hard to compress with TLC.
Tomonori YOKOYAMA Naoyuki IZU Jun-ichiro TSUCHIYA Konosuke WATANABE Hideharu AMANO Tomohiro KUDOH
A reconfigurable network interface called RHiNET-2/NI0 is developed for parallel processing of PCs distributed within one or more floors of a building. Two configurations: the HS (High Speed) configuration with only a high-speed primitive and the DSM (Distributed Shared Memory) configuration which supports sophisticated primitives can be selected by the network requirement. From the empirical evaluation, it appears that the HS configuration markedly improves the latency of data transfer compared with traditional network interfaces. On the other hand, the DSM configuration executes sophisticated primitives for distributed shared memory more than twice as fast as that of software implementation.
A novel configuration data compression technique for coarse-grained reconfigurable architectures (CGRAs) is proposed. Reducing the size of configuration data of CGRAs shortens the reconfiguration time especially when the communication bandwidth between a CGRA and a host CPU is limited. In addition, it saves energy consumption of configuration cache and controller. The proposed technique is based on a multicast configuration technique called RoMultiC, which reduces the configuration time by multicasting the same data to multiple PEs (Processing Elements) with two bit-maps. Scheduling algorithms for an optimizing the order of multicasting have been proposed. However, the multicasting is possible only if each PE has completely the same configuration. In general, configuration data for CGRAs can be divided into some fields like machine code formats of general perpose CPUs. The proposed scheme confines a part of fields for multicasting so that the possibility of multicasting more PEs can be increased. This paper analyzes algorithms to find a configuration pattern which maximizes the number of multicasted PEs. We implemented the proposed scheme to CMA (Cool Mega Array), a straight forward CGRA as a case study. Experimental results show that the proposed method achieves 40.0% smaller configuration than a previous method for an image processing application at maximum. The exploration of the multicasted grain size reveals the effective grain size for each algorithm. Furthermore, since both a dynamic power consumption of the configuration controller and a configuration time are improved, it achieves 50.1% reduction of the energy consumption for the configuration with a negligible area overhead.
Yuso KANAMORI Oki MINABE Masaki WAKABAYASHI Hideharu AMANO
At the initial stage of developing parallel machines, a software monitor, which manages communication between host computers, program loading and debugging, is necessary. However, it is often a cumbersome job to develop such a monitoring system especially when the target takes a parallel architecture. To solve this problem, we developed an integrated monitor system called "Pot". "Pot" consists of a system runs on the host computer and simple code on a target machine. In order to reduce the development costs, the program on a target machine is as simple as possible while "Pot" on the host computer itself provides various functions for system development.
Yasunori OSANA Tomonori FUKUSHIMA Masato YOSHIMI Hideharu AMANO
Computer simulation of cellular process is one of the most important applications in bioinformatics. Since such simulators need huge computational resources, many biologists must use expensive PC/WS clusters. ReCSiP is an FPGA-based, reconfigurable accelerator which aims to realize economical high-performance simulation environment on desktop computers. It can exploit fine-grain parallelism in the target applications by small hardware modules in the FPGA which work in parallel manner. As the first step to implement a simulator of cellular process on ReCSiP, a solver to perform a basic simulation of metabolism was implemented. The throughput of the solver was about 29 times faster than the software on Intel's PentiumIII operating at 1.13 GHz.
The Batcher banyan network is well known as a non-blocking switching fabric. However, it is conflict free only when there is no packets for the same destination. To cope with the arbitrary combination of packets, an additional network or special control sequence which causes the increase of the hardware or performance degradation is required. A Batcher Double Omega network with Combining (BDOC) is an elegant solution of this problem. It consists of a Batcher sorter and two double sized Omega networks. Like in the Batcher banyan network, packets are sorted by the destination label in the Batcher sorter. In the first Omega network called the distributer, a packet is routed by a tag corresponding to the sum of the label at the output of the Batcher sorter and the destination label. In the second (Inverse) Omega network called the concentrator, the original destination label is used as the routing tag, and packets are routed without any conflict. The BDOC is useful for an interconnection network to connect processors and memory modules in multiprocessor. Unlike conventional multistage interconnection networks for multiprocessors, packets are transferred in a serial and synchronized manner. The simple structure of the switching element enables a high speed operation which reduces the latency caused by the serial communication. Using the pipelined circuit switching, the address and data packets share the same control signal, and the structure of the switching element is much simplified. Moreover, packets combining which avoids the hot spot contention is realized easily in the concentrator.