Takashi YOKOTA Kanemitsu OOTSU Shun KOJIMA
An interconnection network is an inevitable component for constructing parallel computers. It connects computation nodes so that the nodes can communicate with each other. As a parallel computation essentially requires inter-node communication according to a parallel algorithm, the interconnection network plays an important role in terms of communication performance. This paper focuses on the collective communication that is frequently performed in parallel computation and this paper addresses the Cup-Stacking method that is proposed in our preceding work. The key issues of the method are splitting a large packet into slices, re-shaping the slice, and stacking the slices, in a genetic algorithm (GA) manner. This paper discusses extending the Cup-Stacking method by introducing additional items (genes) and proposes the extended Cup-Stacking method. Furthermore, this paper places comprehensive discussions on the drawbacks and further optimization of the method. Evaluation results reveal the effectiveness of the extended method, where the proposed method achieves at most seven percent improvement in duration time over the former Cup-Stacking method.
Naoya NIWA Hideharu AMANO Michihiro KOIBUCHI
This study presents a selective data-compression interconnection network to boost its performance. Data compression virtually increases the effective network bandwidth. One drawback of data compression is a long latency to perform (de-)compression operation at a compute node. In terms of the communication latency, we explore the trade-off between the compression latency overhead and the reduced injection latency by shortening the packet length by compression algorithms. As a result, we present to selectively apply a compression technique to a packet. We perform a compression operation to long packets and it is also taken when network congestion is detected at a source compute node. Through a cycle-accurate network simulation, the selective compression method using the above compression algorithms improves by up to 39% the network throughput with a moderate increase in the communication latency of short packets.
Takashi YOKOTA Kanemitsu OOTSU Shun KOJIMA
Parallel computing essentially consists of computation and communication and, in many cases, communication performance is vital. Many parallel applications use collective communications, which often dominate the performance of the parallel execution. This paper focuses on collective communication performance to speed-up the parallel execution. This paper firstly offers our experimental result that splitting a session of collective communication to small portions (slices) possibly enables efficient communication. Then, based on the results, this paper proposes a new concept cup-stacking with a genetic algorithm based methodology. The preliminary evaluation results reveal the effectiveness of the proposed method.
Kazuei HIRONAKA Kensuke IIZUKA Miho YAMAKURA Akram BEN AHMED Hideharu AMANO
Multi-FPGA systems have been receiving a lot of attention as a low cost and energy efficient system for Multi-access Edge Computing (MEC). For such purpose, a bare-metal multi-FPGA system called FiC (Flow-in-Cloud) is under development. In this paper, we introduce the FiC multi FPGA cluster which is applied partial reconfiguration (PR) FPGA design flow to support online user defined accelerator replacement while executing FPGA interconnection network and its low-level multiple FPGA management software called remote PR manager. With the remote PR manager, the user can define the FiC FPGA cluster setup by JSON and control the cluster from user application with the cooperation of simple cluster management tool / library called ficmgr on the client host and REST API service provider called ficwww on Raspberry Pi 3 (RPi3) on each node. According to the evaluation results with a prototype FiC FPGA cluster system with 12 nodes, using with online application replacement by PR and on-the-fly FPGA bitstream compression, the time for FPGA bitstream distribution was reduced to 1/17 and the total cluster setup time was reduced by 21∼57% than compared to cluster setup with full configuration FPGA bitstream.
Ryuta KAWANO Ryota YASUDO Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO
Network throughput has become an important issue for big-data analysis on Warehouse-Scale Computing (WSC) systems. It has been reported that randomly-connected inter-switch networks can enlarge the network throughput. For irregular networks, a multi-path routing method called k-shortest path routing is conventionally utilized. However, it cannot efficiently exploit longer-than-shortest paths that would be detour paths to avoid bottlenecks. In this work, a novel routing method called k-optimized path routing to achieve high throughput is proposed for irregular networks. We introduce a heuristic to select detour paths that can avoid bottlenecks in the network to improve the average-case network throughput. Experimental results by network simulation show that the proposed k-optimized path routing can improve the saturation throughput by up to 18.2% compared to the conventional k-shortest path routing. Moreover, it can reduce the computation time required for optimization to 1/2760 at a minimum compared to our previously proposed method.
Takashi YOKOTA Kanemitsu OOTSU Takeshi OHKAWA
Inter-node communication is essential in parallel computation. The performance of parallel processing depends on the efficiencies in both computation and communication, thus, the communication cost is not negligible. A parallel application program involves a logical communication structure that is determined by the interchange of data between computation nodes. Sometimes the logical communication structure mismatches to that in a real parallel machine. This mismatch results in large communication costs. This paper addresses the node-mapping problem that rearranges logical position of node so that the degree of mismatch is decreased. This paper assumes that parallel programs execute one or more collective communications that follow specific traffic patterns. An appropriate node-mapping achieves high communication performance. This paper proposes a strong heuristic method for solving the node-mapping problem and adapts the method to a genetic algorithm. Evaluation results reveal that the proposed method achieves considerably high performance; it achieves 8.9 (4.9) times speed-up on average in single-(two-)traffic-pattern cases in 32×32 torus networks. Specifically, for some traffic patterns in small-scale networks, the proposed method finds theoretically optimized solutions. Furthermore, this paper discusses in deep about various issues in the proposed method that employs genetic algorithm, such as population of genes, number of generations, and traffic patterns. This paper also discusses applicability to large-scale systems for future practical use.
Ryuta KAWANO Ryota YASUDO Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO
Recently proposed irregular networks can reduce the latency for both on-chip and off-chip systems with a large number of computing nodes and thus can improve the performance of parallel applications. However, these networks usually suffer from deadlocks in routing packets when using a naive minimal path routing algorithm. To solve this problem, we focus attention on a lately proposed theory that generalizes the turn model to maintain the network performance with deadlock-freedom. The theorems remain a challenge of applying themselves to arbitrary topologies including fully irregular networks. In this paper, we advance the theorems to completely general ones. Moreover, we provide a feasible implementation of a deadlock-free routing method based on our advanced theorem. Experimental results show that the routing method based on our proposed theorem can improve the network throughput by up to 138 % compared to a conventional deterministic minimal routing method. Moreover, when utilized as the escape path in Duato's protocol, it can improve the throughput by up to 26.3 % compared with the conventional up*/down* routing.
Yi-Xian YANG Kung-Jui PAI Ruay-Shiung CHANG Jou-Ming CHANG
A set of spanning trees of a graphs G are called completely independent spanning trees (CISTs for short) if for every pair of vertices x, y∈V(G), the paths joining x and y in any two trees have neither vertex nor edge in common, except x and y. Constructing CISTs has applications on interconnection networks such as fault-tolerant routing and secure message transmission. In this paper, we investigate the problem of constructing two CISTs in the balanced hypercube BHn, which is a hypercube-variant network and is superior to hypercube due to having a smaller diameter. As a result, the diameter of CISTs we constructed equals to 9 for BH2 and 6n-2 for BHn when n≥3.
Takashi YOKOTA Kanemitsu OOTSU Takeshi OHKAWA
State-of-the-art parallel systems employ a huge number of computing nodes that are connected by an interconnection network. An interconnection network (ICN) plays an important role in a parallel system, since it is responsible to communication capability. In general, an ICN shows non-linear phenomena in its communication performance, most of them are caused by congestion. Thus, designing a large-scale parallel system requires sufficient discussions through repetitive simulation runs. This causes another problem in simulating large-scale systems within a reasonable cost. This paper shows a promising solution by introducing the cellular automata concept, which is originated in our prior work. Assuming 2D-torus topologies for simplification of discussion, this paper discusses fundamental design of router functions in terms of cellular automata, data structure of packets, alternative modeling of a router function, and miscellaneous optimization. The proposed models have a good affinity to GPGPU technology and, as representative speed-up results, the GPU-based simulator accelerates simulation upto about 1264 times from sequential execution on a single CPU. Furthermore, since the proposed models are applicable in the shared memory model, multithread implementation of the proposed methods achieve about 162 times speed-ups at the maximum.
Takashi YOKOTA Kanemitsu OOTSU Takeshi OHKAWA
This paper intends to reduce duration times in typical collective communications. We introduce logical addressing system apart from the physical one and, by rearranging the logical node addresses properly, we intend to reduce communication overheads so that ideal communication is performed. One of the key issues is rearrangement of the logical addressing system. We introduce genetic algorithm (GA) as meta-heuristic solution as well as the random search strategy. Our GA-based method achieves at most 2.50 times speedup in three-traffic-pattern cases.
Shyue-Ming TANG Yue-Li WANG Chien-Yi LI Jou-Ming CHANG
Generalized recursive circulant graphs (GRCGs for short) are a generalization of recursive circulant graphs and provide a new type of topology for interconnection networks. A graph of n vertices is said to be s-pancyclic for some $3leqslant sleqslant n$ if it contains cycles of every length t for $sleqslant tleqslant n$. The pancyclicity of recursive circulant graphs was investigated by Araki and Shibata (Inf. Process. Lett. vol.81, no.4, pp.187-190, 2002). In this paper, we are concerned with the s-pancyclicity of GRCGs.
Takashi YOKOTA Kanemitsu OOTSU Takeshi OHKAWA
Interconnection network is one of the inevitable components in parallel computers, since it is responsible to communication capabilities of the systems. It affects the system-level performance as well as the physical and logical structure of the systems. Although many studies are reported to enhance the interconnection network technology, we have to discuss many issues remaining. One of the most important issues is congestion management. In an interconnection network, many packets are transferred simultaneously and the packets interfere to each other in the network. Congestion arises as a result of the interferences. Its fast spreading speed seriously degrades communication performance and it continues for long time. Thus, we should appropriately control the network to suppress the congested situation for maintaining the maximum performance. Many studies address the problem and present effective methods, however, the maximal performance in an ideal situation is not sufficiently clarified. Solving the ideal performance is, in general, an NP-hard problem. This paper introduces particle swarm optimization (PSO) methodology to overcome the problem. In this paper, we first formalize the optimization problem suitable for the PSO method and present a simple PSO application as naive models. Then, we discuss reduction of the size of search space and introduce three practical variations of the PSO computation models as repetitive model, expansion model, and coding model. We furthermore introduce some non-PSO methods for comparison. Our evaluation results reveal high potentials of the PSO method. The repetitive and expansion models achieve significant acceleration of collective communication performance at most 1.72 times faster than that in the bursty communication condition.
The exchanged hypercube, denoted by EH(s,t), is a graph obtained by systematically removing edges from the corresponding hypercube, while preserving many of the hypercube's attractive properties. Moreover, ring-connected topology is one of the most promising topologies in Wavelength Division Multiplexing (WDM) optical networks. Let Rn denote a ring-connected topology. In this paper, we address the routing and wavelength assignment problem for implementing the EH(s,t) communication pattern on Rn, where n=s+t+1. We design an embedding scheme. Based on the embedding scheme, a near-optimal wavelength assignment algorithm using 2s+t-2+⌊2t/3⌋ wavelengths is proposed. We also show that the wavelength assignment algorithm uses no more than an additional 25 percent of (or ⌊2t-1/3⌋) wavelengths, compared to the optimal wavelength assignment algorithm.
Ryuta KAWANO Hiroshi NAKAHARA Ikki FUJIWARA Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO
End-to-end network latency has become an important issue for parallel application on large-scale high performance computing (HPC) systems. It has been reported that randomly-connected inter-switch networks can lower the end-to-end network latency. This latency reduction is established in exchange for a large amount of routing information. That is, minimal routing on irregular networks is achieved by using routing tables for all destinations in the networks. In this work, a novel distributed routing method called LOREN (Layout-Oriented Routing with Entries for Neighbors) to achieve low-latency with a small routing table is proposed for irregular networks whose link length is limited. The routing tables contain both physically and topologically nearby neighbor nodes to ensure livelock-freedom and a small number of hops between nodes. Experimental results show that LOREN reduces the average latencies by 5.8% and improves the network throughput by up to 62% compared with a conventional compact routing method. Moreover, the number of required routing table entries is reduced by up to 91%, which improves scalability and flexibility for implementation.
Takashi YOKOTA Kanemitsu OOTSU Takeshi OHKAWA
State-of-the-art parallel computers, which are growing in parallelism, require a lot of things in their interconnection networks. Although wide spectrum of efforts in research and development for effective and practical interconnection networks are reported, the problem is still open. One of the largest issues is congestion control that intends to maximize the network performance in terms of throughput and latency. Throttling, or injection limitation, is one of the center ideas of congestion control. We have proposed a new class of throttling method, Entropy Throttling, whose foundation is entropy concept of packets. The throttling method is successful in part, however, its potentials are not sufficiently discussed. This paper aims at exploiting capabilities of the Entropy Throttling method via comprehensive evaluation. Major contributions of this paper are to introduce two ideas of hysteresis function and guard time and also to clarify wide performance characteristics in steady and unsteady communication situations. By introducing the new ideas, we extend the Entropy throttling method. The extended methods improve communication performance at most 3.17 times in the best case and 1.47 times in average compared with non-throttling cases in collective communication, while the method can sustain steady communication performance.
Yao HU Ikki FUJIWARA Michihiro KOIBUCHI
A number of parallel applications run on a high-performance computing (HPC) system simultaneously. Job mapping and scheduling become crucial to improve system utilization, because fragmentation prevents an incoming job from being assigned even if there are enough compute nodes unused. Wireless supercomputers and datacenters with free-space optical (FSO) terminals have been proposed to replace the conventional wired interconnection so that a diverse application workload can be better supported by changing their network topologies. In this study we firstly present an efficient job mapping by swapping the endpoints of FSO links in a wireless HPC system. Our evaluation shows that an FSO-equipped wireless HPC system can achieve shorter average queuing length and queuing time for all the dispatched user jobs. Secondly, we consider the use of a more complicated and enhanced scheduling algorithm, which can further improve the system utilization over different host networks, as well as the average response time for all the dispatched user jobs. Finally, we present the performance advantages of the proposed wireless HPC system under more practical assumptions such as different cabinet capacities and diverse subtopology packings.
Thao-Nguyen TRUONG Khanh-Van NGUYEN Ikki FUJIWARA Michihiro KOIBUCHI
System expandability becomes a major concern for highly parallel computers and data centers, because their number of nodes gradually increases year by year. In this context we propose a low-degree topology and its floor layout in which a cabinet or node set can be newly inserted by connecting short cables to a single existing cabinet. Our graph analysis shows that the proposed topology has low diameter, low average shortest path length and short average cable length comparable to existing topologies with the same degree. When incrementally adding nodes and cabinets to the proposed topology, its diameter and average shortest path length increase modestly. Our discrete-event simulation results show that the proposed topology provides a comparable performance to 2-D Torus for some parallel applications. The network cost and power consumption of DSN-F modestly increase when compared to the counterpart non-random topologies.
Ahmed SHALABY Ikki FUJIWARA Michihiro KOIBUCHI
Recently network bandwidth becomes a performance concern particularly for collective communication since bisection bandwidths of supercomputers become far less than their full bisection bandwidths. In this context we propose the use of a network coding technique to reduce the number of unicasts and the size of data transferred in latency-sensitive collective communications in supercomputers. Our proposed network coding scheme has a hierarchical multicasting structure with intra-group and inter-group unicasts. Quantitative analysis show that the aggregate path hop counts by our hierarchical network coding decrease as much as 94% when compared to conventional unicast-based multicasts. We validate these results by cycle-accurate network simulations. In 1,024-switch networks, the network reduces the execution time of collective communications as much as 70%. We also show that our hierarchical network coding is beneficial for any packet size.
Kung-Jui PAI Jinn-Shyong YANG Sing-Chen YAO Shyue-Ming TANG Jou-Ming CHANG
Let T1,T2,...,Tk be spanning trees in a graph G. If, for any two vertices u,v of G, the paths joining u and v on the k trees are mutually vertex-disjoint, then T1,T2,...,Tk are called completely independent spanning trees (CISTs for short) of G. The construction of CISTs can be applied in fault-tolerant broadcasting and secure message distribution on interconnection networks. Hasunuma (2001) first introduced the concept of CISTs and conjectured that there are k CISTs in any 2k-connected graph. Unfortunately, this conjecture was disproved by Péterfalvi recently. In this note, we give a necessary condition for k-connected k-regular graphs with ⌊k/2⌋ CISTs. Based on this condition, we provide more counterexamples for Hasunuma's conjecture. By contrast, we show that there are two CISTs in 4-regular chordal rings CR(N,d) with N=k(d-1)+j under the condition that k ≥ 4 is even and 0 ≤ j ≤ 4. In particular, the diameter of each constructed CIST is derived.
Wen-Yin HUANG Jia-Jie LIU Jou-Ming CHANG Ro-Yu WU
An n-dimensional folded hypercube, denoted by FQn, is an enhanced n-dimensional hypercube with one extra link between nodes that have the furthest Hamming distance. Let FFv (respectively, FFe) denote the set of faulty nodes (respectively, faulty links) in FQn. Under the assumption that every fault-free node in FQn is incident to at least two fault-free links, Hsieh et al. (Inform. Process. Lett. 110 (2009) pp.41-53) showed that if |FFv|+|FFe| ≤ 2n-4 for n ≥ 3, then FQn-FFv-FFe contains a fault-free cycle of length at least 2n-2|FFv|. In this paper, we show that, under the same conditional fault model, FQn with n ≥ 5 can tolerate more faulty elements and provides the same lower bound of the length of a longest fault-free cycle, i.e., FQn-FFv-FFe contains a fault-free cycle of length at least 2n-2|FFv| if |FFv|+|FFe| ≤ 2n-3 for n ≥ 5.