Xiaoman LIU Yujie GAO Yuan HE Xiaohan YUE Haiyan JIANG Xibo WANG
The complexity and scale of Networks-on-Chip (NoCs) are growing as more processing elements and memory devices are implemented on chips. However, under strict power budgets, it is also critical to lower the power consumption of NoCs for the sake of energy efficiency. In this paper, we therefore present three novel input unit designs for on-chip routers that aim to reduce power consumption while preserving network performance. The key idea behind our designs is to organize the buffers in the input units with the characteristics of the network traffic in mind. In our observations, only a small portion of the network traffic consists of long packets (composed of multiple flits), so it is reasonable to implement hybrid, asymmetric and reconfigurable buffers that mainly target short packets (having only a single flit), which lowers power consumption and area overhead. Evaluations show that our hybrid, asymmetric and reconfigurable input unit designs achieve an average reduction in energy consumption per flit of 45%, 52.3% and 56.2%, respectively, while occupying 93.6% (for hybrid designs) and 66.3% (for asymmetric and reconfigurable designs) of the original router area. Meanwhile, we observe only minor degradation in network latency (ranging from 18.4% to 1.5%, on average) with our proposals.
Naoya NIWA Yoshiya SHIKAMA Hideharu AMANO Michihiro KOIBUCHI
Network-on-Chips (NoCs) are important components for scalable many-core processors. Because the performance of parallel applications is usually sensitive to the latency of NoCs, reducing it is a primary requirement. In this study, a compression router that hides the (de)compression-operation delay is proposed. The compression router (de)compresses the contents of the incoming packet before the switch arbitration is completed, thus shortening the packet length without latency penalty and reducing the network injection-and-ejection latency. Evaluation results show that the compression router improves the performance of parallel applications (conjugate gradients (CG), fast Fourier transform (FT), integer sort (IS), and traveling salesman problem (TSP)) by up to 33% and the effective network throughput by 63% at a compression ratio of 1.8 on the NoC. The cost is an increase in router area of 0.22 mm² and an increase in energy consumption by a factor of 1.6 compared to the conventional virtual-channel router. Another finding is that off-loading the decompressor onto a network interface decreases the compression-router area by 57% at the expense of a moderate increase in communication latency.
Jiao GUAN Jueping CAI Ruilian XIE Yequn WANG Jinzhi LAI
This letter presents an oblivious and load-balanced routing (OLBR) method without virtual channels for 2D mesh Network-on-Chip (NoC). To balance the traffic load of the network and avoid deadlock, OLBR divides the network nodes into two regions. One region contains the nodes on the east and west sides of the NoC, in which packets are routed by the odd-even turn rule with Y direction preference (OE-YX). The remaining nodes form the other region, in which packets are routed by the odd-even turn rule with alterable priority arbitration (OE-APA). Simulation results show that OLBR improves saturation throughput over related works by 11.73% and balances the traffic load over the entire network.
Meaad FADHEL Huaxi GU Wenting WEI
Recently, researchers have paid more attention to designing optical routers, since they are essential building blocks of all photonic interconnection architectures. Thus, improving them could directly improve the overall performance of the network. Optical routers suffer from increased insertion loss and crosstalk, which raises the power consumed as the network scales. In this paper, we propose a new 7×7 non-blocking optical router based on the Dimension Order Routing (DOR) algorithm. Moreover, we develop a method that ensures the least number of MicroRing Resonators (MRRs) in an optical router. By reducing the number of these optical devices, the proposed optical router can decrease the crosstalk and insertion loss of the network. This optical router is evaluated and compared to Ye's router and the optimized crossbar for a 3D mesh network that uses the XYZ routing algorithm. Unlike many other proposals, this paper evaluates optical routers not only from a router-level perspective but also under overall network-level conditions. The evaluations show that our optical router can reduce the worst-case network insertion loss by almost 8.7%, 46.39%, 39.3%, and 41.4% compared to Ye's router, optimized crossbar, optimized universal OR, and Optimized VOTEX, respectively. Moreover, it decreases the worst-case Optical Signal-to-Noise Ratio (OSNR) by almost 27.92%, 88%, 77%, and 69.6% compared to Ye's router, optimized crossbar, optimized universal OR, and Optimized VOTEX, respectively. It also reduces the power consumption by 3.22%, 23.99%, 19.12%, and 20.18% compared to Ye's router, optimized crossbar, optimized universal OR, and Optimized VOTEX, respectively.
We design a new oblivious routing algorithm for two-dimensional mesh-based Networks-on-Chip (NoCs) called LEF (Long Edge First) which offers high throughput with low design complexity. LEF's basic idea comes from conventional wisdom in choosing the appropriate dimension-order routing (DOR) algorithm for supercomputers with asymmetric mesh or torus interconnects: routing longest dimensions first provides better performance than other strategies. In LEF, we combine the XY DOR and the YX DOR. When routing a packet, which DOR algorithm is chosen depends on the relative position between the source node and the destination node. Decisions of selecting the appropriate DOR algorithm are not fixed to the network shape but instead made on a per-packet basis. We also propose an efficient deadlock avoidance method for LEF in which the use of virtual channels is more flexible than in the conventional method. We evaluate LEF against O1TURN, another effective oblivious routing algorithm, and a minimal adaptive routing algorithm based on the odd-even turn model. The evaluation results show that LEF is particularly effective when the communication is within an asymmetric mesh. In a 16×8 NoC, LEF even outperforms the adaptive routing algorithm in some cases and delivers from around 4% up to around 64.5% higher throughput than O1TURN. Our results also show that the proposed deadlock avoidance method helps to improve LEF's performance significantly and can be used to improve O1TURN's performance. We also examine LEF in large-scale NoCs with thousands of nodes. Our results show that, as the NoC size increases, the performance of the routing algorithms becomes more strongly influenced by the resource allocation policy in the network and the effect is different for each algorithm. This is evident in that results of middle-scale NoCs with around 100 nodes cannot be applied directly to large-scale NoCs.
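The per-packet choice between XY and YX dimension-order routing can be sketched as follows. This is only one reading of the abstract, not the authors' implementation: it assumes "long edge first" means traversing the dimension with the larger source-to-destination offset first, and the tie-break toward XY, the function name `lef_route`, and the `(x, y)` coordinate convention are all assumptions made for illustration. Virtual-channel assignment and deadlock avoidance are not modeled.

```python
def lef_route(src, dst):
    """Return the hops from src to dst on a 2D mesh, longer dimension first.

    src and dst are (x, y) tuples. The returned list excludes src and
    includes dst. Ties between the two offsets fall back to XY order
    (an assumption; the abstract does not state a tie-break).
    """
    x, y = src
    dst_x, dst_y = dst
    # Pick XY DOR when the X offset is at least as long as the Y offset,
    # otherwise pick YX DOR. The decision is made once per packet.
    x_first = abs(dst_x - x) >= abs(dst_y - y)
    path = []
    for dim in ("x", "y") if x_first else ("y", "x"):
        if dim == "x":
            step = 1 if dst_x > x else -1
            while x != dst_x:
                x += step
                path.append((x, y))
        else:
            step = 1 if dst_y > y else -1
            while y != dst_y:
                y += step
                path.append((x, y))
    return path


if __name__ == "__main__":
    print(lef_route((0, 0), (3, 1)))  # X offset is longer, so X is routed first
    print(lef_route((0, 0), (1, 3)))  # Y offset is longer, so Y is routed first
```

Because the decision depends only on the source and destination coordinates, the routing stays oblivious: no congestion state is consulted.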
Tao LIU Huaxi GU Yue WANG Wei ZOU
An optimized low-power optical memory access network is proposed to alleviate the cost of microring resonators (MRs) in kilocore systems, such as the pass-by loss and integration difficulty. Compared with a traditional electronic bus interconnect, the proposed network reduces power consumption by 80% to 89% and latency by 21% to 24%. Moreover, the new network decreases the number of MRs by 90.6% without an increase in power consumption or latency compared with the Optical Ring Network-on-Chip (ORNoC).
Xilu WANG Yongjun SUN Huaxi GU
The mapping optimization problem in Network-on-Chip (NoC) is constrained and NP-hard, and deterministic algorithms require considerable computation time to find an exact optimal mapping solution. Therefore, metaheuristic algorithms (MAs) have attracted great interest from researchers. However, most MAs are designed for continuous problems and suffer from premature convergence. In this letter, a binary metaheuristic mapping algorithm (BMM) with a better exploration-exploitation balance is proposed to solve the mapping problem. Binary encoding is used to extend the MAs to the constrained problem, and an adaptive strategy is introduced to combine the Sine Cosine Algorithm (SCA) and the Particle Swarm Algorithm (PSO). SCA is modified to explore the search space effectively, while the powerful exploitation ability of PSO is employed to find the global optimum. A set of well-known applications and large-scale synthetic core graphs are used to test the performance of BMM. The results demonstrate that the proposed algorithm reduces energy consumption more significantly than some other heuristic algorithms.
Naohisa FUKASE Yasuyuki MIURA Shigeyoshi WATANABE M.M. HAFIZUR RAHMAN
A high-performance network-on-chip (NoC) router that uses minimal hardware resources to minimize the layout area is essential for NoC design. In this paper, we propose a memory sharing method for a wormhole-routed NoC architecture to alleviate the area overhead of a NoC router. In the proposed method, a memory is shared by multiple physical links by using a multi-port memory. We also propose a partial link-sharing method and evaluate the communication performance using the proposed method. The results reveal that the communication performance of the proposed methods is higher than that of the conventional method, and that the progress ratio of the 3D-torus network is higher than that of the 2D-torus network. The improvement in communication performance using the partial link-sharing method is achieved with a slight increase in hardware cost.
Xuan-Tu TRAN Tung NGUYEN Hai-Phong PHAN Duy-Hieu BUI
The increasing demand for scalability and reusability in system-on-chip design, as well as the decoupling of computation and communication, has motivated the growth of the Network-on-Chip (NoC) paradigm in the last decade. In NoC-based systems, the computational resources (i.e., IPs) communicate with each other using a network infrastructure. Many works have focused on the development of NoC architectures and routing mechanisms, while the interfacing between the network and the associated IPs also needs to be considered. In this paper, we present a novel, efficient AXI (AMBA eXtensible Interface) compliant network adapter for NoC architectures, named the AXI-NoC adapter. The proposed network adapter achieves a high communication throughput of 20.8Gbits/s and consumes 4.14mW at an operating frequency of 650MHz. It has a low area footprint (952 gates, approximately 2,793 µm² in 45nm CMOS technology) thanks to its effective hybrid micro-architectures, and zero latency thanks to the proposed mux-selection method.
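As a quick sanity check on the reported figures, the throughput and clock frequency imply how many bits the adapter moves per cycle, and the power figure gives a rough energy per bit. The sketch below is back-of-the-envelope arithmetic only. It assumes the 4.14mW figure applies while the adapter sustains the full 20.8Gbits/s, which the abstract does not state, and the helper names are hypothetical.

```python
def bits_per_cycle(throughput_bps, clock_hz):
    """Bits transferred per clock cycle at the given throughput and frequency."""
    return throughput_bps / clock_hz


def energy_per_bit_pj(power_w, throughput_bps):
    """Energy per transferred bit in picojoules.

    Assumes the power figure is measured at the stated throughput
    (an assumption, not a claim from the abstract).
    """
    return power_w / throughput_bps * 1e12


if __name__ == "__main__":
    throughput = 20_800_000_000  # 20.8 Gbit/s, from the abstract
    clock = 650_000_000          # 650 MHz, from the abstract
    power = 4.14e-3              # 4.14 mW, from the abstract
    print(bits_per_cycle(throughput, clock))
    print(round(energy_per_bit_pj(power, throughput), 3))
```

The ratio works out to 32 bits per cycle, which would be consistent with a 32-bit data path transferring one word per cycle, although the abstract does not give the data width.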
Ruilian XIE Jueping CAI Xin XIN Bo YANG
This letter presents a Preferable Mad-y (PMad-y) turn model and a Low-cost Adaptive and Fault-tolerant Routing (LAFR) method that use one and two virtual channels along the X and Y dimensions, respectively, for 2D mesh Network-on-Chip (NoC). By applying PMad-y rules and using the link status of neighbor routers within 2 hops, LAFR can tolerate multiple faulty links and routers in more complicated fault situations and improve the reliability of the network without losing network performance. Simulation results show that LAFR achieves better saturation throughput (0.98% on average) than other fault-tolerant routing methods and maintains high reliability of more than 99.56% on average. To achieve 100% network reliability, a Preferable LAFR (PLAFR) is proposed.
Akram BEN AHMED Hiroki MATSUTANI Michihiro KOIBUCHI Kimiyoshi USAMI Hideharu AMANO
In this paper, the Multi-voltage (multi-Vdd) variable pipeline router is proposed to reduce the power consumption of Network-on-Chips (NoCs) designed for Chip Multi-processors (CMPs). The multi-Vdd variable pipeline router adjusts its pipeline depth (i.e., communication latency) and supply voltage level in response to the applied workload. Unlike with Dynamic Voltage and Frequency Scaling (DVFS) routers, the operating frequency remains the same for all routers throughout the CMP, which removes the need to synchronize neighboring routers working at different frequencies. Two types of router architectures are presented: a Coarse-Grained Variable Pipeline (CG-VP) router that changes the voltage supplied to the entire router, and a Fine-Grained Variable Pipeline (FG-VP) router that uses a finer power partition. The evaluation results show that the CG-VP and FG-VP routers achieve a 22.9% and 35.3% power reduction on average with 14% and 23% area overhead, respectively, in comparison with a baseline router without variable pipelines. Thanks to the adopted look-ahead mechanism to switch the supply voltage, the performance overhead is only 4.4%.
Lian ZENG Tieyuan PAN Xin JIANG Takahiro WATANABE
As semiconductor technology continues to develop, hundreds of cores will be deployed on a single die in future Chip-Multiprocessor (CMP) designs. Three-Dimensional Network-on-Chips (3D NoCs) have become an attractive solution that can provide high performance. An efficient and deadlock-free routing algorithm is critical to achieving high network-on-chip performance. Traditional methods based on deterministic routing and turn models are deadlock-free, but they are unable to distribute the traffic load over the network. In this paper, we propose an efficient, adaptive and deadlock-free routing algorithm (EAR) based on a novel routing selection strategy in 3D NoCs, which can distribute the traffic load not only within layers but also across layers according to congestion information and path diversity. Simulation results show that the proposed method achieves a significant performance improvement compared with other methods.
Jie JIAN Mingche LAI Liquan XIAO
With the development of silicon-based nano-photonics, Optical Network on Chip (ONoC) is, due to its high bandwidth and low latency, becoming an important choice for future multi-core networks. As a key ONoC technology, the arbitration scheme should provide differential arbitration service with high throughput and low latency for the various types and priorities of traffic in CMPs. In this work, we propose a fast hierarchical arbitration scheme based on multi-level priority QoS. First, given multi-priority data buffer queues, arbiters provide differential transmissions with fair service for all nodes and guarantee the max-transmit-delay and min-communication-bandwidth for all queues. Second, the arbiter adopts the transmit bound resource reservation scheme to reserve time slots for all nodes fairly, thereby achieving a throughput of 100%. Third, we propose fast arbitration with a layout of fast optical arbitration channels (FOACs) to reduce the arbitration period, thereby reducing packet transmission delay. Simulation results show that with our hierarchical arbitration scheme, all nodes are allocated almost equal service access probability under various traffic patterns; thus, the min-communication-bandwidth and max-transmit-delay are guaranteed to be 5% and 80 cycles, respectively, under overload demands. This scheme improves throughput by 17% compared to FeatherWeight under a self-similar traffic pattern and decreases arbitration delay by 15% compared to 2-pass arbitration, incurring a total power overhead of 5%.
Akira MOCHIZUKI Hirokatsu SHIRAHAMA Yuma WATANABE Takahiro HANYU
An energy-efficient intra-chip communication link circuit with ternary current signaling is proposed for an asynchronous Network-on-Chip. The data signal encoded by an asynchronous three-state protocol is represented by a small-voltage-swing three-level intermediate signal, which reduces the transition delay and achieves energy-efficient data transfer. The three-level voltage is generated by a combination of dynamically controlled current sources with a feedback loop mechanism. Moreover, the proposed circuit contains a power-saving scheme that also utilizes the dynamically controlled transistors. By cutting off the current paths when data transfer on the communication link is inactive, the power dissipation can be greatly reduced. It is demonstrated that the average data-transfer speed is about 1.5 times faster than that of a binary CMOS implementation using a 130nm CMOS technology at a supply voltage of 1.2V.
Shijun LIN Zhaoshan LIU Jianghong SHI Xiaofang WU
In this paper, we propose a scalable connection-based time division multiple access architecture for wireless NoC. In this architecture, only one-hop transmission is needed when a packet is transmitted from one wired subnet to another wired subnet, which improves the communication performance and cuts down the energy consumption. Furthermore, by carefully designing the central arbiter, the bandwidth of the wireless channel can be fully used. Simulation results show that compared with the traditional WCube wireless NoC architecture, the proposed architecture can greatly improve the network throughput, and cut down the transmission latency and energy consumption with a reasonable area overhead.
Naoya ONIZAWA Akira MOCHIZUKI Hirokatsu SHIRAHAMA Masashi IMAI Tomohiro YONEDA Takahiro HANYU
This paper introduces a partially parallel inter-chip link architecture for asynchronous multi-chip Network-on-Chips (NoCs). Multi-chip NoCs that operate as one large NoC have recently been proposed for very large systems, such as automotive applications. Inter-chip links are key elements in realizing high-performance multi-chip NoCs using a limited number of I/Os. The proposed asynchronous link, based on level-encoded dual-rail (LEDR) encoding, transmits several bits in parallel that are received by detecting the phase information of the LEDR signals at each serial link. It employs burst-mode data transmission that eliminates the per-bit handshake for high-speed operation, but this elimination may cause data-transmission errors due to crosstalk and power-supply noise. To trigger data retransmission, errors are detected from the embedded phase information; error-detection codes are not used. The throughput is theoretically modelled and optimized by considering the bit-error rate (BER) of the link. Using delay parameters estimated for a 0.13 µm CMOS technology, a throughput of 8.82 Gbps is achieved using 10 I/Os, which is 90.5% higher than that of a link using 9 I/Os without an error-detection method operating under a negligibly low BER (<10^-20).
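To illustrate how phase information embedded in LEDR signals can expose transmission errors, the sketch below models a single LEDR link in software. It uses the standard LEDR definition (a value rail carrying the data bit and a timing rail chosen so that the XOR of the two rails equals the current phase, which alternates every symbol). It is not the paper's circuit: the starting phase of 0, the function names, and the single-link simplification are assumptions, and the parallel burst-mode transfer and retransmission logic are not modeled.

```python
def ledr_encode(bits, start_phase=0):
    """Encode bits as (value, timing) rail pairs with alternating phase.

    For each symbol, value == data bit and value XOR timing == phase.
    The starting phase is an assumption made for this sketch.
    """
    symbols = []
    phase = start_phase
    for bit in bits:
        symbols.append((bit, bit ^ phase))
        phase ^= 1  # phase alternates between even (0) and odd (1)
    return symbols


def ledr_decode(symbols, start_phase=0):
    """Recover the bits, raising ValueError when the phase sequence is broken."""
    bits = []
    expected_phase = start_phase
    for index, (value, timing) in enumerate(symbols):
        if value ^ timing != expected_phase:
            # A phase mismatch means a rail was corrupted or a symbol was
            # lost; a real link would request retransmission here.
            raise ValueError(f"phase error at symbol {index}")
        bits.append(value)
        expected_phase ^= 1
    return bits


if __name__ == "__main__":
    encoded = ledr_encode([1, 0, 0, 1])
    print(encoded)
    print(ledr_decode(encoded))
```

With this encoding exactly one rail changes between consecutive symbols, so a single-rail glitch flips the observed phase and is caught by the receiver without a separate error-detection code.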
Takashi MIYAMORI Hui XU Hiroyuki USUI Soichiro HOSODA Toru SANO Kazumasa YAMAMOTO Takeshi KODAKA Nobuhiro NONOGAKI Nau OZAKI Jun TANABE
New media processing applications such as image recognition and AR (Augmented Reality) have become practical on embedded systems for automotive, digital-consumer and mobile products. Many-core processors have been proposed to realize much higher performance than multi-core processors. We have developed a low-power many-core SoC for multimedia applications in 40nm CMOS technology. Within a 210 mm² die, two 32-core clusters are integrated with dynamically reconfigurable processors, hardware accelerators, 2-channel DDR3 I/Fs, and other peripherals. Processor cores in each cluster share a 2MB L2 cache connected through a tree-based Network-on-Chip (NoC). Its total peak performance exceeds 1.5TOPS (Tera Operations Per Second). High scalability and low power consumption are accomplished by parallelized software for multimedia applications. In the case of face detection, the performance scales up to 64 cores and the SoC consumes only 2.21W. Moreover, it can execute 1080p 48fps H.264 decoding at about 520mW with 28 cores and 4K2K 15fps super resolution at about 770mW with 32 cores in one cluster. By exploiting parallelism with low-power processor cores, the many-core SoC provides several tens of times better energy efficiency than a high-performance desktop quad-core processor.
Huaxi GU Zheng CHEN Yintang YANG Hui DING
Optical Network-on-Chip (ONoC) is a promising emerging technology, which can solve the bottlenecks faced by electrical on-chip interconnection. However, the existing proposals of ONoC are mostly built on fixed topologies, which are not flexible enough to support various applications. To make full use of the limited resource and provide a more efficient approach for resource allocation, RONoC (Reconfigurable Optical Network-on-Chip) is proposed in this letter. The topology can be reconfigured to meet the requirement of different applications. An 8×8 nonblocking router is also designed, together with the communication mechanism. The simulation results show that the saturation load of RONoC is 2 times better than mesh, and the energy consumption is 25% lower than mesh.
Ahmadou Dit Adi CISSE Michihiro KOIBUCHI Masato YOSHIMI Hidetsugu IRIE Tsutomu YOSHINAGA
Silicon photonics Network-on-Chips (NoCs) have emerged as an attractive solution to alleviate the high power consumption of traditional electronic interconnects. In this paper, we propose a fully optical ring NoC that combines static and dynamic wavelength allocation communication mechanisms. A different wavelength-channel is statically allocated to each destination node for lightweight communication. Contention among simultaneous communication requests from multiple source nodes to the same destination is resolved by token-based arbitration for the particular wavelength-channel. For heavy-load communication, a multiwavelength-channel is available; a source node requests it at execution time from a special node that manages dynamic allocation of the shared multiwavelength-channel among all nodes. We combine these static and dynamic communication mechanisms in the same network and introduce selection techniques based on message size and congestion information. Using a photonic NoC simulator based on Phoenixsim, we evaluate our architecture under uniform random, neighbor, and hotspot traffic patterns. Simulation results show that our proposed fully optical ring NoC performs well by using the appropriate static and dynamic channels based on the selection techniques. We also show that our architecture can reduce the energy consumption necessary for arbitration by more than half compared to hybrid photonic ring and mesh NoCs. A comparison with several previous works in terms of architecture hardware cost shows that our architecture can be an attractive, cost-performance-efficient interconnection infrastructure for future SoCs and CMPs.
We propose a fault diagnosis and reconfiguration method based on the Pair and Swap scheme to improve the reliability and the MTTF (Mean Time To Failure) of network-on-chip based multiple processor systems where each processor core has its own private memory. In the proposed scheme, two identical copies of a given task are executed on a pair of processor cores and the results are compared repeatedly in order to detect processor faults. If a fault is detected through a mismatch, the fault is identified and isolated using TMR (Triple Modular Redundancy) and the system is reconfigured using the redundant processor cores. We propose that each task be quadruplicated and statically assigned to private memories so that each memory holds only two different tasks. We evaluate the reliability of the proposed quadruplicated task allocation scheme in terms of MTTF. As a result, the MTTF of the proposed scheme is over 4.3 times longer than that of the duplicated task allocation scheme.
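The detect-then-identify flow described above can be sketched in a few lines: a pairwise comparison detects that something is wrong, and a three-way majority vote then pinpoints the faulty core. This is a minimal illustration under stated assumptions, not the authors' method. The function names, the core identifiers, and the use of plain result values are hypothetical, and the task quadruplication, memory assignment, and reconfiguration steps are not modeled.

```python
from collections import Counter


def pair_mismatch(result_a, result_b):
    """Detection step: two copies of the same task should agree."""
    return result_a != result_b


def tmr_identify(results):
    """Identification step: majority vote over three cores.

    results maps a core id to the (hashable) result it produced.
    Returns the majority value and the list of cores that disagree.
    Raises RuntimeError when no two cores agree.
    """
    value, votes = Counter(results.values()).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: more than one core may be faulty")
    faulty = [core for core, result in results.items() if result != value]
    return value, faulty


if __name__ == "__main__":
    # Hypothetical run: core1 returns a wrong result for the same task.
    if pair_mismatch(42, 41):
        print(tmr_identify({"core0": 42, "core1": 41, "core2": 42}))
```

A pair comparison alone cannot tell which of the two cores failed, which is why a third execution is needed before the faulty core can be isolated.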