Naoya NIWA Hideharu AMANO Michihiro KOIBUCHI
This study presents a selective data-compression interconnection network to boost network performance. Data compression virtually increases the effective network bandwidth. One drawback of data compression is the long latency of the (de-)compression operations at a compute node. In terms of communication latency, we explore the trade-off between the compression latency overhead and the reduction in injection latency obtained by shortening packets with compression algorithms. Based on this trade-off, we propose selectively applying compression to packets: a compression operation is performed on long packets, and it is also applied when network congestion is detected at a source compute node. Through cycle-accurate network simulation, the selective compression method using the above compression algorithms improves the network throughput by up to 39% with a moderate increase in the communication latency of short packets.
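A minimal sketch of such a source-node policy is given below. The threshold names and the congestion test (injection-queue occupancy) are illustrative assumptions, not the exact conditions used in the paper.

```python
# Hedged sketch: a possible source-node policy for selective packet compression.
# The thresholds and the congestion indicator are illustrative assumptions.

def should_compress(packet_length_flits: int,
                    injection_queue_occupancy: float,
                    length_threshold: int = 8,
                    congestion_threshold: float = 0.5) -> bool:
    """Return True if the packet should be compressed before injection.

    Long packets amortize the (de-)compression latency over many flits,
    and compression is also worthwhile when the source detects congestion,
    because the shortened packet lowers the injection latency.
    """
    is_long = packet_length_flits >= length_threshold
    is_congested = injection_queue_occupancy >= congestion_threshold
    return is_long or is_congested


if __name__ == "__main__":
    print(should_compress(2, 0.1))   # short packet, idle network -> False
    print(should_compress(32, 0.1))  # long packet -> True
    print(should_compress(2, 0.8))   # short packet, congested source -> True
```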
Takahiro KAWAGUCHI Naofumi TAKAGI
A 32-bit arithmetic logic unit (ALU) is designed for a rapid single flux quantum (RSFQ) bit-parallel processor. In the ALU, clocked gates are partially replaced by clockless gates, which reduces the number of D flip-flops (DFFs) required for path balancing. The number of clocked gates, including DFFs, is reduced by approximately 40%, and the size of the clock distribution network is reduced accordingly. The number of pipeline stages also remains modest. The layout design of the ALU and simulation results show the effectiveness of using clockless gates in wide-datapath circuits.
Hongjie XU Jun SHIOMI Hidetoshi ONODERA
Hardware accelerators are designed to support a specialized processing dataflow for ever-changing deep neural networks (DNNs) under various processing environments. This paper introduces two hardware properties that describe the cost of data movement in each level of the memory hierarchy. Based on these hardware properties, this paper proposes a set of evaluation metrics that evaluate the number of memory accesses and the required memory capacity according to the specialized processing dataflow. The proposed metrics can analytically predict the energy, throughput, and area of a hardware design without a detailed implementation. Once a processing dataflow and the constraints on hardware resources are determined, the proposed evaluation metrics quickly quantify the expected hardware benefits, thereby reducing design time.
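To illustrate the flavor of such analytical metrics, the sketch below counts off-chip accesses and buffer capacity for a tiled matrix multiply under an output-stationary dataflow. The dataflow, the formulas, and the word size are standard textbook estimates used here as an example; they are not the paper's metrics.

```python
# Hedged sketch: a generic analytical estimate of memory accesses and required
# on-chip capacity from a dataflow description (not the paper's exact metrics).
import math

def tiled_matmul_cost(M, N, K, Tm, Tn, Tk, word_bytes=2):
    """Estimate DRAM traffic and buffer capacity for C = A(MxK) @ B(KxN) under
    an output-stationary tiling: a Tm x Tn tile of C stays on chip while A and
    B tiles stream through."""
    tiles_m = math.ceil(M / Tm)
    tiles_n = math.ceil(N / Tn)
    # Each element of A is re-fetched once per column tile of C, each element
    # of B once per row tile of C; C is written back once.
    dram_words = M * K * tiles_n + K * N * tiles_m + M * N
    # On-chip buffers must hold one tile of each operand at a time.
    buffer_words = Tm * Tk + Tk * Tn + Tm * Tn
    return dram_words * word_bytes, buffer_words * word_bytes


if __name__ == "__main__":
    # Compare two candidate tilings before any detailed implementation.
    for tile in [(32, 32, 32), (64, 16, 32)]:
        dram, buf = tiled_matmul_cost(1024, 1024, 1024, *tile)
        print(tile, f"DRAM bytes={dram:,}", f"buffer bytes={buf:,}")
```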
Taku SUZUKI Mikihito SUZUKI Kenichi HIGUCHI
This paper proposes a parallel peak cancellation (PC) process for the computational-complexity-efficient algorithm called PC with a channel-null constraint (PCCNC) in the adaptive peak-to-average power ratio (PAPR) reduction method using the null space in a multiple-input multiple-output (MIMO) channel for MIMO-orthogonal frequency division multiplexing (OFDM) signals. By simultaneously adding multiple PC signals to the time-domain transmission signal vector, the required number of iterations of the iterative algorithm is effectively reduced along with the PAPR. We impose a constraint in which the PC signal is transmitted only to the null space in the MIMO channel by beamforming (BF). By doing so, the data streams do not experience interference from the PC signal on the receiver side. Since, unlike the previous algorithm, fast Fourier transform (FFT) and inverse FFT (IFFT) operations are not required at each iteration, and thanks to the newly introduced parallel processing approach, the enhanced PCCNC algorithm reduces the required total computational complexity and the number of iterations compared to the previous algorithms while achieving the same throughput-vs.-PAPR performance.
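The sketch below shows a generic time-domain peak-cancellation step for a single OFDM symbol, to make the basic mechanism concrete. It omits the MIMO null-space beamforming constraint and the parallel PC structure that are the contributions of the paper; the subcarrier count and clipping threshold are arbitrary.

```python
# Hedged sketch: generic peak cancellation on one OFDM symbol.
import numpy as np

def papr_db(x):
    p = np.abs(x) ** 2
    return 10 * np.log10(p.max() / p.mean())

rng = np.random.default_rng(0)
n_sc = 256                                            # number of subcarriers
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])
X = rng.choice(qpsk, n_sc)                            # frequency-domain symbols
x = np.fft.ifft(X) * np.sqrt(n_sc)                    # time-domain OFDM signal

threshold = 1.5 * np.sqrt(np.mean(np.abs(x) ** 2))
# PC signal: cancel the amplitude excess of every sample above the threshold.
excess = np.maximum(np.abs(x) - threshold, 0.0)
pc_signal = -excess * np.exp(1j * np.angle(x))
x_reduced = x + pc_signal                             # add PC signals at once

print(f"PAPR before: {papr_db(x):.2f} dB, after: {papr_db(x_reduced):.2f} dB")
```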
In this paper, we extend the notion of bijective connection graphs to introduce directed bijective connection graphs. We propose algorithms that solve the node-to-set node-disjoint paths problem and the node-to-node node-disjoint paths problem in a directed bijective connection graph. The time complexities of the algorithms are both O(n^4), and the maximum path lengths are both 2n - 1.
Junsuk PARK Nobuhiro SEKI Keiichi KANEKO
In topologies for interconnecting nodes, a low degree and a small diameter are desirable. For the same number of nodes, a dual-cube topology has almost half the degree of a hypercube while increasing the diameter by just one. Hence, it is a promising topology for interconnection networks of massively parallel systems. We propose here a stochastic fault-tolerant routing algorithm that finds a non-faulty path from a source node to a destination node in a dual-cube.
The Möbius cube is a variant of the hypercube. Its advantage is that it can connect the same number of nodes as a hypercube with almost half the diameter of the hypercube. We propose an algorithm that solves the node-to-node disjoint paths problem in n-Möbius cubes in polynomial-order time of n. We provide a proof of correctness of the algorithm and estimate that the time complexity is O(n^2) and the maximum path length is 3n - 5.
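For readers unfamiliar with the topology, the sketch below generates the neighbors of a node in an n-Möbius cube using the commonly cited definition (a virtual bit x_n fixed to 0 or 1 distinguishes the 0- and 1-Möbius cubes). It only shows the graph structure that the disjoint-paths algorithms operate on, not the algorithms themselves.

```python
# Hedged sketch: neighbor generation in an n-Möbius cube (standard definition).

def mobius_neighbors(x: int, n: int, variant: int = 0):
    """Return the n neighbors of node x (an n-bit integer) in an n-Möbius cube."""
    neighbors = []
    for i in range(n):
        # x_{i+1}: the bit above position i (the virtual bit x_n for i = n-1).
        upper = variant if i == n - 1 else (x >> (i + 1)) & 1
        if upper == 0:
            y = x ^ (1 << i)                  # complement bit i only
        else:
            y = x ^ ((1 << (i + 1)) - 1)      # complement bits i, i-1, ..., 0
        neighbors.append(y)
    return neighbors


if __name__ == "__main__":
    n = 4
    print(mobius_neighbors(0b1010, n, variant=0))
    # Sanity check: every edge of the resulting graph is bidirectional.
    for u in range(1 << n):
        for v in mobius_neighbors(u, n):
            assert u in mobius_neighbors(v, n)
```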
Takumi HONDA Yasuaki ITO Koji NAKANO
In this paper, we present a GPU implementation of bulk multiple-length multiplications. The idea of our GPU implementation is to adopt a warp-synchronous programming technique. We assign each multiple-length multiplication to one warp, which consists of 32 threads. In parallel processing using multiple threads, it is usually costly to synchronize the execution of threads and to communicate between threads. With the warp-synchronous programming technique, however, the execution of threads in a warp can be synchronized instruction by instruction without any barrier synchronization operations. Also, inter-thread communication can be performed by warp shuffle functions without accessing shared memory. The experimental results show that our GPU implementation on an NVIDIA GeForce GTX 980 attains a speed-up factor of 52 for 1024-bit multiple-length multiplication over the sequential CPU implementation. Moreover, we use this 1024-bit multiple-length multiplication as a subroutine for larger bit sizes. The GPU implementation attains a speed-up factor of 21 for 65536-bit multiple-length multiplication.
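The sketch below shows the limb-level schoolbook multiplication underlying such a 1024-bit multiply, written sequentially in Python. In the GPU version the limbs of each operand are presumably distributed over the 32 threads of a warp with partial products exchanged via warp shuffle functions; here a plain loop stands in for that lane-parallel exchange.

```python
# Hedged sketch: 1024-bit (32-limb) schoolbook multiplication with 32-bit limbs.

LIMB_BITS = 32
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs(x: int, n_limbs: int):
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n_limbs)]

def from_limbs(limbs):
    return sum(l << (LIMB_BITS * i) for i, l in enumerate(limbs))

def mul_multiple_length(a: int, b: int, n_limbs: int = 32) -> int:
    """Multiply two (n_limbs * 32)-bit numbers limb by limb with carry propagation."""
    A, B = to_limbs(a, n_limbs), to_limbs(b, n_limbs)
    prod = [0] * (2 * n_limbs)
    for i in range(n_limbs):
        carry = 0
        for j in range(n_limbs):
            t = prod[i + j] + A[i] * B[j] + carry
            prod[i + j] = t & LIMB_MASK
            carry = t >> LIMB_BITS
        prod[i + n_limbs] += carry
    return from_limbs(prod)

if __name__ == "__main__":
    import random
    a, b = random.getrandbits(1024), random.getrandbits(1024)
    assert mul_multiple_length(a, b) == a * b
    print("1024-bit product matches Python's built-in big-integer multiply")
```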
David KOCIK Yuki HIRAI Keiichi KANEKO
This paper proposes an algorithm that solves the node-to-set disjoint paths problem in an n-Möbius cube in polynomial-order time of n. It also gives a proof of correctness of the algorithm as well as estimates of the time complexity, O(n^4), and the maximum path length, 2n - 1. A computer experiment is conducted for n = 1, 2, ..., 31 to measure the average performance of the algorithm. The results show that the average time complexity gradually approaches O(n^3) and that the maximum path length is not easily attained over the range of n in the experiment.
Yuta MATSUI Shinji FUKUMA Shin-ichiro MORI
In this paper, a repeatable hybrid parallel implementation of inverse matrix computation using the SMW formula is proposed. The authors had previously proposed a hybrid parallel algorithm for inverse matrix computation. It is reasonably fast for a one-time computation of an inverse matrix, but it is hard to apply repeatedly for consecutive computations, since the relocation of the large matrix is required at the beginning of each iteration. In order to eliminate the relocation of the large input matrix, which is the output of the inverse matrix computation from the previous time step, the computation algorithm has been redesigned so that the required portion of the input matrix becomes the same as the output portion of the previously computed matrix in each node. This makes it possible to repeatedly and efficiently apply the SMW formula to compute inverse matrices in a time-series simulation.
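For reference, the sketch below shows the Sherman-Morrison-Woodbury (SMW) identity serially with NumPy: given the inverse from the previous step, the inverse after a low-rank change is obtained without re-inverting the full matrix. The block partitioning and node-level data placement of the hybrid parallel implementation are not reproduced here; the matrix sizes are arbitrary.

```python
# Hedged sketch: serial SMW (Woodbury) update, shown for a low-rank change.
import numpy as np

def smw_update(A_inv, U, C, V):
    """Return (A + U C V)^{-1} from A^{-1} using the Woodbury identity."""
    S = np.linalg.inv(C) + V @ A_inv @ U          # small k x k system
    return A_inv - A_inv @ U @ np.linalg.inv(S) @ V @ A_inv

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k = 500, 5
    A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned
    U = rng.standard_normal((n, k))
    C = np.eye(k)
    V = rng.standard_normal((k, n))
    A_inv = np.linalg.inv(A)
    updated = smw_update(A_inv, U, C, V)
    assert np.allclose(updated, np.linalg.inv(A + U @ C @ V))
    print("SMW update matches direct inversion")
```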
Xiantao JIANG Tian SONG Takashi SHIMAMOTO Wen SHI Lisheng WANG
The next-generation high efficiency video coding (HEVC) standard achieves high performance by extending the encoding block to 64×64. The standard provides several parallel tools to improve the efficiency of the encoder and decoder. However, owing to the dependence between the current prediction block and its surrounding blocks, parallel processing at the CU and sub-CU levels is hard to achieve. In this paper, focusing on spatial motion vector prediction (SMVP) and temporal motion vector prediction (TMVP), parallel improvements to the spatio-temporal prediction algorithms are presented, which remove the dependency between prediction units and neighboring coding units. With this proposal, motion estimation can conveniently be processed in parallel, which suits different parallel platforms such as multi-core processors and the compute unified device architecture (CUDA). Simulation results on the HM12.0 test model with different test sequences demonstrate that the proposed algorithm improves advanced motion vector prediction with only a 0.01% BD-rate increase, which is better than previous work, while the BD-PSNR is almost the same as that of the HEVC reference software.
In recent years, there has been significant growth in the importance of data mining of graph-structured data due to the rapid increase in both the scale and the application areas of this technology. Many previous studies have investigated decision tree learning on Semantic Web-based linked data to uncover implicit knowledge. In the present paper, we propose a new random forest algorithm for linked data to overcome the underlying limitations of the decision tree algorithm, such as local optimal decisions and generalization error. Moreover, we designed a parallel processing environment for random forest learning to manage large-scale linked data and increase the efficiency of multiple-tree generation. For this purpose, we modified the previous candidate-feature searching method of the decision tree algorithm for linked data to reduce the feature searching space of random forest learning, and developed feature selection methods that are adjusted to linked data. Using a distributed index-based search engine, we designed a parallel random forest learning system for linked data to generate random forests in parallel. Our proposed system enables users to simultaneously generate multiple decision trees from distributed stored linked data. To evaluate the performance of the proposed algorithm, we performed experiments comparing its classification accuracy with that of the single decision tree algorithm. The experimental results revealed that our random forest algorithm is more accurate than the single decision tree algorithm.
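The sketch below shows parallel generation of the trees of a random forest in the same spirit, but on ordinary feature vectors rather than linked data: scikit-learn trees, the iris dataset, and worker processes are stand-ins for the linked-data feature search and the distributed index-based search engine.

```python
# Hedged sketch: building the trees of a random forest in parallel processes.
from concurrent.futures import ProcessPoolExecutor
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def build_tree(args):
    X, y, seed = args
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), len(X))          # bootstrap sample
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    return tree.fit(X[idx], y[idx])

def predict_forest(trees, X):
    votes = np.stack([t.predict(X) for t in trees])
    # majority vote over the trees, column by column
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

if __name__ == "__main__":
    X, y = load_iris(return_X_y=True)
    with ProcessPoolExecutor() as pool:
        trees = list(pool.map(build_tree, [(X, y, s) for s in range(32)]))
    acc = (predict_forest(trees, X) == y).mean()
    print(f"training accuracy of the 32-tree forest: {acc:.3f}")
```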
SangHyuck BAE DoYoung JUNG CheolSe KIM KyoungMoon LIM Yong-Surk LEE
For a large-sized touch screen, we designed and evaluated a real-time touch microarchitecture using a field-programmable gate array (FPGA). A high-speed hardware accelerator based on a parallel touch algorithm is proposed and implemented in this letter. The touch controller also has a timing control unit and an analog-to-digital converter (ADC) control unit for analog touch sensing circuits. Measurements of the processing time showed that the touch controller with the proposed microarchitecture is five times faster than a 32-bit reduced instruction set computer (RISC) processor without the touch accelerator.
Ryota TAKASU Yoichi TOMIOKA Yutaro ISHIGAKI Ning LI Tsugimichi SHIBATA Mamoru NAKANISHI Hitoshi KITAZAWA
Electromagnetic field analysis is a time-consuming process, and using an FPGA accelerator is one attractive way to accelerate the analysis; other approaches involve the use of CPUs and GPUs. In this paper, we propose an FPGA accelerator dedicated to the two-dimensional finite-difference time-domain (FDTD) method. The accelerator is based on a two-dimensional single instruction multiple data (SIMD) array architecture. Each processing element (PE) is composed of a six-stage pipeline that is optimized for the FDTD method. Moreover, driving-signal generation and impedance termination are also implemented in hardware. We demonstrate that our accelerator is 11 times faster than existing FPGA accelerators and 9 times faster than parallel computing on the NVIDIA Tesla C2075. As an application of the high-speed FDTD accelerator, the design optimization of a waveguide is shown.
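The sketch below is the standard two-dimensional FDTD (TMz Yee) update that such an accelerator evaluates over its PE array, written with NumPy. The grid size, material constants, source, and fixed boundaries are illustrative; the hardware driving-signal generation and impedance termination are reduced here to a sinusoidal point source.

```python
# Hedged sketch: standard 2-D FDTD (TMz) update loop on a small grid.
import numpy as np

nx, ny, n_steps = 200, 200, 300
c, dx = 3e8, 1e-3
dt = dx / (2 * c)                      # satisfies the 2-D Courant condition
eps0, mu0 = 8.854e-12, 4e-7 * np.pi

ez = np.zeros((nx, ny))
hx = np.zeros((nx, ny - 1))
hy = np.zeros((nx - 1, ny))

for t in range(n_steps):
    # update magnetic fields from the spatial differences of Ez
    hx -= (dt / (mu0 * dx)) * (ez[:, 1:] - ez[:, :-1])
    hy += (dt / (mu0 * dx)) * (ez[1:, :] - ez[:-1, :])
    # update Ez in the interior from the curl of H
    ez[1:-1, 1:-1] += (dt / (eps0 * dx)) * (
        (hy[1:, 1:-1] - hy[:-1, 1:-1]) - (hx[1:-1, 1:] - hx[1:-1, :-1])
    )
    # sinusoidal point source driving the center of the grid
    ez[nx // 2, ny // 2] += np.sin(2 * np.pi * 1e10 * t * dt)

print("peak |Ez| after", n_steps, "steps:", np.abs(ez).max())
```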
Hidenori KUWAKADO Shoichi HIROSE
A hash function is an important primitive for cryptographic protocols. Since the algorithms of well-known hash functions are almost serial, it seems difficult to take full advantage of recent multi-core processors. This paper proposes a multilane hashing (MLH) mode that achieves both high parallelism and high security. The MLH mode is designed in such a way that its processing speed is almost linear in the number of processors. Since the MLH mode exploits an existing hash function as a black box, it is applicable to any hash function. The bound on the indifferentiability of the MLH mode from a random oracle is beyond the birthday bound on the output length of the underlying primitive.
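The sketch below illustrates only the general lane-splitting idea of using an existing hash function as a black box across parallel lanes; it is not the MLH construction itself, and its encoding carries none of the security analysis of the paper. SHA-256 and Python threads are stand-ins for the underlying primitive and the processor cores.

```python
# Hedged sketch: illustrative lane-splitting over a black-box hash (NOT MLH).
import hashlib
from concurrent.futures import ThreadPoolExecutor

def multilane_digest(message: bytes, lanes: int = 4) -> bytes:
    block = 64
    blocks = [message[i:i + block] for i in range(0, len(message), block)]
    # distribute the message blocks round-robin over the lanes
    lane_inputs = [b"".join(blocks[j] for j in range(i, len(blocks), lanes))
                   for i in range(lanes)]
    with ThreadPoolExecutor(max_workers=lanes) as pool:
        lane_digests = list(pool.map(lambda m: hashlib.sha256(m).digest(),
                                     lane_inputs))
    # combine the lane results with one final call to the black-box hash
    return hashlib.sha256(len(message).to_bytes(8, "big")
                          + b"".join(lane_digests)).digest()

if __name__ == "__main__":
    print(multilane_digest(b"x" * 1_000_000, lanes=4).hex())
```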
Song-Hyon KIM Kyong-Ha LEE Inchul SONG Hyebong CHOI Yoon-Joon LEE
In this letter, we address the problem of processing graph pattern matching queries over a massive set of data graphs. As the number of data graphs grows rapidly, it is often hard to process such queries with serial algorithms in a timely manner. We propose a distributed graph querying algorithm, which employs feature-based comparison and a filter-and-verify scheme working on the MapReduce framework. Moreover, we devise an efficient scheme that adaptively tunes a proper feature size at runtime by sampling data graphs. Through various experiments, we show that the proposed method outperforms conventional algorithms in terms of scalability and efficiency.
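The sketch below shows the filter-and-verify idea on a handful of data graphs, using node-label counts as a simple stand-in for the features; the actual feature type, the adaptive feature-size tuning, and the MapReduce distribution of the proposed method are not reproduced. It relies on networkx for the verification step.

```python
# Hedged sketch: filter-and-verify graph pattern matching with label-count features.
from collections import Counter
import networkx as nx
from networkx.algorithms import isomorphism

def label_features(g: nx.Graph) -> Counter:
    return Counter(data["label"] for _, data in g.nodes(data=True))

def query(data_graphs, pattern):
    q_feat = label_features(pattern)
    matches = []
    for g in data_graphs:
        g_feat = label_features(g)
        # filter: g can contain the pattern only if it has at least as many
        # nodes of every label as the pattern does
        if any(g_feat[lbl] < cnt for lbl, cnt in q_feat.items()):
            continue
        # verify: exact subgraph isomorphism test on the surviving candidates
        gm = isomorphism.GraphMatcher(
            g, pattern, node_match=lambda a, b: a["label"] == b["label"])
        if gm.subgraph_is_isomorphic():
            matches.append(g)
    return matches

if __name__ == "__main__":
    g1 = nx.path_graph(4)
    nx.set_node_attributes(g1, {0: "A", 1: "B", 2: "A", 3: "C"}, "label")
    g2 = nx.path_graph(3)
    nx.set_node_attributes(g2, {0: "A", 1: "A", 2: "A"}, "label")
    pattern = nx.path_graph(2)
    nx.set_node_attributes(pattern, {0: "A", 1: "B"}, "label")
    print(len(query([g1, g2], pattern)))   # g1 matches, g2 is filtered out
```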
Shotaro IWANAGA Shinji FUKUMA Shin-ichiro MORI
In this paper, a hybrid parallel implementation of inverse matrix computation using the SMW formula is proposed. By aggregating the memory bandwidth in the hybrid parallel implementation, the bottleneck due to the memory bandwidth limitation in the authors' previous multicore implementation has been resolved. A speed-up of more than 8 times is also achieved with a dual-core, 8-node implementation, which yields more than 20 simulation steps per second, i.e., near real-time performance.
Turbo codes suffer from high decoding latency, which hinders their utilization in many communication systems. Parallel decodable turbo codes (PDTCs) are suitable for parallel decoding and hence have low latency. In this article, we analyze the worst-case minimum distance of parallel decodable turbo codes with both the S-random interleaver and the memory-collision-free Row-Column S-random interleaver. The effect of the minimum distance on code performance is determined through computer simulations.
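The sketch below generates a plain S-random interleaver of length N, the classic spread-interleaver construction referred to above: each candidate index is accepted only if it differs by more than S from the S most recently accepted indices, with a restart if the construction gets stuck. The Row-Column variant and the memory-collision constraint are not shown.

```python
# Hedged sketch: S-random interleaver generation.
import random

def s_random_interleaver(N: int, S: int, max_restarts: int = 1000):
    for _ in range(max_restarts):
        remaining = list(range(N))
        random.shuffle(remaining)
        chosen = []
        while remaining:
            for pos, cand in enumerate(remaining):
                if all(abs(cand - prev) > S for prev in chosen[-S:]):
                    chosen.append(remaining.pop(pos))
                    break
            else:
                break          # stuck: restart with a fresh shuffle
        if len(chosen) == N:
            return chosen
    raise RuntimeError("failed to build the interleaver; try a smaller S")

if __name__ == "__main__":
    N = 256
    # S around sqrt(N/2) is a commonly used choice that converges quickly.
    pi = s_random_interleaver(N, S=int((N / 2) ** 0.5))
    print(pi[:16])
```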
Junichi OHMURA Takefumi MIYOSHI Hidetsugu IRIE Tsutomu YOSHINAGA
In this paper, we propose an approach to obtaining enhanced performance of the Linpack benchmark on a GPU-accelerated PC cluster connected via relatively slow inter-node connections. For one node with a quad-core Intel Xeon W3520 processor and an NVIDIA Tesla C1060 GPU card, we implement a CPU–GPU parallel double-precision general matrix–matrix multiplication (dgemm) operation, and achieve a performance improvement of 34% compared with the GPU-only case and 64% compared with the CPU-only case. For an entire 16-node cluster, each node of which is the same as the above and is connected with two gigabit Ethernet links, we use a computation-communication overlap scheme with GPU acceleration for the Linpack benchmark, and achieve a performance improvement of 28% compared with the GPU-accelerated high-performance Linpack benchmark (HPL) without overlapping. Our solution overlaps the main inter-node communication and the data transfer to the GPU device memory with the main computation task on the CPU cores. These overlaps exploit the multi-core processors used in almost all of today's high-performance computers. In particular, as well as using a CPU core for communication tasks, we also simultaneously use other CPU cores and the GPU for computation tasks. In order to enable overlap between inter-node communication and computation tasks, we eliminate their close dependence by breaking the main computation task into smaller tasks and rescheduling them. Based on a scheme in which part of the CPU computation power is simultaneously used for tasks other than computation, we experimentally find the optimal computation ratio for the CPUs; this ratio differs from that of the single-node parallel dgemm operation.
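The sketch below shows the general pattern of overlapping communication with computation by splitting the main task into smaller pieces, with a Python thread and a simulated transfer standing in for the real MPI communication and GPU offload of the Linpack implementation; all sizes and timings are illustrative.

```python
# Hedged sketch: overlapping (simulated) communication with chunked computation.
import threading, time
import numpy as np

def fake_communication(duration_s=0.5):
    time.sleep(duration_s)                 # stands in for the inter-node transfer

def chunked_dgemm(A, B, chunk=256):
    # break the update into column-panel multiplications so communication
    # and data transfers can proceed alongside the chunks
    C = np.empty((A.shape[0], B.shape[1]))
    for j in range(0, B.shape[1], chunk):
        C[:, j:j + chunk] = A @ B[:, j:j + chunk]
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((2048, 2048)), rng.standard_normal((2048, 2048))

    start = time.perf_counter()
    fake_communication(); chunked_dgemm(A, B)          # no overlap
    serial = time.perf_counter() - start

    start = time.perf_counter()
    comm = threading.Thread(target=fake_communication)
    comm.start()                                       # one "core" communicates
    chunked_dgemm(A, B)                                # the others keep computing
    comm.join()
    overlapped = time.perf_counter() - start
    print(f"serial {serial:.2f}s vs overlapped {overlapped:.2f}s")
```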
Hideki MIWA Ryutaro SUSUKITA Hidetomo SHIBAMURA Tomoya HIRAO Jun MAKI Makoto YOSHIDA Takayuki KANDO Yuichiro AJIMA Ikuo MIYOSHI Toshiyuki SHIMIZU Yuji OINAGA Hisashige ANDO Yuichi INADOMI Koji INOUE Mutsumi AOYAGI Kazuaki MURAKAMI
In the near future, interconnection networks of massively parallel computer systems will connect more than a hundred thousand computing nodes. The performance evaluation of these interconnection networks can provide real insights to help the development of efficient communication libraries. Hence, to evaluate the performance of such interconnection networks, simulation tools that can model the networks in sufficient detail, support a user-friendly interface for describing communication patterns, provide users with enough performance information, and complete simulations within a reasonable time are a real necessity. This paper introduces NSIM, a novel simulator for evaluating the performance of extreme-scale interconnection networks. The simulator implements a simplified simulation model so as to run faster without any loss of accuracy. Unlike existing simulators, NSIM is built on the execution-driven simulation approach. The simulator also provides an MPI-compatible programming interface. Thus, the simulator can emulate parallel program execution and correctly simulate point-to-point and collective communications that are dynamically changed by network congestion. Experimental results show that the simulator is sufficiently accurate compared with a real machine. We also confirmed that the simulator is capable of evaluating ultra-large-scale interconnection networks, consumes less memory, and runs faster than the existing simulator. This paper also introduces a simulation service built on a cloud environment. Without installing NSIM, users can simulate interconnection networks with various configurations by using a web browser.