Marcus WALLDEN Stefano MARKIDIS Masao OKITA Fumihiko INO
We propose a novel compositing pipeline and a dynamic load balancing technique for volume rendering that utilize a two-layered group structure to achieve effective and scalable load balancing. The technique enables each process to render data from non-contiguous regions of the volume with minimal impact on the total render time. We demonstrate the effectiveness of the proposed technique through a set of experiments on a modern GPU cluster. The experiments show that the technique lowers the worst-case memory usage by up to 35.7% compared to a dynamic k-d tree load balancing technique, while achieving similar or higher render performance. The proposed technique also lowered the amount of data transferred during the load balancing stage by up to 72.2%. The technique has the potential to be used in many scenarios where other dynamic load balancing techniques have proved inadequate, such as large-scale visualization.
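A minimal Python sketch of the two-layered (group/process) rebalancing idea follows. The greedy policy, cost model, and all names are illustrative assumptions, not the paper's algorithm, which would also account for transfer cost and compositing order.

```python
from collections import defaultdict

def total_loads(brick_costs, assignment):
    """Sum the render cost currently assigned to each process."""
    loads = defaultdict(float)
    for brick, proc in assignment.items():
        loads[proc] += brick_costs[brick]
    return loads

def move_one_brick(brick_costs, assignment, src, dst):
    """Move the cheapest brick from process src to process dst, if any exists."""
    movable = [b for b, p in assignment.items() if p == src]
    if movable:
        assignment[min(movable, key=lambda b: brick_costs[b])] = dst

def rebalance(brick_costs, assignment, groups, steps=8):
    """Two-layered greedy rebalancing: within each group first, then across groups."""
    by_group = defaultdict(list)
    for proc, grp in groups.items():
        by_group[grp].append(proc)

    for _ in range(steps):
        loads = total_loads(brick_costs, assignment)
        # Layer 1: inside every group, shift work from its busiest to its idlest process.
        for procs in by_group.values():
            busiest = max(procs, key=lambda p: loads.get(p, 0.0))
            idlest = min(procs, key=lambda p: loads.get(p, 0.0))
            if loads.get(busiest, 0.0) > loads.get(idlest, 0.0):
                move_one_brick(brick_costs, assignment, busiest, idlest)
        # Layer 2: shift one brick from the heaviest group to the lightest group.
        loads = total_loads(brick_costs, assignment)
        group_load = {g: sum(loads.get(p, 0.0) for p in ps) for g, ps in by_group.items()}
        heavy = max(group_load, key=group_load.get)
        light = min(group_load, key=group_load.get)
        if heavy != light:
            src = max(by_group[heavy], key=lambda p: loads.get(p, 0.0))
            dst = min(by_group[light], key=lambda p: loads.get(p, 0.0))
            move_one_brick(brick_costs, assignment, src, dst)
    return assignment

# Toy usage: four processes in two groups, twelve bricks initially held by group g0.
groups = {0: "g0", 1: "g0", 2: "g1", 3: "g1"}
assignment = {b: b % 2 for b in range(12)}
costs = {b: 1.0 + (b % 3) for b in range(12)}
print(rebalance(costs, assignment, groups))
```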
Yuechao LU Fumihiko INO Kenichi HAGIHARA
This paper proposes a cache-aware optimization method to accelerate out-of-core cone-beam computed tomography reconstruction on a graphics processing unit (GPU). Our proposed method extends a previous method by increasing the cache hit rate so as to speed up the reconstruction of high-resolution volumes that exceed the capacity of device memory. More specifically, our approach accelerates the well-known Feldkamp-Davis-Kress (FDK) algorithm with three strategies: (1) a loop organization strategy that identifies the best tradeoff point between the cache hit rate and the number of off-chip memory accesses; (2) a data structure that exploits high locality within a layered texture; and (3) a fully pipelined strategy that hides file input/output (I/O) time behind GPU execution and data transfer. We implement our proposed method on NVIDIA's latest Maxwell architecture and provide tuning guidelines for the execution parameters, namely the granularity and shape of thread blocks and the granularity of the I/O data streamed through the pipeline, so as to maximize reconstruction performance. Our experimental results show that reconstructing a 2048³-voxel volume from 1200 2048²-pixel projection images took less than three minutes on a single GPU; this translates to a speedup of approximately 1.47 over the previous method.
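Below is a minimal Python sketch of the pipelining idea only: projection reads from disk are overlapped with reconstruction work on chunks that are already in memory. The chunk size, file format, and the backproject() body are illustrative assumptions; the FDK kernel and GPU transfers are not reproduced here.

```python
import threading, queue
import numpy as np

def read_stage(paths, q, chunk=64):
    """Producer: stream projection images from disk in fixed-size chunks."""
    for i in range(0, len(paths), chunk):
        batch = np.stack([np.load(p) for p in paths[i:i + chunk]])
        q.put(batch)
    q.put(None)  # end-of-stream marker

def backproject(volume, batch):
    """Placeholder for FDK filtering and back-projection of one projection chunk."""
    volume += batch.sum()  # stand-in for the real per-voxel update

def reconstruct(paths, vol_shape):
    q = queue.Queue(maxsize=2)          # a small bounded queue limits memory usage
    volume = np.zeros(vol_shape, dtype=np.float32)
    reader = threading.Thread(target=read_stage, args=(paths, q))
    reader.start()
    while True:
        batch = q.get()                 # blocks until the next chunk is loaded
        if batch is None:
            break
        backproject(volume, batch)      # runs while the reader loads the next chunk
    reader.join()
    return volume
```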
Yasuharu MIZUTANI Fumihiko INO Kenichi HAGIHARA
This paper describes the design and implementation of a testbed for predicting the performance of master/slave (M/S) programs written using the Message Passing Interface (MPI). The testbed, named M/S Emulator (MSE), aims at assisting developers in evaluating the performance of M/S programs and dynamic load-balancing strategies on clusters of PCs. To realize this, MSE predicts the communication time by using a realistic parallel computational model, an extension of the LogGPS model. This extended model improves the prediction accuracy on a large number of processors because it captures the master's bottleneck: the overhead required for retrieving arriving messages from the slaves. The current MSE also employs a best-effort emulation method for predicting the calculation time. In our experiments, MSE demonstrated accurate predictions on clusters, especially for larger numbers of nodes. We therefore believe that our extended model enables us to analyze the scalability of M/S program performance.
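The following is an illustrative LogGPS-style estimate, not the paper's actual extended model: it shows how a per-message retrieval overhead at the master serializes a gather from P slaves, which is the bottleneck the abstract highlights. The formula and the parameter o_r are assumptions for illustration only.

```python
def gather_time(P, L, o, g, o_r):
    """Toy prediction of the time for a master to gather one small message
    from each of P slaves (L: latency, o: send/receive overhead, g: gap,
    o_r: assumed per-message retrieval overhead at the master)."""
    # Each slave pays a send overhead o; messages arrive after latency L.
    # The master then retrieves the P arriving messages one by one, so the
    # per-message cost max(g, o + o_r) serializes at the master.
    return o + L + P * max(g, o + o_r)

# Example: 64 slaves with microsecond-scale parameters.
print(gather_time(P=64, L=10e-6, o=2e-6, g=4e-6, o_r=3e-6))
```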
Yuji MISAKI Fumihiko INO Kenichi HAGIHARA
We propose a cache-aware method to accelerate texture-based volume rendering on a graphics processing unit (GPU) that is compatible with the compute unified device architecture. The proposed method extends a previous method such that it can maximize the average rendering performance while rotating the viewing direction around a volume. To realize this, the proposed method performs in-place rotation of volume data, which rearranges the order of voxels to allow consecutive threads (warps) to refer to voxels with the minimum access strides. Experiments indicate that the proposed method replaces the worst texture cache (TC) hit rate of 42% with the best TC hit rate of 93% for a 1024³-voxel volume. Thus, the average frame rate increases by a factor of 1.6 in the proposed method compared with that in the previous method. Although the overhead of in-place rotation slightly decreases the frame rate from 2.0 frames per second (fps) to 1.9 fps, this slowdown occurs only with a few viewing directions.
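A minimal numpy sketch of the underlying idea: reorder the voxel layout so that the axis traversed by consecutive threads becomes the fastest-varying (unit-stride) axis. numpy is used only to show the stride change; the paper performs the rearrangement in place on the GPU.

```python
import numpy as np

vol = np.zeros((256, 256, 256), dtype=np.float32)   # stored in z, y, x order
print(vol.strides)          # x is the unit-stride axis: (262144, 1024, 4)

# If a new viewing direction makes threads walk along z, rotate the layout so
# that z becomes contiguous. ascontiguousarray materializes the new voxel
# order (the paper's method rearranges the existing volume in place instead).
rotated = np.ascontiguousarray(np.transpose(vol, (2, 1, 0)))
print(rotated.strides)      # the former z axis now has unit stride
```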
Jingcheng SHEN Fumihiko INO Albert FARRÉS Mauricio HANZICH
Graphics processing units (GPUs) are highly efficient architectures for parallel stencil code; however, the small device (i.e., GPU) memory capacity (several tens of GBs) necessitates out-of-core computation to process excess data. Great programming effort is needed to manually implement efficient out-of-core stencil code. To relieve this burden, directive-based frameworks such as the pipelined accelerator (PACC) have emerged; however, they usually lack specific optimizations to reduce data transfer. In this paper, we extend PACC with two data-centric optimizations that address data transfer problems. The first is a direct-mapping scheme that eliminates the host (i.e., CPU) buffers mediating between the original data and the device buffers. The second is a region-sharing scheme that significantly reduces host-to-device data transfer. The extended PACC was applied to an acoustic wave propagator, automatically extending the original serial code 2.3-fold in length to obtain the out-of-core code. Experimental results revealed that on a Tesla V100 GPU, the generated code ran 41.0, 22.1, and 3.6 times as fast as implementations based on Open Multi-Processing (OpenMP), Unified Memory, and the previous PACC, respectively. The generated code also proved useful for small datasets that fit within device memory, running 1.3 times as fast as an in-core implementation.
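A minimal Python sketch of out-of-core stencil processing: the array is swept in chunks that each carry a one-point halo, so only the chunk being computed must reside in (simulated) device memory. The 3-point stencil and chunk size are illustrative assumptions; PACC's directive-based code generation, direct mapping, and region sharing are not reproduced here.

```python
import numpy as np

def stencil_chunked(u, chunk=1 << 20):
    """One 3-point averaging sweep over a large 1-D array, processed chunk by chunk."""
    out = u.copy()                          # boundary points keep their old values
    n = u.size
    for start in range(1, n - 1, chunk):
        stop = min(start + chunk, n - 1)
        tile = u[start - 1: stop + 1]       # chunk plus a 1-point halo on each side
        # In an out-of-core run, only this tile would be resident on the device.
        out[start:stop] = (tile[:-2] + tile[1:-1] + tile[2:]) / 3.0
    return out

u = np.random.default_rng(0).random(5_000_000)
print(stencil_chunked(u)[:5])
```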
Fumihiko INO Shinta NAKAGAWA Kenichi HAGIHARA
This paper presents a stream programming framework, named GPU-chariot, for accelerating stream applications running on graphics processing units (GPUs). The main contribution of our framework is that it realizes efficient software pipelines on multi-GPU systems by enabling out-of-order execution of CPU functions, kernels, and data transfers. To achieve this out-of-order execution, we apply a runtime scheduler that not only maximizes the utilization of system resources but also encapsulates the number of GPUs available in the system. In addition, we implement a load-balancing capability to stream data efficiently through multiple GPUs. Furthermore, a callback interface enables overlapping execution of functions in third-party libraries. Using kernels with different performance bottlenecks, we show that our out-of-order execution is up to 20% faster than in-order execution. Finally, we conduct several case studies on a 4-GPU system and demonstrate the advantages of GPU-chariot over manually pipelined code. We conclude that GPU-chariot can be useful when developing stream applications with software pipelines on multiple GPUs and CPUs.
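A toy Python sketch of the scheduling idea: one worker per (simulated) GPU pulls ready tasks from a shared queue, so independent tasks complete out of order and the caller never sees how many GPUs exist. The worker bodies are placeholders; the runtime scheduler described in the abstract additionally overlaps data transfers and third-party library calls, which this toy omits.

```python
import threading, queue

def run_stream_app(tasks, num_gpus=4):
    """tasks: list of callables taking a gpu_id; returns results in completion order."""
    work, done = queue.Queue(), queue.Queue()
    for t in tasks:
        work.put(t)

    def worker(gpu_id):
        while True:
            try:
                task = work.get_nowait()
            except queue.Empty:
                return
            done.put((gpu_id, task(gpu_id)))   # results arrive out of order

    threads = [threading.Thread(target=worker, args=(g,)) for g in range(num_gpus)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return [done.get() for _ in tasks]

# Example: eight dummy "kernels" with different run times.
import time, random
results = run_stream_app([lambda g, i=i: (time.sleep(random.random() * 0.1), i)[1]
                          for i in range(8)])
print(results)
```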
Yasuhiro KAWASAKI Fumihiko INO Yoshinobu SATO Shinichi TAMURA Kenichi HAGIHARA
This paper presents the design and implementation of a hip range of motion (ROM) estimation method capable of fine-grained estimation during total hip replacement (THR) surgery. Our method is based on two acceleration strategies: (1) adaptive mesh refinement (AMR) for complexity reduction and (2) parallelization for further acceleration. On the assumption that the hip ROM is a single closed region, the AMR strategy reduces the complexity for N×N×N stance configurations from O(N^3) to O(N^D), where 2≤D≤3 and D is a data-dependent value that can be approximated by 2 in most cases. The parallelization strategy employs the master-worker paradigm with multiple task queues, reducing synchronization between processors while balancing the load. The experimental results indicate that the implementation on a cluster of 64 PCs completes the estimation of 360×360×180 stance configurations in 20 seconds, playing a key role in selecting and aligning the optimal combination of artificial joint components during THR surgery.
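A minimal 2-D Python sketch of the AMR idea: evaluate feasibility on a coarse grid and refine only the cells that straddle the boundary of the (assumed single, closed) feasible region, so the work scales with the boundary rather than the full grid. The disk-shaped test region and two-level refinement are illustrative assumptions; the paper's setting is a 3-D grid of stance configurations.

```python
def feasible(x, y):
    """Stand-in for a collision test: inside the unit disk counts as feasible."""
    return x * x + y * y <= 1.0

def amr_area(n_coarse=32, refine=8, lo=-1.5, hi=1.5):
    h = (hi - lo) / n_coarse
    area, evaluations = 0.0, 0
    for i in range(n_coarse):
        for j in range(n_coarse):
            x0, y0 = lo + i * h, lo + j * h
            corners = [feasible(x0 + a * h, y0 + b * h) for a in (0, 1) for b in (0, 1)]
            evaluations += 4
            if all(corners):            # all corners feasible: treat the cell as fully inside
                area += h * h
            elif any(corners):          # boundary cell: refine it with a finer subgrid
                sub = h / refine
                for a in range(refine):
                    for b in range(refine):
                        evaluations += 1
                        if feasible(x0 + (a + 0.5) * sub, y0 + (b + 0.5) * sub):
                            area += sub * sub
            # cells with no feasible corner are treated as fully outside
    return area, evaluations

# Approaches pi while evaluating far fewer points than the equivalent fine grid.
print(amr_area())
```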
Yuechao LU Yasuyuki MATSUSHITA Fumihiko INO
Fast computation of the singular value decomposition (SVD) is of great interest in various machine learning tasks. Recently, SVD methods based on randomized linear algebra have shown significant speedup in this regime. For processing large-scale data, computing systems with accelerators such as GPUs have become the mainstream approach. In those systems, access to the input data dominates the overall processing time; it is therefore necessary to design an out-of-core algorithm that dispatches the computation to the accelerators. This paper proposes an accurate two-pass randomized SVD, named block randomized SVD (BRSVD), designed for matrices with the slowly decaying singular spectrum that is often observed in image data. BRSVD fully utilizes the power of modern computing system architectures and efficiently processes large-scale data in a parallel, out-of-core fashion. Our experiments show that BRSVD effectively moves the performance bottleneck from data transfer to computation, so that it outperforms existing randomized SVD methods in speed while retaining similar accuracy.
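Below is a minimal in-memory Python sketch of a two-pass randomized SVD that touches the input matrix in row blocks, the access pattern an out-of-core version would stream from host to device. The block size, the Gaussian test matrix, and the absence of power iterations are assumptions for illustration; this is not the paper's BRSVD implementation.

```python
import numpy as np

def block_randomized_svd(A, rank, block=1024, oversample=10, seed=0):
    m, n = A.shape
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((n, rank + oversample))

    # Pass 1: sketch Y = A @ omega, reading A one row block at a time.
    Y = np.empty((m, rank + oversample))
    for s in range(0, m, block):
        Y[s:s + block] = A[s:s + block] @ omega
    Q, _ = np.linalg.qr(Y)

    # Pass 2: project B = Q.T @ A, again one row block at a time.
    B = np.zeros((rank + oversample, n))
    for s in range(0, m, block):
        B += Q[s:s + block].T @ A[s:s + block]

    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], S[:rank], Vt[:rank]

# Sanity check on an exactly rank-30 test matrix: the relative error is near
# machine precision because the sketch captures the full column space.
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 30)) @ rng.standard_normal((30, 500))
U, S, Vt = block_randomized_svd(A, rank=30)
print(np.linalg.norm(A - (U * S) @ Vt) / np.linalg.norm(A))
```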
Kazuro KIMURA Shinya HIGA Masao OKITA Fumihiko INO
In this paper, we propose an acceleration method for the Held-Karp algorithm, which solves the symmetric traveling salesman problem by dynamic programming. The proposed method achieves acceleration with two techniques. First, we locate data-independent subproblems so that they can be solved in parallel. Second, we reduce the number of subproblems by a meet-in-the-middle (MITM) technique, which computes the optimal path from both the clockwise and counterclockwise directions. We present a theoretical analysis of the impact of MITM on the time and space complexities. In experiments, we compared the proposed method with a previous method running on a single-core CPU. Experimental results show that the proposed method on an 8-core CPU was 9.5-10.5 times faster than the previous method on a single-core CPU. Moreover, the proposed method on a graphics processing unit (GPU) was 30-40 times faster than that on an 8-core CPU. As a side effect, the proposed method reduced memory usage by 48%.
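For reference, a minimal Python sketch of the baseline Held-Karp dynamic program over vertex subsets (O(2^n n^2) time) is shown below. The paper's MITM refinement, which grows paths from both directions and joins them halfway, and its parallelization over independent subproblems are only noted in comments and not implemented here.

```python
from itertools import combinations

def held_karp(dist):
    """dist: symmetric n x n cost matrix; returns the optimal tour length from city 0."""
    n = len(dist)
    # dp[(S, j)] = shortest path that starts at 0, visits exactly the cities in S, ends at j.
    dp = {(1 << j, j): dist[0][j] for j in range(1, n)}
    for size in range(2, n):
        for subset in combinations(range(1, n), size):
            S = sum(1 << j for j in subset)
            for j in subset:
                prev = S ^ (1 << j)
                dp[(S, j)] = min(dp[(prev, k)] + dist[k][j] for k in subset if k != j)
    full = (1 << n) - 2                      # all cities except city 0
    return min(dp[(full, j)] + dist[j][0] for j in range(1, n))
    # MITM idea: build paths of only about n/2 cities from both ends of the tour
    # and combine the two halves, roughly halving the depth of stored subsets.

print(held_karp([[0, 10, 15, 20],
                 [10, 0, 35, 25],
                 [15, 35, 0, 30],
                 [20, 25, 30, 0]]))          # optimal tour length: 80
```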
Kensuke MURAKI Yasuhiro KAWASAKI Yasuharu MIZUTANI Fumihiko INO Kenichi HAGIHARA
In this paper, we present a resource monitoring and selection method for grid applications that require rapid turnaround (for example, within 10 seconds). The novelty of our method is the distributed evaluation of resources, which rapidly selects appropriate idle resources. We integrate our method with a widely used resource management system, the Monitoring and Discovery System 2 (MDS2), and compare our method with the original MDS2 in terms of performance and scalability. Performance is measured using a 64-node cluster of PCs, and scalability is analyzed using a theoretical model together with the measured performance. The experimental results show that our method reduces the resource selection time by 82% compared with the original MDS2. The scalability analysis also indicates that our method can keep the resource selection time within 1 second for up to 500 nodes in local-area network (LAN) environments. In addition, simulation results are presented to estimate the impact of our method in wide-area network (WAN) environments.
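An illustrative Python sketch (not MDS2's API) of the distributed-evaluation idea: each node scores its own idleness locally and reports only that score, so the selector ranks compact summaries instead of gathering full monitoring records. The scoring formula and node attributes are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def local_score(node):
    """Runs on each node: cheaper to compute locally than to ship raw monitoring data."""
    return node["name"], node["free_cpus"] / node["cpus"] - node["load"]

def select_idle(nodes, k):
    """Evaluate all nodes concurrently and return the k most idle ones."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        scores = list(pool.map(local_score, nodes))
    return [name for name, _ in sorted(scores, key=lambda s: -s[1])[:k]]

nodes = [{"name": f"pc{i:02d}", "cpus": 2, "free_cpus": i % 3, "load": 0.1 * i}
         for i in range(8)]
print(select_idle(nodes, k=3))
```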
Yuma MUNEKAWA Fumihiko INO Kenichi HAGIHARA
This paper presents a fast method for accelerating the Smith-Waterman algorithm for biological database search on a cluster of graphics processing units (GPUs). Our method is implemented using the compute unified device architecture (CUDA), which is available on NVIDIA GPUs. Compared with previous methods, our method makes four major contributions. (1) It efficiently uses on-chip shared memory to reduce the amount of data transferred between off-chip video memory and the processing elements in the GPU. (2) It reduces the number of data fetches by applying a data reuse technique to the query and database sequences. (3) A pipelined method overlaps GPU execution with database access. (4) Finally, a master/worker paradigm accelerates hundreds of database searches on a cluster system. In experiments, the peak performance on a GeForce GTX 280 card reaches 8.32 giga cell updates per second (GCUPS). We also find that our method reduces the number of data fetches to 1/140, achieving approximately three times higher performance than a previous CUDA-based method. Our 32-node cluster version is approximately 28 times faster than the single-GPU version. Furthermore, the effective performance reaches 75.6 giga instructions per second (GIPS) using 32 GeForce 8800 GTX cards.
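A minimal Python sketch of the Smith-Waterman recurrence that such GPU kernels accelerate. The linear gap penalty and simple match/mismatch scores are assumptions for brevity; the paper's CUDA implementation additionally tiles the matrix through shared memory and reuses query and database data across threads.

```python
def smith_waterman(query, subject, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between two sequences."""
    rows, cols = len(query) + 1, len(subject) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if query[i - 1] == subject[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align query[i-1] with subject[j-1]
                          H[i - 1][j] + gap,     # gap in the subject
                          H[i][j - 1] + gap)     # gap in the query
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))  # best local score: 12
```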