1-6hit |
Hidetsugu IRIE Takefumi MIYOSHI Goki HONJO Kei HIRAKI Tsutomu YOSHINAGA
One of the significant issues of processor architecture is to overcome memory latency. Prefetching can greatly improve cache performance, but it has the drawback of cache pollution, unless its aggressiveness is properly set. Several techniques that have been proposed for prefetcher throttling use accuracy as a metric, but their robustness were not sufficient because of the variations in programs' working set sizes and cache capacities. In this study, we revisit prefetcher throttling from the viewpoint of data lifetime. Exploiting the characteristics of cache line reuse, we propose Cache-Convection-Control-based Prefetch Optimization Plus (CCCPO+), which enhances the feedback algorithm of our previous CCCPO. Evaluation results showed that this novel approach achieved a 30% improvement over no prefetching in the geometric mean of the SPEC CPU 2006 benchmark suite with 256 KB LLC, 1.8% over the latest prefetcher throttling, and 0.5% over our previous CCCPO. Moreover, it showed superior stability compared to related works, while lowering the hardware cost.
Junichi OHMURA Takefumi MIYOSHI Hidetsugu IRIE Tsutomu YOSHINAGA
In this paper, we propose an approach to obtaining enhanced performance of the Linpack benchmark on a GPU-accelerated PC cluster connected via relatively slow inter-node connections. For one node with a quad-core Intel Xeon W3520 processor and a NVIDIA Tesla C1060 GPU card, we implement a CPU–GPU parallel double-precision general matrix–matrix multiplication (dgemm) operation, and achieve a performance improvement of 34% compared with the GPU-only case and 64% compared with the CPU-only case. For an entire 16-node cluster, each node of which is the same as the above and is connected with two gigabit Ethernet links, we use a computation-communication overlap scheme with GPU acceleration for the Linpack benchmark, and achieve a performance improvement of 28% compared with the GPU-accelerated high-performance Linpack benchmark (HPL) without overlapping. Our overlap GPU acceleration solution uses overlaps in which the main inter-node communication and data transfer to the GPU device memory are overlapped with the main computation task on the CPU cores. These overlaps use multi-core processors, which almost all of today's high-performance computers use. In particular, as well as using a CPU core for communication tasks, we also simultaneously use other CPU cores and the GPU for computation tasks. In order to enable overlap between inter-node communication and computation tasks, we eliminate their close dependence by breaking the main computation task into smaller tasks and rescheduling. Based on a scheme in which part of the CPU computation power is simultaneously used for tasks other than computation tasks, we experimentally find the optimal computation ratio for CPUs; this ratio differs from the case of parallel dgemm operation of one node.
Yasin OGE Masato YOSHIMI Takefumi MIYOSHI Hideyuki KAWASHIMA Hidetsugu IRIE Tsutomu YOSHINAGA
In this paper, we propose Configurable Query Processing Hardware (CQPH), an FPGA-based accelerator for continuous query processing over data streams. CQPH is a highly optimized and minimal-overhead execution engine designed to deliver real-time response for high-volume data streams. Unlike most of the other FPGA-based approaches, CQPH provides on-the-fly configurability for multiple queries with its own dynamic configuration mechanism. With a dedicated query compiler, SQL-like queries can be easily configured into CQPH at run time. CQPH supports continuous queries including selection, group-by operation and sliding-window aggregation with a large number of overlapping sliding windows. As a proof of concept, a prototype of CQPH is implemented on an FPGA platform for a case study. Evaluation results indicate that a given query can be configured within just a few microseconds, and the prototype implementation of CQPH can process over 150 million tuples per second with a latency of less than a microsecond. Results also indicate that CQPH provides linear scalability to increase its flexibility (i.e., on-the-fly configurability) without sacrificing performance (i.e., maximum allowable clock speed).
Yasin OGE Takefumi MIYOSHI Hideyuki KAWASHIMA Tsutomu YOSHINAGA
A novel design is proposed to implement highly parallel stream join operators on a field-programmable gate array (FPGA), by examining handshake join algorithm for hardware implementation. The proposed design is evaluated in terms of the hardware resource usage, the maximum clock frequency, and the performance. Experimental results indicate that the proposed implementation can handle considerably high input rates, especially at low match rates. Results of simulation conducted to optimize size of buffers included in join and merge units give a new intuition regarding static and adaptive buffer tuning in handshake join.
Takefumi MIYOSHI Nobuhiko SUGINO
For a coarse grain dynamic reconfigurable processing unit cooperating with a general purpose processor, a context selection method, which can reduce total execution cycles of a given program, is proposed. The method evaluates context candidates from a given program, in terms of reduction in cycles by exploiting parallel and pipeline execution of the reconfigurable processor. According to this evaluation measure, the method selects appropriate contexts for the dynamic reconfigurable processing unit. The proposed method is implemented on the framework of COINS project. For several example programs, the generated codes are evaluated by a software simulator in terms of execution cycles, and these results prove the effectiveness of the proposed method.
Takefumi MIYOSHI Nobuhiko SUGINO
A novel unified phase compiler framework for embedded VLIWs and DSPs is shown. In this compiler, a given program is represented in 3-D representation space, which enables quantitatively estimating required resources and elapsed time. Transformation of a 3-D representation graph that corresponds to a code optimization method for a specific processor architecture is also proposed. The proposal compiler and the code optimization methods are compared with an ordinary compiler in terms of their generated codes. The results demonstrate their effectiveness.