KuanChao CHU Satoshi YAMAZAKI Hideki NAKAYAMA
This work focuses on training dataset enhancement of informative relational triplets for Scene Graph Generation (SGG). Due to the lack of effective supervision, current SGG models perform poorly on informative relational triplets with inadequate training samples. Therefore, we propose two novel training dataset enhancement modules: Feature Space Triplet Augmentation (FSTA) and Soft Transfer. FSTA leverages a feature generator trained to generate representations of an object in relational triplets. The biased-prediction-based sampling in FSTA efficiently augments artificial triplets, focusing on the challenging ones. In addition, we introduce Soft Transfer, which assigns soft predicate labels to general relational triplets to provide more effective supervision for informative predicate classes. Experimental results show that integrating FSTA and Soft Transfer achieves high levels of both Recall and mean Recall on the Visual Genome dataset. The mean of Recall and mean Recall is the highest among all existing model-agnostic methods.
Scalable networking for scientific research data transfer is a vital factor in the progress of data-intensive research, such as collaborative research on black hole observation. In this paper, investigations of the nature of practical research traffic allow us to introduce optical flow switching (OFS) and content delivery network (CDN) technologies into a wide area network (WAN) to realize highly scalable networking. To measure the scalability of networks, energy consumption in the WAN is evaluated by considering practical networking equipment as well as reasonable assumptions on scientific research data transfer networks. In this study, we explore the energy consumption performance of diverse Japan and US topologies and reveal that the energy consumption of a routing and wavelength assignment algorithm in an OFS scheduler becomes the major hurdle when the number of nodes is high, for example, as high as that of the United States of America layer 1 topology. To provide computational scalability of a network dimensioning algorithm for the CDN-based WAN, a simple heuristic algorithm for a surrogate location problem is proposed and compared with an optimal algorithm. This paper provides intuitions and design rules for highly scalable research data transfer networks, and thus it can accelerate technology advancements for the big-science problems now being encountered.
Eun-Sung JUNG Si LIU Rajkumar KETTIMUTHU Sungwook CHUNG
The scale of scientific data generated by experimental facilities and simulations in high-performance computing facilities has been proliferating with the emergence of IoT-based big data. In many cases, this data must be transmitted rapidly and reliably to remote facilities for storage, analysis, or sharing in Internet of Things (IoT) applications. Simultaneously, IoT data can be verified using a checksum after the data has been written to the disk at the destination to ensure its integrity. However, this end-to-end integrity verification inevitably creates overheads (extra disk I/O and more computation), so the overall data transfer time increases. In this article, we evaluate strategies to maximize the overlap between data transfer and checksum computation for astronomical observation data. Specifically, we examine file-level and block-level (with various block sizes) pipelining to overlap data transfer and checksum computation. We analyze these pipelining approaches in the context of GridFTP, a widely used protocol for scientific data transfers. Theoretical analysis and experiments are conducted to evaluate our methods. The results show that block-level pipelining is effective in maximizing this overlap, and can improve the overall data transfer time with end-to-end integrity verification by up to 70% compared to the sequential execution of transfer and checksum, and by up to 60% compared to file-level pipelining.
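The idea of block-level pipelining can be illustrated with a minimal sketch. It is not the GridFTP implementation evaluated in the abstract: the function name `transfer_with_pipelined_checksum`, the in-memory "transfer", the queue depth, and the choice of SHA-256 are all assumptions made for illustration. One thread moves blocks while a second thread hashes each block as soon as it arrives, so checksum work overlaps the transfer instead of following it.

```python
import hashlib
import queue
import threading


def transfer_with_pipelined_checksum(data: bytes, block_size: int = 64):
    """Copy `data` block by block while a second thread hashes each block.

    Hypothetical helper: returns the received bytes and the hex digest
    computed on the receiving side.
    """
    # Bounded queue: the transfer stage cannot run arbitrarily far ahead.
    blocks: queue.Queue = queue.Queue(maxsize=8)
    received = bytearray()
    digest = hashlib.sha256()

    def checksum_worker() -> None:
        # Hash blocks as they arrive instead of re-reading the whole file later.
        while True:
            block = blocks.get()
            if block is None:  # sentinel: the transfer has finished
                break
            digest.update(block)

    worker = threading.Thread(target=checksum_worker)
    worker.start()
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        received.extend(block)  # stands in for the network/disk write
        blocks.put(block)       # hand the block to the checksum stage
    blocks.put(None)
    worker.join()
    return bytes(received), digest.hexdigest()


payload = b"astronomical observation data" * 100
copied, pipelined_digest = transfer_with_pipelined_checksum(payload, block_size=64)
```

Because the hash is updated incrementally, the pipelined digest equals the digest of the whole payload computed sequentially; only the timing of the work changes.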
Takashi G. SATO Yoshifumi SHIRAKI Takehiro MORIYA
The purpose of this study was to examine an efficient interval encoding method with a slow-frame-rate image sensor, and to show that the encoding can capture heart rates from multiple persons. Visible light communication (VLC) with an image sensor is a powerful method for obtaining data from sensors distributed in the field together with their positional information. However, the capturing speed of the camera is usually not fast enough to transfer interval information such as the heart rate. To overcome this problem, we have developed an event timing (ET) encoding method. In ET encoding, sensor units detect the occurrence of a heartbeat event and send its timing through a sequence of flashing lights. The first flash signal provides the rough timing and subsequent signals give the precise timing. Our theoretical analysis shows that in most cases the ET encoding method performs better than simple encoding methods. Heart rate transfer from multiple persons was examined as an example of the method's capabilities. In the experimental setup, the developed system successfully monitored heart rates from several participants.
Dafei HUANG Changqing XUN Nan WU Mei WEN Chunyuan ZHANG Xing CAI Qianming YANG
Aiming to ease the parallel programming for heterogeneous architectures, we propose and implement a high-level OpenCL runtime that conceptually merges multiple heterogeneous hardware devices into one virtual heterogeneous compute device (VHCD). Moreover, automated workload distribution among the devices is based on offline profiling, together with new programming directives that define the device-independent data access range per work-group. Therefore, an OpenCL program originally written for a single compute device can, after inserting a small number of programming directives, run efficiently on a platform consisting of heterogeneous compute devices. Performance is ensured by introducing the technique of virtual cache management, which minimizes the amount of host-device data transfer. Our new OpenCL runtime is evaluated by a diverse set of OpenCL benchmarks, demonstrating good performance on various configurations of a heterogeneous system.
Yoichi TOMIOKA Ryota TAKASU Takashi AOKI Eiichi HOSOYA Hitoshi KITAZAWA
Hardware acceleration is an essential technique for extracting and tracking moving objects in real time. It is desirable to design tracking algorithms such that they are applicable for parallel computations on hardware. Exclusive block matching methods are designed for hardware implementation, and they can realize detailed motion extraction as well as robust moving object tracking. In this study, we develop tracking hardware based on an exclusive block matching method on FPGA. This tracking hardware is based on a two-dimensional systolic array architecture, and can realize robust moving object extraction and tracking at more than 100 fps for QVGA images using the high parallelism of an exclusive block matching method, synchronous shift data transfer, and special circuits to accelerate searching the exclusive correspondence of blocks.
A multiple-valued data transfer scheme using X-net is proposed to realize a compact bit-serial reconfigurable VLSI (BS-RVLSI). In the multiple-valued data transfer scheme using X-net, two binary data can be transferred from two adjacent cells to one common adjacent cell simultaneously at each “X” intersection. One cell composed of a logic block and a switch block is connected to four adjacent cross points by four one-bit switches so that the complexity of the switch block is reduced to 50% in comparison with the cell of a BS-RVLSI using an eight nearest-neighbor mesh network (8-NNM). In the logic block, threshold logic circuits are used to perform threshold operations, and then their binary dual-rail voltage outputs enter a binary logic module which can be programmed to realize an arbitrary two-variable binary function or a bit-serial adder. As a result, the configuration memory count and transistor count of the proposed multiple-valued cell are reduced to 34% and 58%, respectively, in comparison with those of an equivalent CMOS cell. Moreover, its power consumption for an arbitrary 2-variable binary function becomes 67% at 800 MHz under the condition of the same delay time.
Yoshitaka HIRAMATSU Hasitha Muthumala WAIDYASOORIYA Masanori HARIYAMA Toru NOJIRI Kunio UCHIYAMA Michitaka KAMEYAMA
The large data-transfer time among different cores is a big problem in heterogeneous multi-core processors. This paper presents a method to accelerate the data transfers exploiting data-transfer-units together with complex memory allocation. We used block matching, which is very common in image processing, to evaluate our technique. The proposed method reduces the data-transfer time by more than 42% compared to the earlier works that use CPU-based data transfers. Moreover, the total processing time is only 15 ms for a VGA image with 16×16 pixel blocks.
Naoya ONIZAWA Takahiro HANYU Vincent C. GAUDET
This paper presents a high-throughput bit-serial low-density parity-check (LDPC) decoder that uses an asynchronous interleaver. Since consecutive log-likelihood message values on the interleaver are similar, node computations are continuously performed by using the most recently arrived messages without significantly affecting bit-error rate (BER) performance. In the asynchronous interleaver, each message's arrival rate is based on the delay due to the wire length, so that the decoding throughput is not restricted by the worst-case latency, which results in a higher average rate of computation. Moreover, the use of a multiple-valued data representation makes it possible to multiplex control signals and data from mutual nodes, thus minimizing the number of handshaking steps in the asynchronous interleaver and eliminating the clock signal entirely. As a result, the decoding throughput becomes 1.3 times faster than that of a bit-serial synchronous decoder under a 90 nm CMOS technology, at a comparable BER.
Raghuvel S. BHUVANESWARAN Yoshiaki KATAYAMA Naohisa TAKAHASHI
A data grid consists of computing and storage resources scattered across the grid network. Large data sets are replicated at more than one site for better availability to the other nodes in the grid. Downloading a dataset from these replicated locations poses practical difficulties, which motivates a co-allocated download framework that enables parallel download of replicated data from multiple servers. In this paper, we propose a dynamic co-allocation scheme for parallel data transfer in a grid environment, which copes with highly inconsistent network and server performance. The model comprises co-allocator, monitor, and control mechanisms. The scheme initially obtains the bandwidth parameter from the monitor module to fix the partition size, and the data transfer tasks are allocated onto the servers in duplication. In this way, the data transfer is neither interrupted nor paralyzed, even when a network link is broken or a server crashes. We built our framework on the Globus toolkit, making use of its grid information and GridFTP services. We compared our scheme with existing schemes, and the results show a notable improvement in the overall completion time of data transfer.
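As a rough illustration of fixing partition sizes from monitored bandwidth, the sketch below splits a file into byte ranges proportional to each replica server's measured bandwidth. It is an assumption-laden simplification, not the scheme from the abstract: the function name `partition_by_bandwidth` and the server names are hypothetical, and the duplicated task allocation and failure handling are not modeled.

```python
from typing import Dict, Tuple


def partition_by_bandwidth(
    total_bytes: int, bandwidths: Dict[str, float]
) -> Dict[str, Tuple[int, int]]:
    """Return {server: (start, end)} byte ranges sized in proportion to bandwidth.

    Hypothetical helper: `bandwidths` stands in for values reported by a
    monitor module; `end` is exclusive.
    """
    total_bw = sum(bandwidths.values())
    servers = list(bandwidths)
    ranges: Dict[str, Tuple[int, int]] = {}
    start = 0
    for index, server in enumerate(servers):
        if index == len(servers) - 1:
            end = total_bytes  # the last server absorbs any rounding remainder
        else:
            end = start + int(total_bytes * bandwidths[server] / total_bw)
        ranges[server] = (start, end)
        start = end
    return ranges


# Three replica sites with made-up bandwidth measurements.
plan = partition_by_bandwidth(1000, {"site_a": 50.0, "site_b": 30.0, "site_c": 20.0})
```

In a dynamic scheme the partitioning would be recomputed as new measurements arrive; this sketch shows only a single allocation round.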
Tomoaki TSUGAWA Go HASEGAWA Masayuki MURATA
In the present paper, ImTCP-bg, a new background TCP data transfer mechanism that uses an inline network measurement technique, is proposed. ImTCP-bg sets the upper limit of the congestion window size of the sender TCP based on the results of the inline network measurement, which measures the available bandwidth of the network path between the sender and receiver hosts. ImTCP-bg can provide background data transfer without affecting the foreground traffic, whereas previous methods cannot avoid network congestion. ImTCP-bg also employs an enhanced RTT-based mechanism so that ImTCP-bg can detect and resolve network congestion, even when reliable measurement results cannot be obtained. The performance of ImTCP-bg is investigated through simulations, and the effectiveness of ImTCP-bg in terms of the degree of interference with foreground traffic and the link bandwidth utilization is also investigated.
Kultida ROJVIBOONCHAI Toru OSUGA Hitoshi AIDA
We have proposed Rate-based Multi-path Transmission Control Protocol (R-M/TCP) for improving the reliability and performance of data transfer over the Internet by using multiple paths. Congestion control in R-M/TCP is performed in a rate-based and loss-avoidance manner. It attempts to estimate the available bandwidth and the queue length of the used routes in order to fully utilize the bandwidth resources. However, it has been reported that when the characteristics of the used routes, i.e., available bandwidth and delay, differ greatly, R-M/TCP cannot achieve the desired throughput from the routes. This is because R-M/TCP originally transmits data packets in a round-robin manner through the routes. In this paper, therefore, we propose R-M/TCP using a Packet Scheduling Algorithm (PSA). Instead of using round-robin transmission, R-M/TCP utilizes PSA, which accounts for the time-varying bandwidth and delay of each path so that the number of data packets arriving out of order at the receiver can be minimized and the desired throughput can be achieved. Quantitative simulations are conducted to show the effectiveness of R-M/TCP using PSA.
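The contrast between round-robin transmission and delay-aware scheduling can be sketched as follows. This is a generic earliest-estimated-arrival scheduler under fixed, known bandwidth and delay, not the PSA of the abstract; the function names and the two example paths are assumptions for illustration.

```python
from typing import List, Tuple

Path = Tuple[float, float]  # (bandwidth in bytes/s, one-way delay in s)


def arrival_times(assignment: List[int], packet_size: float, paths: List[Path]) -> List[float]:
    """Arrival time of each packet, given the path index chosen for it."""
    free_at = [0.0] * len(paths)  # time at which each path finishes sending
    arrivals = []
    for path_index in assignment:
        bandwidth, delay = paths[path_index]
        free_at[path_index] += packet_size / bandwidth
        arrivals.append(free_at[path_index] + delay)
    return arrivals


def schedule_earliest_arrival(num_packets: int, packet_size: float, paths: List[Path]) -> List[int]:
    """Send each packet on the path where it is estimated to arrive soonest."""
    free_at = [0.0] * len(paths)
    assignment = []
    for _ in range(num_packets):
        best = min(
            range(len(paths)),
            key=lambda i: free_at[i] + packet_size / paths[i][0] + paths[i][1],
        )
        free_at[best] += packet_size / paths[best][0]
        assignment.append(best)
    return assignment


def count_out_of_order(arrivals: List[float]) -> int:
    """Count packets that arrive earlier than some packet sent before them."""
    latest, count = float("-inf"), 0
    for t in arrivals:
        if t < latest:
            count += 1
        latest = max(latest, t)
    return count


paths = [(1000.0, 0.01), (500.0, 0.05)]  # a fast path and a slow path (made-up values)
round_robin = arrival_times([i % 2 for i in range(6)], 100.0, paths)
greedy = arrival_times(schedule_earliest_arrival(6, 100.0, paths), 100.0, paths)
```

With these example paths, round-robin delivers some packets out of order, whereas the earliest-arrival choice yields non-decreasing arrival times; real paths have time-varying characteristics, which this sketch ignores.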
Conventional delay-insensitive (DI) data encodings require 2N+1 wires for transferring N-bit data. To reduce the complexity and power dissipation of wires in designing a large-scale chip, a DI data transfer mechanism based on current-mode multiple-valued logic (CMMVL), where N-bit data transfer can be performed with only N+1 wires, is proposed. The effectiveness of the proposed data transfer mechanism is validated by comparisons with conventional data transfer mechanisms using dual-rail and 1-of-4 encodings through simulation in a 0.25-µm CMOS technology. Simulation results with wire lengths of 4 mm or larger demonstrate that the CMMVL scheme significantly reduces the delay-power product relative to the dual-rail encoding at data rates of 5 MHz or more and relative to the 1-of-4 encoding at data rates of 18 MHz or more.
Akira MOCHIZUKI Takashi TAKEUCHI Takahiro HANYU
A new common-bus architecture with temporal and spatial parallel access capabilities under a wire-resource constraint is proposed to transfer vast quantities of data between modules inside a VLSI chip. Since bus controllers are distributed into modules, the proposed bus architecture can directly transfer data from one module to another without any central bus control unit such as a Direct Memory Access (DMA) controller, which reduces the communication steps for data transfer between modules. Moreover, when a start address and the number of block data in both source and destination modules are determined at the first step of a data-transfer scheme, no additional address setting for the data transfer is required in the rest of the scheme, which allows us to use all the wire resources as only the "data bus." Therefore, the bus function is dynamically programmed, which results in high throughput of bus communication. For example, in the case of a 64-line common bus, it is evaluated that the maximum data throughput in the proposed architecture with dynamic bus-function programming is four times higher than that in the conventional DMA bus architecture with fixed 32-bit-address/32-bit-data buses.
Hiroyasu OBATA Kenji ISHIDA Junichi FUNASAKA Kitsutaro AMANO
Asymmetric networks, which provide asymmetric bandwidth or delay for upstream and downstream transfer, have recently gained much attention since they support popular applications such as the World Wide Web (WWW). HTTP (Hypertext Transfer Protocol) is the basis of most WWW services, so evaluating the performance of HTTP on asymmetric networks is increasingly important, particularly on real-world networks. However, the performance of HTTP on asymmetric networks composed of satellite and terrestrial links has not been sufficiently evaluated. This paper proposes new formulas to evaluate the performance of both HTTP1.0 and HTTP1.1 on asymmetric networks. Using these formulas, we calculate the time taken to transfer web data by HTTP1.0/1.1. The calculation results are compared to the results of an existing theoretical formula and experimental results gained from a system that combines a VSAT (Very Small Aperture Terminal) satellite communication system for satellite links (downstream) and the Internet for terrestrial links (upstream). The comparison shows that the proposed formulas yield more accurate results (compared to the measured values) than the existing formula. Furthermore, this paper proposes an evaluation formula for pipelined HTTP1.1, and shows that the values output by the proposed formula agree with those obtained by experiments (on the VSAT system) and simulations.
Teruji SHIROSHITA Tetsuo SANO Osamu TAKAHASHI Nagatsugu YAMANOUCHI
This paper evaluates the performance of a reliable multicast protocol for bulk-data transfer over unreliable networks via IP multicast. Bulk-data type reliable multicast appears promising for commercial publishing and large-scale data replication. The proposed reliable multicast transport protocol (RMTP) provides high performance due to the use of IP multicast while also providing confirmed and error-free transfer by end-to-end controls. The protocol includes a multi-round selective repeat scheme dedicated to bulk-data multicast applications. This paper examines the multicast retransmission procedures in RMTP through analysis and tests on an implemented system and clarifies the basic performance behavior of the protocol. Evaluations are conducted with regard to retransmission redundancy, transfer time, and packet processing load under various error conditions and numbers of receivers. Against the response concentration problem seen in end-to-end communication, the backoff time algorithm is applied to the protocol; the limits it places on system scalability are clarified.
Teruyuki HASEGAWA Toru HASEGAWA Toshihiko KATO Kenji SUZUKI
Most current real-time video retrieval systems use video transfer protocols in which servers simply transmit video packets at the same rate as clients play them. If any packets are corrupted during transmission, they are lost and cannot be recovered by retransmission. In video retrieval systems, however, the video data are stored in servers and clients can prefetch them prior to playing. So, it might be possible for video retrieval systems to have corrupted video packets retransmitted before the play-out deadline. But applying existing reliable protocols causes a problem: if a packet does not arrive before the deadline due to retransmission, the packets following it will not be delivered to the upper layer even if they have already arrived. In this paper, we discuss how to apply reliable protocols to real-time video retrieval systems and propose a new real-time video transfer protocol over ATM networks, which provides video data prefetch, flow control for the video buffer, selective retransmission with a skipping function for video packets that are late for the play-out deadline, and a resynchronization function for the video buffer. We have implemented an experimental system using our protocol and evaluated its performance. The results of the performance evaluation show that the proposed protocol greatly decreases the amount of unplayed video data when transmission errors are inserted in an ATM network.
Tetsuhiko FUJII Akira YAMAMOTO Naoya TAKAHASHI Minoru YOSHIDA
This paper proposes and evaluates a masked data transferring method for write-back controlled disk cache systems employing a fixed-length recording disk drive, enabling data transfer of discontinuous sectors on the same track between the cache and the disk. In write-back controlled disk cache systems, random write requests cause dirty data (write-pending data on a cache) on discontinuous areas of the cache. It is likely that several sectors on the same track become dirty. These dirty sectors must be written onto the disk according to the cache management scheme. In conventional data transferring methods between a disk cache and a disk drive, multiple sectors can be transferred in a single operation when the sectors are adjacent, but discrete sectors must be transferred by individual operations. In these methods, the address of the head sector and the number of sectors to be transferred are given to the transfer unit. For example, when two sectors on the same track are located closely but not adjacently, and data transfer is requested for those two sectors, the transfer operation for the second sector must be prepared after the first transfer has completed and before the second sector arrives under the disk head. However, the time for the head to pass over the intervening sector is often too short for the software overhead of completing the first transfer and preparing the second, which leads to an unwanted extra rotation of the disk. With the masked transferring method proposed in this paper, the microprogram creates a bitmap specifying the target sectors to be transferred and passes it to the data transfer unit, enabling the discontinuous sectors to be transferred without extra rotational latency. The method was evaluated using OLTP workloads. Results show an improvement in random I/O throughput of between 8% and 27%. The masked transferring method is adopted in Hitachi's A-6521 disk subsystems, shipped since December 1993.
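The bitmap idea can be sketched in a few lines. This is a software model for illustration only, not the microprogram or transfer unit described in the abstract; the function names, the 12-sector track, and the list-based "disk" are assumptions. A single pass over the track writes every sector whose mask bit is set, whereas the conventional head-sector-plus-count interface needs one operation per contiguous run.

```python
from typing import Iterable, List


def build_sector_mask(dirty_sectors: Iterable[int], sectors_per_track: int) -> int:
    """Return an integer bitmap with bit s set for every dirty sector s."""
    mask = 0
    for sector in dirty_sectors:
        if not 0 <= sector < sectors_per_track:
            raise ValueError(f"sector {sector} is outside the track")
        mask |= 1 << sector
    return mask


def masked_transfer(mask: int, cache_track: List[bytes], disk_track: List[bytes]) -> List[int]:
    """One pass over the track: write only the sectors selected by the mask."""
    written = []
    for sector in range(len(disk_track)):
        if (mask >> sector) & 1:
            disk_track[sector] = cache_track[sector]
            written.append(sector)
    return written


def count_contiguous_runs(mask: int, sectors_per_track: int) -> int:
    """Operations needed by a (head sector, count) interface: one per run."""
    runs, previous = 0, False
    for sector in range(sectors_per_track):
        current = bool((mask >> sector) & 1)
        if current and not previous:
            runs += 1
        previous = current
    return runs


SECTORS = 12  # made-up track size
cache = [b"new%d" % s for s in range(SECTORS)]
disk = [b"old%d" % s for s in range(SECTORS)]
mask = build_sector_mask([2, 3, 5, 9], SECTORS)
written = masked_transfer(mask, cache, disk)
runs = count_contiguous_runs(mask, SECTORS)
```

Here the four dirty sectors form three separate runs, so the conventional interface would need three operations (with the risk of missing a sector and waiting a full rotation), while the masked transfer handles them in one.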
Hiroshi OHTA Kousuke SAKODA Koichiro ISHIHARA
In a distributed-memory parallel-processing system, the overhead of data transfer among the processors is so large that it is important to reduce the amount of data transfer. We consider the data transfer involved in evaluating an expression consisting of data distributed among the processors. We propose algorithms that assign the operators in the expression to the processors so as to minimize the number or the cost of data transfers, on the condition that the data allocation to the processors is given. The basic algorithm is presented first, followed by some variations.
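One way to make the operator-assignment problem concrete is a bottom-up dynamic program over the expression tree. The sketch below is a standard tree DP under a unit cost per transfer, offered as an illustration and not as the algorithm of the abstract; the tuple-based expression encoding and the name `min_transfers` are assumptions. For each node it computes, for every processor p, the fewest transfers needed if that node's value is produced on p.

```python
import math
from typing import List, Tuple, Union

# A leaf is the id of the processor that holds the datum;
# an operator node is a tuple (operator_name, child, child, ...).
Expr = Union[int, Tuple]


def min_transfers(node: Expr, num_procs: int) -> List[float]:
    """cost[p] = fewest transfers needed to have this node's value on processor p."""
    if isinstance(node, int):
        # Data placement is given: a leaf is free only where it already resides.
        return [0 if p == node else math.inf for p in range(num_procs)]
    _, *children = node
    total: List[float] = [0] * num_procs
    for child in children:
        child_cost = min_transfers(child, num_procs)
        cheapest_anywhere = min(child_cost)
        for p in range(num_procs):
            # Either the child's value is already on p,
            # or it is produced elsewhere and moved with one transfer.
            total[p] += min(child_cost[p], cheapest_anywhere + 1)
    return total


# (a * b) + (c * d) with a, b on processor 0, c on processor 1, d on processor 2.
expression = ("+", ("*", 0, 0), ("*", 1, 2))
costs = min_transfers(expression, num_procs=3)
fewest = min(costs)
```

In this example the data sit on three different processors, so at least two transfers are unavoidable, and the DP reports exactly two. Replacing the constant 1 with a per-pair cost would model non-uniform transfer costs.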