Keyword Search Result

[Keyword] data stream(20hit)

1-20hit
  • Continuous Similarity Search for Dynamic Text Streams

    Yuma TSUCHIDA  Kohei KUBO  Hisashi KOGA  

     
    PAPER-Data Engineering, Web Information Systems

      Pubricized:
    2023/09/21
      Vol:
    E106-D No:12
      Page(s):
    2026-2035

    Similarity search for data streams has attracted much attention for information recommendation. In this context, recent leading works regard the latest W items in a data stream as an evolving set and reduce similarity search for data streams to set similarity search. Whereas they consider standard sets composed of items, this paper uniquely studies similarity search for text streams and treats evolving sets whose elements are texts. Specifically, we formulate a new continuous range search problem named the CTS problem (Continuous similarity search for Text Sets). The task of the CTS problem is to find all the text streams from the database whose similarity to the query becomes larger than a threshold ε. It abstracts a scenario in which a user-based recommendation system searches similar users from social networking services. The CTS is important because it allows both the query and the database to change dynamically. We develop a fast pruning-based algorithm for the CTS. Moreover, we discuss how to speed up it with the inverted index.

  • Exact Algorithm to Solve Continuous Similarity Search for Evolving Queries and Its Variant

    Tomohiro YAMAZAKI  Hisashi KOGA  

     
    PAPER

      Pubricized:
    2022/02/07
      Vol:
    E105-D No:5
      Page(s):
    898-908

    We study the continuous similarity search problem for evolving queries which has recently been formulated. Given a data stream and a database composed of n sets of items, the purpose of this problem is to maintain the top-k most similar sets to the query which evolves over time and consists of the latest W items in the data stream. For this problem, the previous exact algorithm adopts a pruning strategy which, at the present time T, decides the candidates of the top-k most similar sets from past similarity values and computes the similarity values only for them. This paper proposes a new exact algorithm which shortens the execution time by computing the similarity values only for sets whose similarity values at T can change from time T-1. We identify such sets very fast with frequency-based inverted lists (FIL). Moreover, we derive the similarity values at T in O(1) time by updating the previous values computed at time T-1. Experimentally, our exact algorithm runs faster than the previous exact algorithm by one order of magnitude and as fast as the previous approximation algorithm.

  • A P2P Sensor Data Stream Delivery System That Guarantees the Specified Reachability under Churn Situations

    Tomoya KAWAKAMI  Tomoki YOSHIHISA  Yuuichi TERANISHI  

     
    PAPER

      Pubricized:
    2019/02/06
      Vol:
    E102-D No:5
      Page(s):
    932-941

    In this paper, we propose a method to construct a scalable sensor data stream delivery system that guarantees the specified delivery quality of service (i.e., total reachability to destinations), even when delivery server resources (nodes) are in a heterogeneous churn situation. A number of P2P-based methods have been proposed for constructing a scalable and efficient sensor data stream system that accommodates different delivery cycles by distributing communication loads of the nodes. However, no existing method can guarantee delivery quality of service when the nodes on the system have a heterogeneous churn rate. As an extension of existing methods, which assign relay nodes based on the distributed hashing of the time-to-deliver, our method specifies the number of replication nodes, based on the churn rate of each node and on the relevant delivery paths. Through simulations, we confirmed that our proposed method can guarantee the required reachability, while avoiding any increase in unnecessary resource assignment costs.

  • Grouping Methods for Pattern Matching over Probabilistic Data Streams

    Kento SUGIURA  Yoshiharu ISHIKAWA  Yuya SASAKI  

     
    PAPER

      Pubricized:
    2017/01/17
      Vol:
    E100-D No:4
      Page(s):
    718-729

    As the development of sensor and machine learning technologies has progressed, it has become increasingly important to detect patterns from probabilistic data streams. In this paper, we focus on complex event processing based on pattern matching. When we apply pattern matching to probabilistic data streams, numerous matches may be detected at the same time interval because of the uncertainty of data. Although existing methods distinguish between such matches, they may derive inappropriate results when some of the matches correspond to the real-world event that has occurred during the time interval. Thus, we propose two grouping methods for matches. Our methods output groups that indicate the occurrence of complex events during the given time intervals. In this paper, first we describe the definition of groups based on temporal overlap, and propose two grouping algorithms, introducing the notions of complete overlap and single overlap. Then, we propose an efficient approach for calculating the occurrence probabilities of groups by using deterministic finite automata that are generated from the query patterns. Finally, we empirically evaluate the effectiveness of our methods by applying them to real and synthetic datasets.

  • A Sensor Data Stream Delivery Method to Accommodate Heterogeneous Cycles on Cloud

    Tomoya KAWAKAMI  Yoshimasa ISHI  Tomoki YOSHIHISA  Yuuichi TERANISHI  

     
    PAPER-Network

      Vol:
    E99-B No:6
      Page(s):
    1331-1340

    In the future Internet of Things/M2M network, enormous amounts of data generated from sensors must be processed and utilized by cloud applications. In recent years, sensor data stream delivery, which collects and sends sensor data periodically, has been attracting great attention. As for sensor data stream delivery, the receivers have different delivery cycle requirements depending on the applications or situations. In this paper, we propose a sensor data stream delivery method to accommodate heterogeneous cycles on the cloud. The proposed method uses distributed hashing to determine relay nodes on the cloud and construct delivery paths autonomously. We evaluate the effectiveness of the proposed method in simulations. The simulation results show that the proposed method halves the maximum load of nodes compared to the baseline methods and achieves high load balancing.

  • Design and Evaluation of a Configurable Query Processing Hardware for Data Streams

    Yasin OGE  Masato YOSHIMI  Takefumi MIYOSHI  Hideyuki KAWASHIMA  Hidetsugu IRIE  Tsutomu YOSHINAGA  

     
    PAPER-Computer System

      Pubricized:
    2015/09/14
      Vol:
    E98-D No:12
      Page(s):
    2207-2217

    In this paper, we propose Configurable Query Processing Hardware (CQPH), an FPGA-based accelerator for continuous query processing over data streams. CQPH is a highly optimized and minimal-overhead execution engine designed to deliver real-time response for high-volume data streams. Unlike most of the other FPGA-based approaches, CQPH provides on-the-fly configurability for multiple queries with its own dynamic configuration mechanism. With a dedicated query compiler, SQL-like queries can be easily configured into CQPH at run time. CQPH supports continuous queries including selection, group-by operation and sliding-window aggregation with a large number of overlapping sliding windows. As a proof of concept, a prototype of CQPH is implemented on an FPGA platform for a case study. Evaluation results indicate that a given query can be configured within just a few microseconds, and the prototype implementation of CQPH can process over 150 million tuples per second with a latency of less than a microsecond. Results also indicate that CQPH provides linear scalability to increase its flexibility (i.e., on-the-fly configurability) without sacrificing performance (i.e., maximum allowable clock speed).

  • An Online Framework for Flow Round Trip Time Measurement

    Xinjie GUAN  Xili WAN  Ryoichi KAWAHARA  Hiroshi SAITO  

     
    PAPER-Network

      Vol:
    E97-B No:10
      Page(s):
    2145-2156

    With the advent of high speed links, online flow measurement for, e.g., flow round trip time (RTT), has become difficult due to the enormous demands placed on computational resources. Most existing measurement methods are designed to count the numbers of flows or sizes of flows, but we address the flow RTT measurement, which is an important QoS metric for network management and cannot be measured with existing measurement methods. We first adapt a standard Bloom Filter (BF) for the flow RTT distribution estimation. However, due to the existence of multipath routing and Syn flooding attacks, the standard BF does not perform well. We further design the double-deletion bloom filter (DDBF) scheme, which alleviates potential hash collisions of the standard BF by explicitly deleting used records and implicitly deleting out-of-date records. Because of these double deletion operations, the DDBF accurately estimates the RTT distribution of TCP flows with limited memory space, even with the appearance of multipath routing and Syn flooding attacks. Theoretical analysis indicates that the DDBF scheme achieves a higher accuracy with a constant and smaller amount of memory compared with the standard BF. In addition, we validate our scheme using real traces and demonstrate significant memory-savings without degrading accuracy.

  • Finding Interesting Sequential Patterns in Sequence Data Streams via a Time-Interval Weighting Approach

    Joong Hyuk CHANG  Nam Hun PARK  

     
    PAPER-Artificial Intelligence, Data Mining

      Vol:
    E96-D No:8
      Page(s):
    1734-1744

    The mining problem over data streams has recently been attracting considerable attention thanks to the usefulness of data mining in various application fields of information science, and sequence data streams are so common in daily life. Therefore, a study on mining sequential patterns over sequence data streams can give valuable results for wide use in various application fields. This paper proposes a new framework for mining novel interesting sequential patterns over a sequence data stream and a mining method based on the framework. Assuming that a sequence with small time-intervals between its data elements is more valuable than others with large time-intervals, the novel interesting sequential pattern is defined and found by analyzing the time-intervals of data elements in a sequence as well as their orders. The proposed framework is capable of obtaining more interesting sequential patterns over sequence data streams whose data elements are highly correlated in terms of generation time.

  • Design and Implementation of a Handshake Join Architecture on FPGA

    Yasin OGE  Takefumi MIYOSHI  Hideyuki KAWASHIMA  Tsutomu YOSHINAGA  

     
    PAPER-Computer Architecture

      Vol:
    E95-D No:12
      Page(s):
    2919-2927

    A novel design is proposed to implement highly parallel stream join operators on a field-programmable gate array (FPGA), by examining handshake join algorithm for hardware implementation. The proposed design is evaluated in terms of the hardware resource usage, the maximum clock frequency, and the performance. Experimental results indicate that the proposed implementation can handle considerably high input rates, especially at low match rates. Results of simulation conducted to optimize size of buffers included in join and merge units give a new intuition regarding static and adaptive buffer tuning in handshake join.

  • PrefixSummary: A Directory Structure for Selective Probing on Wireless Stream of Heterogeneous XML Data

    Chang-Sup PARK  Jun Pyo PARK  Yon Dohn CHUNG  

     
    PAPER-Data Engineering, Web Information Systems

      Vol:
    E95-D No:5
      Page(s):
    1427-1435

    Wireless broadcasting of heterogeneous XML data has become popular in many applications, where energy-efficient processing of user queries at the mobile client is a critical issue. This paper proposes a new index structure for wireless stream of heterogeneous XML data to enhance tuning time performance in processing path queries on the stream. The index called PrefixSummary stores for each location path in the XML data the address of a bucket in the stream which contains an XML node satisfying the location path and appearing first in the stream. We present algorithms to generate broadcast stream with the proposed index and to process a path query on the stream efficiently by exploiting the index. We also suggest a replication scheme of PrefixSummary within a broadcast cycle to reduce latency in query processing. By analysis and experiment we show the proposed PrefixSummary approach can reduce tuning time for processing path queries significantly while it can also achieve reasonable access time performance by means of replication of the index over the broadcast stream.

  • Economical and Fault-Tolerant Load Balancing in Distributed Stream Processing Systems

    Fuyuan XIAO  Teruaki KITASUKA  Masayoshi ARITSUGI  

     
    PAPER-Data Engineering, Web Information Systems

      Vol:
    E95-D No:4
      Page(s):
    1062-1073

    We present an economical and fault-tolerant load balancing strategy (EFTLBS) based on an operator replication mechanism and a load shedding method, that fully utilizes the network resources to realize continuous and highly-available data stream processing without dynamic operator migration over wide area networks. In this paper, we first design an economical operator distribution (EOD) plan based on a bin-packing model under the constraints of each stream bandwidth as well as each server's CPU capacity. Next, we devise super-operator (SO) that load balances multi-degree operator replicas. Moreover, for improving the fault-tolerance of the system, we color the SOs based on a coloring bin-packing (CBP) model that assigns peer operator replicas to different servers. To minimize the effects of input rate bursts upon the system, we take advantage of a load shedding method while keeping the QoS guarantees made by the system based on the SO scheme and the CBP model. Finally, we substantiate the utility of our work through experiments on ns-3.

  • Adaptive Online Prediction Using Weighted Windows

    Shin-ichi YOSHIDA  Kohei HATANO  Eiji TAKIMOTO  Masayuki TAKEDA  

     
    PAPER

      Vol:
    E94-D No:10
      Page(s):
    1917-1923

    We propose online prediction algorithms for data streams whose characteristics might change over time. Our algorithms are applications of online learning with experts. In particular, our algorithms combine base predictors over sliding windows with different length as experts. As a result, our algorithms are guaranteed to be competitive with the base predictor with the best fixed-length sliding window in hindsight.

  • Temporal Coalescing on Window Extents over Data Streams

    Mohammed AL-KATEB  Sasi Sekhar KUNTA  Byung Suk LEE  

     
    PAPER

      Vol:
    E94-D No:3
      Page(s):
    489-503

    This paper focuses on the coalescing operator applied to the processing of continuous queries with temporal functions and predicates over windowed data streams. Coalescing is a key operation enabling the evaluation of interval predicates and functions on temporal tuples. Applying this operation for temporal query processing on windowed streams brings the challenge of coalescing tuples in a window extent each time the window slides over the data stream. This coalescing becomes even more involving when some tuples arrive out of order. This paper distinguishes between eager coalescing and lazy coalescing, the two known coalescing schemes. The former coalesces tuples during window extent update and the latter does it during window extent scan. With these two schemes, the paper first presents algorithms for updating a window extent for both tuple-based and time-based windows. Then, the problem of optimally selecting between eager and lazy coalescing for concurrent queries is formulated as a 0-1 integer programming problem. Through extensive performance study, the two schemes are compared and the optimal selection is demonstrated.

  • Efficient Window Processing over Disordered Data Streams

    Hyeon-Gyu KIM  Woo-Lam KANG  Myoung-Ho KIM  

     
    LETTER-Database

      Vol:
    E93-D No:3
      Page(s):
    635-638

    Bursty and out-of-order tuple arrivals complicate the process of determining contents and boundaries of sliding windows. To process windows over such streams efficiently, we need to address two issues regarding fast tuple insertion and disorder control. In this paper, we focus on these issues to process sliding windows efficiently over disordered data streams.

  • Efficient Predicate Matching over Continuous Data Streams

    Hyeon-Gyu KIM  Woo-Lam KANG  Yoon-Joon LEE  Myoung-Ho KIM  

     
    LETTER-Database

      Vol:
    E92-D No:9
      Page(s):
    1787-1790

    In this paper, we propose a predicate indexing method which handles equality and inequality tests separately. Our method uses a hash table for the equality test and a balanced binary search tree for the inequality test. Such a separate structure reduces a height of the search tree and the number of comparisons per tree node, as well as the cost for tree rebalancing. We compared our method with the IBS-tree which is one of the popular indexing methods suitable for data stream processing. Our experimental results show that the proposed method provides better insertion and search performances than the IBS-tree.

  • An Efficient Algorithm for Sliding Window-Based Weighted Frequent Pattern Mining over Data Streams

    Chowdhury Farhan AHMED  Syed Khairuzzaman TANBEER  Byeong-Soo JEONG  Young-Koo LEE  

     
    PAPER

      Vol:
    E92-D No:7
      Page(s):
    1369-1381

    Traditional frequent pattern mining algorithms do not consider different semantic significances (weights) of the items. By considering different weights of the items, weighted frequent pattern (WFP) mining becomes an important research issue in data mining and knowledge discovery area. However, the existing state-of-the-art WFP mining algorithms consider all the data from the very beginning of a database to discover the resultant weighted frequent patterns. Therefore, their approaches may not be suitable for the large-scale data environment such as data streams where the volume of data is huge and unbounded. Moreover, they cannot extract the recent change of knowledge in a data stream adaptively by considering the old information which may not be interesting in the current time period. Another major limitation of the existing algorithms is to scan a database multiple times for finding the resultant weighted frequent patterns. In this paper, we propose a novel large-scale algorithm WFPMDS (Weighted Frequent Pattern Mining over Data Streams) for sliding window-based WFP mining over data streams. By using a single scan of data stream, the WFPMDS algorithm can discover important knowledge from the recent data elements. Extensive performance analyses show that our proposed algorithm is very efficient for sliding window-based WFP mining over data streams.

  • Adaptive Continuous Query Reoptimization over Data Streams

    Hong Kyu PARK  Won Suk LEE  

     
    PAPER-Database

      Vol:
    E92-D No:7
      Page(s):
    1421-1428

    A data stream is a series of massive unbounded tuples continuously generated at a rapid rate. Continuous queries for data streams should be processed continuously, so that a strict time constraint is required. In most previous research studies, in order to guarantee this constraint, the evaluation order of join predicates in a continuous query is optimized using a greedy strategy. However, because a greedy strategy traces only the first promising plan, it often finds a suboptimal plan. To reduce the possibility of producing a suboptimal plan, in this paper, we propose an improved scheme, k-Extended Greedy Algorithm (k-EGA), that simultaneously examines a set of promising plans and reoptimize an execution plan adaptively. The number of promising plans is flexibly controlled by a user-defined range variable. The scheme verifies the performance of the current plan periodically. If the plan is no longer efficient, a newly optimized plan is generated. The performance of the proposed scheme is verified through various experiments to identify its various characteristics.

  • Finding Cardinality Heavy-Hitters in Massive Traffic Data and Its Application to Anomaly Detection

    Keisuke ISHIBASHI  Tatsuya MORI  Ryoichi KAWAHARA  Yutaka HIROKAWA  Atsushi KOBAYASHI  Kimihiro YAMAMOTO  Hitoaki SAKAMOTO  Shoichiro ASANO  

     
    PAPER-Measurement Methodology for Network Quality Such as IP, TCP and Routing

      Vol:
    E91-B No:5
      Page(s):
    1331-1339

    We propose an algorithm for finding heavy hitters in terms of cardinality (the number of distinct items in a set) in massive traffic data using a small amount of memory. Examples of such cardinality heavy-hitters are hosts that send large numbers of flows, or hosts that communicate with large numbers of other hosts. Finding these hosts is crucial to the provision of good communication quality because they significantly affect the communications of other hosts via either malicious activities such as worm scans, spam distribution, or botnet control or normal activities such as being a member of a flash crowd or performing peer-to-peer (P2P) communication. To precisely determine the cardinality of a host we need tables of previously seen items for each host (e.g., flow tables for every host) and this may infeasible for a high-speed environment with a massive amount of traffic. In this paper, we use a cardinality estimation algorithm that does not require these tables but needs only a little information called the cardinality summary. This is made possible by relaxing the goal from exact counting to estimation of cardinality. In addition, we propose an algorithm that does not need to maintain the cardinality summary for each host, but only for partitioned addresses of a host. As a result, the required number of tables can be significantly decreased. We evaluated our algorithm using actual backbone traffic data to find the heavy-hitters in the number of flows and estimate the number of these flows. We found that while the accuracy degraded when estimating for hosts with few flows, the algorithm could accurately find the top-100 hosts in terms of the number of flows using a limited-sized memory. In addition, we found that the number of tables required to achieve a pre-defined accuracy increased logarithmically with respect to the total number of hosts, which indicates that our method is applicable for large traffic data for a very large number of hosts. We also introduce an application of our algorithm to anomaly detection. With actual traffic data, our method could successfully detect a sudden network scan.

  • Rate-Sensitive Load Shedding in Data Stream Systems

    Zhiwu YIN  Shangteng HUANG  Xun FAN  

     
    LETTER-Data Mining

      Vol:
    E90-D No:7
      Page(s):
    1111-1112

    Traditional load shedding algorithms for data stream systems calculate current operator selectivity over several run periods and use them to determine where to shed load during the next run period. In this paper, we show that the current selectivity may change due to the implementation of load shedding. Our algorithm, called RLS, determines the optimum drop location by these changed selectivity rather than those pre-calculated values. Simulation results demonstrate that RLS achieves higher accuracy than traditional algorithms.

  • Decaying Obsolete Information in Finding Recent Frequent Itemsets over Data Streams

    Joong Hyuk CHANG  Won Suk LEE  

     
    LETTER-Databases

      Vol:
    E87-D No:6
      Page(s):
    1588-1592

    A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. Consequently, the knowledge embedded in a data stream is likely to be changed as time goes by. However, most of mining algorithms or frequency approximation algorithms for a data stream are not able to extract the recent change of information in a data stream adaptively. This is because the obsolete information of old transactions which may be no longer useful or possibly invalid at present is regarded as important as that of recent transactions. This paper proposes an information decay method for finding recent frequent itemsets in a data stream. The effect of old transactions on the mining result of a data steam is gradually diminished as time goes by. Furthermore, the decay rate of information can be flexibly adjusted, which enables a user to define the desired life-time of the information of a transaction in a data stream.

FlyerIEICE has prepared a flyer regarding multilingual services. Please use the one in your native language.