1-6hit |
Masaya NAKAHARA Shirou MARUYAMA Tetsuji KUBOYAMA Hiroshi SAKAMOTO
A scalable pattern discovery by compression is proposed. A string is representable by a context-free grammar deriving the string deterministically. In this framework of grammar-based compression, the aim of the algorithm is to output as small a grammar as possible. Beyond that, the optimization problem is approximately solvable. In such approximation algorithms, the compressor based on edit-sensitive parsing (ESP) is especially suitable for detecting maximal common substrings as well as long frequent substrings. Based on ESP, we design a linear time algorithm to find all frequent patterns in a string approximately and prove several lower bounds to guarantee the length of extracted patterns. We also examine the performance of our algorithm by experiments in biological sequences and other compressible real world texts. Compared to other practical algorithms, our algorithm is faster and more scalable with large and repetitive strings.
Shirou MARUYAMA Youhei TANAKA Hiroshi SAKAMOTO Masayuki TAKEDA
A framework of context-sensitive grammar transform for speeding-up compressed pattern matching (CPM) is proposed. A greedy compression algorithm with the transform model is presented as well as a Knuth-Morris-Pratt (KMP)-type compressed pattern matching algorithm. The compression ratio is a match for gzip and Re-Pair, and the search speed of our CPM algorithm is almost twice faster than the KMP-type CPM algorithm on Byte-Pair-Encoding by Shibata et al., and in the case of short patterns, faster than the Boyer-Moore-Horspool algorithm with the stopper encoding by Rautio et al., which is regarded as one of the best combinations that allows a practically fast search.
Hiroshi SAKAMOTO Shirou MARUYAMA Takuya KIDA Shinichi SHIMOZONO
A space-efficient approximation algorithm for the grammar-based compression problem, which requests for a given string to find a smallest context-free grammar deriving the string, is presented. For the input length n and an optimum CFG size g, the algorithm consumes only O(g log g) space and O(n log*n) time to achieve O((log*n)log n) approximation ratio to the optimum compression, where log*n is the maximum number of logarithms satisfying log log log n > 1. This ratio is thus regarded to almost O(log n), which is the currently best approximation ratio. While g depends on the string, it is known that g=Ω(log n) and for strings from k-letter alphabet [12].
Tatsuya ASAI Kenji ABE Shinji KAWASOE Hiroshi SAKAMOTO Hiroki ARIMURA Setsuo ARIKAWA
In this paper, we consider a data mining problem for semi-structured data. Modeling semi-structured data as labeled ordered trees, we present an efficient algorithm for discovering frequent substructures from a large collection of semi-structured data. By extending the enumeration technique developed by Bayardo (SIGMOD'98) for discovering long itemsets, our algorithm scales almost linearly in the total size of maximal tree patterns contained in an input collection depending mildly on the size of the longest pattern. We also developed several pruning techniques that significantly speed-up the search. Experiments on Web data show that our algorithm runs efficiently on real-life datasets combined with proposed pruning techniques in the wide range of parameters.
Shouhei FUKUNAGA Yoshimasa TAKABATAKE Tomohiro I Hiroshi SAKAMOTO
A grammar compression is a restricted context-free grammar (CFG) that derives a single string deterministically. The goal of a grammar compression algorithm is to develop a smaller CFG by finding and removing duplicate patterns, which is simply a frequent pattern discovery process. Any frequent pattern can be obtained in linear time; however, a huge working space is required for longer patterns, and the entire string must be preloaded into memory. We propose an online algorithm to address this problem approximately within compressed space. For an input sequence of symbols, a1,a2,..., let Gi be a grammar compression for the string a1a2…ai. In this study, an online algorithm is considered one that can compute Gi+1 from (Gi,ai+1) without explicitly decompressing Gi. Here, let G be a grammar compression for string S. We say that variable X approximates a substring P of S within approximation ratio δ iff for any interval [i,j] with P=S[i,j], the parse tree of G has a node labeled with X that derives S[l,r] for a subinterval [l,r] of [i,j] satisfying |[l,r]|≥δ|[i,j]|. Then, G solves the frequent pattern discovery problem approximately within δ iff for any frequent pattern P of S, there exists a variable that approximates P within δ. Here, δ is called the approximation ratio of G for S. Previously, the best approximation ratio obtained by a polynomial time algorithm was Ω(1/lg2|P|). The main contribution of this work is to present a new lower bound Ω(1/<*|S|lg|P|) that is smaller than the previous bound when lg*|S|
Kazurou HARADA Hiroshi SAKAMOTO Koosuke HARADA
A novel method of high frequency switching for dc-to-dc converter is presented. This method is based on the commutation with aid of inductor current, where a short interval (dead time) of both switches off is given for removing switching surges and switching losses. By the method of constant current-ripple using a saturable core, the condition of zero voltage switching is made independent of the duty ratio. The output voltage is regulated by pulse width modulation in the same way as that of the conventional converter of PWM control. The mechanism of commutation for giving a dead time is analyzed by assuming equivalent circuits, during which a zero voltage switching is realized. From calculations and experiments, it is found that a desired dead time is derived by connecting external capacitance to the gate-source terminal of the MOSFET.