Young-joo CHUNG Masashi TOYODA Masaru KITSUREGAWA
In this paper, we propose a method for finding web sites whose links are hijacked by web spammers. A hijacked site is a trustworthy site that points to untrustworthy sites. To detect hijacked sites, we evaluate the trustworthiness of web sites and examine how trustworthy sites are hijacked by untrustworthy sites among their out-neighbors. The trustworthiness is evaluated based on the difference between white and spam scores, which are calculated by two modified versions of PageRank. We define two hijacked scores that measure how likely a trustworthy site is to be hijacked, based on the distribution of trustworthiness among its out-neighbors. The performance of these hijacked scores is compared using our large-scale Japanese web archive. The results show that the score that considers both trustworthy and untrustworthy out-neighbors performs better than the one that considers only untrustworthy out-neighbors.
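The abstract does not give the exact score formulas, but the core idea of white and spam scores computed by seed-biased PageRank, with trust as their difference and a hijacked score over out-neighbors, can be sketched as follows. The toy graph, the seed sets, and the particular hijacked-score formula here are illustrative assumptions, not the paper's definitions:

```python
import numpy as np

def biased_pagerank(adj, seeds, d=0.85, iters=50):
    """Power iteration with the teleport vector biased toward a seed set."""
    n = len(adj)
    bias = np.zeros(n)
    bias[list(seeds)] = 1.0 / len(seeds)
    score = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = (1 - d) * bias
        for u, outs in enumerate(adj):
            if outs:
                new[list(outs)] += d * score[u] / len(outs)
            else:
                new += d * score[u] * bias  # return dangling mass to seeds
        score = new
    return score

adj = [{1, 2}, {2}, {0}, {2}]            # toy site-level link graph
white = biased_pagerank(adj, seeds={0})  # biased to known-good sites
spam = biased_pagerank(adj, seeds={3})   # biased to known-spam sites
trust = white - spam                     # trustworthiness, per the abstract
# hypothetical hijacked score: fraction of a trusted site's
# out-neighbors whose trust falls below zero
h = {u: sum(trust[v] < 0 for v in outs) / len(outs)
     for u, outs in enumerate(adj) if outs and trust[u] > 0}
print(h)
```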
Yong REN Nobuhiro KAJI Naoki YOSHINAGA Masaru KITSUREGAWA
In sentiment classification, conventional supervised approaches rely heavily on a large amount of linguistic resources, which are costly to obtain for under-resourced languages. To overcome this scarce-resource problem, several methods exploit graph-based semi-supervised learning (SSL). However, fundamental issues such as controlling label propagation, choosing the initial seeds, and selecting edges have barely been studied. Our evaluation on three real datasets demonstrates that manipulating the label-propagation behavior and choosing labeled seeds appropriately play a critical role in adopting graph-based SSL approaches for this task.
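As a rough illustration of the kind of graph-based SSL the abstract discusses, here is a minimal label-spreading sketch. The affinity matrix, the seed choice, and the propagation knob `alpha` are illustrative assumptions; the paper's actual graph construction and propagation control are not reproduced here:

```python
import numpy as np

def label_propagation(W, y, labeled, alpha=0.9, iters=100):
    """Iterative label spreading: F <- alpha * S @ F + (1 - alpha) * Y.
    W is a symmetric affinity matrix; alpha controls how far labels
    propagate, one of the knobs the abstract calls critical."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))   # symmetric normalization
    Y = np.zeros((len(y), 2))         # two classes: positive / negative
    for i in labeled:
        Y[i, y[i]] = 1.0
    F = Y.copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y
    return F.argmax(axis=1)

# toy document-similarity graph with two labeled seed documents
W = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
print(label_propagation(W, y=[0, 1, 0, 1], labeled=[0, 1]))
```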
Masashi TOYODA Masaru KITSUREGAWA
We propose a web community chart, a tool for navigating the Web and for observing its evolution through web communities. A web community is a set of web pages created by individuals or associations with a common interest in a topic. Recent research shows that such communities can be extracted by link analysis. Our web community chart is a graph of whole communities, in which relevant communities are connected by edges. Using this chart, we can navigate through related communities. Moreover, we can answer historical queries about topics on the Web and understand the sociology of web community creation by observing when and how communities emerged and evolved. We observe the evolution of communities by comparing three charts built from Japanese web archives crawled in 1999, 2000, and 2001. Several metrics are introduced for measuring the degree of community evolution, such as growth rate and novelty. Finally, we develop a web community evolution viewer that allows us to extract evolving communities using the relevance and the metrics. Several evolution examples are shown using this viewer.
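The abstract names growth rate and novelty without defining them; the following sketch uses plausible set-based readings of those two metrics over two archive snapshots. These formulas are assumptions for illustration, and the paper's exact definitions may differ:

```python
def evolution_metrics(old: set, new: set):
    """Compare a community's membership across two archive snapshots.
    growth: relative change in community size.
    novelty: fraction of current members absent from the old snapshot."""
    growth = (len(new) - len(old)) / len(old)
    novelty = len(new - old) / len(new)
    return growth, novelty

c1999 = {"a.jp", "b.jp", "c.jp"}
c2000 = {"a.jp", "b.jp", "d.jp", "e.jp"}
print(evolution_metrics(c1999, c2000))
```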
Yongkun WANG Kazuo GODA Miyuki NAKANO Masaru KITSUREGAWA
Flash SSDs are being incorporated into many enterprise storage platforms and are expected to play a notable role in IO-intensive applications. However, the IO characteristics of flash SSDs are very different from those of hard disks. Since existing storage subsystems are designed around the characteristics of hard disks, the expected IO performance may not be obtained from flash SSDs. This paper provides an evaluation of flash SSDs in transaction processing systems with the TPC-C benchmark. We present performance results for various configurations and describe our observations of the IO behavior at different levels along the IO path, which helps in understanding the performance of flash-based transaction processing systems and provides useful references for building flash-based systems for IO-intensive applications.
Yasuhito ASANO Takao NISHIZEKI Masashi TOYODA Masaru KITSUREGAWA
There are several methods for mining communities on the Web using hyperlinks. One of the well-known ones is the max-flow-based method proposed by Flake et al. The method adopts a page-oriented framework; that is, it uses a page on the Web as the unit of information, like other methods including HITS and trawling. Recently, Asano et al. built a site-oriented framework that uses a site as the unit of information, and they experimentally showed that trawling on the site-oriented framework often outputs significantly better communities than trawling on the page-oriented framework. However, it has not been known whether the site-oriented framework is effective in mining communities with the max-flow-based method. In this paper, we first point out several problems of the max-flow-based method, mainly owing to the page-oriented framework, and then propose solutions to these problems by utilizing several advantages of the site-oriented framework. Computational experiments reveal that our max-flow-based method on the site-oriented framework is very effective in mining communities related to the topics of given pages, in comparison with the original max-flow-based method on the page-oriented framework.
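For readers unfamiliar with the Flake-style construction, the sketch below shows the general max-flow community idea on a site-level graph: attach a virtual source to the seed sites, a virtual sink to everything else, and take the source side of a minimum cut. The graph and capacities are toy assumptions, and the result is quite sensitive to them, which is one of the problems variable capacities later address:

```python
import networkx as nx

def max_flow_community(G, seeds, virtual_cap=1):
    """Flake-style extraction: the community is the source side of a
    minimum s-t cut. Vertices here are sites, following the
    site-oriented framework; capacities are illustrative."""
    H = nx.DiGraph()
    for u, v in G.edges():
        H.add_edge(u, v, capacity=1)
        H.add_edge(v, u, capacity=1)
    for s in seeds:
        H.add_edge("SRC", s, capacity=float("inf"))
    for v in G.nodes():
        if v not in seeds:
            H.add_edge(v, "SINK", capacity=virtual_cap)
    _, (community, _) = nx.minimum_cut(H, "SRC", "SINK")
    return community - {"SRC"}

G = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")])
print(max_flow_community(G, seeds={"a"}))
```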
Kazuhiko MOGI Masaru KITSUREGAWA
RAID5 disk arrays provide high performance and high reliability at reasonable cost. However, RAID5 suffers a performance penalty during block updates. In this paper, we propose a method to improve the small-write performance of RAID5 disk arrays, named Virtual Striping. Instead of updating each block independently, this method buffers a number of updates, generates a new stripe composed of the newly updated blocks, and then writes the full stripe back to disk. In order to make free space for write operations, a new garbage collection strategy is employed, in which Virtual Striping changes the linkage of blocks in a parity stripe. The LFS (log-structured file system) based storage management scheme also writes new blocks onto a large free area and uses copying garbage collection. In this paper, we compare the performance of both methods through simulation. Although the write cost of Virtual Striping is higher than that of the LFS-based method, Virtual Striping performs better overall. This is due to the high efficiency of garbage collection in Virtual Striping.
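A minimal sketch of the full-stripe write idea follows: buffer small updates until a whole stripe's worth accumulates, compute parity once over the new blocks, and write data plus parity in one pass, avoiding RAID5's read-modify-write penalty. The class, stripe width, and payloads are illustrative, and the garbage collection of superseded block versions is omitted:

```python
from functools import reduce

STRIPE_WIDTH = 4  # data blocks per stripe; parity adds one more

class VirtualStripingBuffer:
    """Buffer updated blocks and flush them as full stripes."""
    def __init__(self):
        self.pending = []  # (block_id, payload bytes)

    def update(self, block_id, payload):
        self.pending.append((block_id, payload))
        if len(self.pending) == STRIPE_WIDTH:
            self.flush()

    def flush(self):
        blocks = [p for _, p in self.pending]
        # parity is the XOR of the newly updated blocks only
        parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)
        print("write stripe:", [bid for bid, _ in self.pending], "+ parity")
        self.pending.clear()

buf = VirtualStripingBuffer()
for i in range(8):
    buf.update(i, bytes([i]) * 512)
```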
Masaru KITSUREGAWA Mikio TAKAGI
Several types of machines have been proposed to improve the performance of database management systems, especially relational ones. Among them, a cellular-logic-type database machine such as RAP is characterized by its simple structure. While this type of machine outperforms conventional DBMS software by orders of magnitude for relatively light operations such as selection and update, it exhibits poor performance for heavy operations such as join and projection, because it adopts the very inefficient nested-loop algorithm. Join has so far been the major performance bottleneck. In this paper, we propose a performance enhancement mechanism for the cellular logic database machine: a novel relational algebra processing algorithm based on the dynamic clustering feature of hashing. By introducing hashing hardware, the join operation is much accelerated in comparison with the conventional cellular logic database machine. Its execution time is evaluated through simulation experiments. Another major problem is handling large relations that cannot fit into the cell memories, where frequent paging degrades performance severely. We also propose a bucket-based staging scheme, through which the enhanced architecture can perform joins of large relations efficiently.
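The dynamic clustering idea can be seen in software form in the classic bucket-partitioned hash join below: hashing the join key clusters matching tuples into the same bucket, so each bucket pair joins independently and can be staged into limited memory. This is a generic sketch of the technique, not the machine's actual hardware algorithm:

```python
from collections import defaultdict

def bucket_hash_join(R, S, key_r, key_s, n_buckets=8):
    """Partition both relations by hash of the join key, then join
    matching buckets with a small in-memory hash table each."""
    def partition(rel, key):
        buckets = defaultdict(list)
        for t in rel:
            buckets[hash(t[key]) % n_buckets].append(t)
        return buckets

    br, bs = partition(R, key_r), partition(S, key_s)
    for i in br:
        table = defaultdict(list)
        for r in br[i]:
            table[r[key_r]].append(r)   # build side, per bucket
        for s in bs.get(i, []):
            for r in table.get(s[key_s], []):
                yield r + s             # probe side

R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (3, "z")]
print(list(bucket_hash_join(R, S, key_r=0, key_s=0)))
```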
Noriko IMAFUJI Masaru KITSUREGAWA
A web community is a set of web pages that provide resources on a specific topic. Various methods for finding web communities based on link analysis have been proposed in the literature. The method proposed in this paper builds on the previously proposed method that uses the maximum flow algorithm. Our objective in using the maximum flow algorithm is to extract a subgraph that can be recognized as a good web community in terms of both quantity and quality. This paper first discusses the features of the maximum-flow-based method. The previously proposed approach has the problem that a certain graph structure containing noise (i.e., irrelevant pages) is always extracted, mainly because edge capacities are assigned a constant value. This paper proposes an assignment of variable edge capacities based on hub and authority scores obtained from the HITS calculation. To examine the effects of our proposed method, we performed experiments using a Japanese web archive crawled in February 2002. Our experimental results demonstrate that our proposed method removes noise pages caused by constant edge capacities and improves the quality of web communities.
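The abstract does not state the exact capacity formula, so the sketch below computes plain HITS hub and authority scores and derives edge capacities under an assumed multiplicative rule, which makes links into low-authority (likely noise) pages cheap to cut:

```python
import numpy as np

def hits(adj, iters=50):
    """Plain HITS power iteration over an adjacency list."""
    n = len(adj)
    hub, auth = np.ones(n), np.ones(n)
    for _ in range(iters):
        new_auth = np.zeros(n)
        for u, outs in enumerate(adj):
            for v in outs:
                new_auth[v] += hub[u]
        auth = new_auth / np.linalg.norm(new_auth)
        hub = np.array([sum(auth[v] for v in outs) for outs in adj])
        hub /= np.linalg.norm(hub)
    return hub, auth

# hypothetical rule: capacity of link (u, v) grows with the hub score
# of its source and the authority score of its target
adj = [[1, 2], [2], [0, 3], []]
hub, auth = hits(adj)
capacity = {(u, v): hub[u] * auth[v] for u, outs in enumerate(adj) for v in outs}
print(capacity)
```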
Miyuki NAKANO Masaru KITSUREGAWA
The join operation is one of the most expensive operations in relational database systems, and many hash-based join algorithms have been proposed for it. In a hash-based algorithm, a large relation is first partitioned into several clusters. When clusters overflow, that is, when the size of a cluster exceeds the size of main memory, the performance of hash-based algorithms degrades substantially. We previously proposed the GN hash join algorithm, which is robust in the presence of overflown clusters; it combines the Grace hash join and hash-based nested-loop join algorithms. We analyze the performance of the GN hash join algorithm when applied to relations with a non-uniform, Zipf-like data distribution, and compare it with other hash-based join algorithms: Grace, Hybrid, nested-loop, and simple hash join. The GN hash join algorithm is found to have higher performance on non-uniformly distributed relations. In this paper, we also verify the robustness of the GN hash algorithm with respect to choosing a run-time method. In the GN hash algorithm, the criterion for selecting a run-time method from the two algorithms is determined from the I/O cost formulas of the two algorithms. This criterion cannot be guaranteed to be optimal under every data distribution; that is, the optimal criterion may change depending on the data distribution. When the data distribution is unknown, all data would have to be repartitioned to obtain an accurate optimal criterion. However, for choosing a method at run time, the GN hash algorithm needs a criterion that works regardless of the data distribution. Thus, we inspect the criterion adopted in our algorithm in a simulation environment. From the simulation results, we find that the range of acceptable criterion values is very wide under any data distribution, and we confirm that the criterion determined under the assumption of a uniform data distribution can be used even when the data is highly skewed. Consequently, we conclude that the GN hash algorithm, which dynamically selects between the nested-loop and Grace hash algorithms, provides good performance in the presence of data skew, and that its performance is not sensitive to the criterion.
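The run-time selection can be illustrated with a per-cluster cost comparison in the spirit of GN hash join. The cost formulas below are simplified stand-ins for the paper's actual I/O cost formulas; the crossover they produce (nested loops for mildly overflown clusters, Grace for heavily overflown ones) is the behavior the criterion encodes:

```python
def choose_join_method(cluster_pages, memory_pages):
    """Pick a method for one cluster based on estimated I/O."""
    if cluster_pages <= memory_pages:
        return "in-memory hash join"
    # Grace: read, write back while repartitioning, then read again
    grace_io = 3 * cluster_pages
    # nested loop: scan the cluster once per memory-sized chunk
    passes = -(-cluster_pages // memory_pages)  # ceiling division
    nested_io = passes * cluster_pages
    return "grace" if grace_io < nested_io else "nested-loop"

for size in (50, 150, 400):
    print(size, "pages ->", choose_join_method(size, memory_pages=100))
```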
Yutaro BESSHO Yuto HAYAMIZU Kazuo GODA Masaru KITSUREGAWA
Parallel processing is a typical approach to answering analytical queries on a large database. As the size of the database increases, we often try to increase the parallelism by incorporating more processing nodes. However, this approach also increases the possibility of node failure. In conventional practice, if a failure occurs during query processing, the database system restarts the query from the beginning; such a time cost may be unacceptable to the user. This paper proposes a fault-tolerant query processing mechanism, named PhoeniQ, for analytical parallel database systems. PhoeniQ continuously takes checkpoints for every operator pipeline and replicates the output of each stateful operator among different processing nodes. If a single processing node fails during query processing, another can promptly take over the processing. Hence, PhoeniQ allows the database system to efficiently resume query processing after a partial failure. This paper presents the key design of PhoeniQ and prototype-based experiments demonstrating that it imposes negligible performance overhead and efficiently continues query processing in the face of node failure.
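A toy model of the recovery idea follows: each node checkpoints its pipeline progress and mirrors stateful-operator output to a buddy node, so the buddy resumes from the last checkpoint rather than restarting the query. The class, the partition granularity, and the buddy scheme are assumptions for illustration, not PhoeniQ's actual design:

```python
class Node:
    """Checkpoints progress and replicates operator state to a buddy."""
    def __init__(self, name):
        self.name = name
        self.checkpoint = 0       # next partition to process
        self.replica_state = {}   # buddy's replicated operator output

    def process(self, partitions, buddy, fail_at=None):
        for i, part in enumerate(partitions[self.checkpoint:], self.checkpoint):
            if i == fail_at:
                raise RuntimeError(f"{self.name} failed at partition {i}")
            state = sum(part)               # stand-in stateful operator
            buddy.replica_state[i] = state  # replicate before committing
            self.checkpoint = i + 1

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
a, b = Node("A"), Node("B")
try:
    a.process(parts, buddy=b, fail_at=2)
except RuntimeError as e:
    print(e)
    b.checkpoint = max(b.replica_state) + 1  # skip replicated work
    b.process(parts, buddy=a)
    print("B resumed and finished; checkpoint =", b.checkpoint)
```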
Zhenglu YANG Lin LI Masaru KITSUREGAWA
The skyline query is important because it is the basis of many applications, e.g., decision making and user-preference queries. Given an N-dimensional dataset D, a point p is said to dominate another point q if p is better than q in at least one dimension and equal to or better than q in the remaining dimensions. In this paper, we study a generalization of the skyline query in which users are interested in the details of the dominance relationship in a dataset, i.e., how many other points a point p dominates and which points they are. We show that the existing framework proposed in prior work cannot efficiently solve this problem. We find an interrelated connection between the partial order and the dominance relationship. Based on this discovery, we propose a new data structure, ParCube, which concisely represents the dominance relationship, along with effective strategies to construct it. Extensive experiments illustrate the efficiency of our methods.
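The dominance definition and the generalized query translate directly into code. The naive quadratic baseline below answers "how many points does p dominate, and which ones" and is exactly what ParCube is designed to beat at scale (smaller-is-better on every axis is an assumption of this sketch):

```python
def dominates(p, q):
    """p dominates q: no worse in every dimension and strictly
    better in at least one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def dominance_details(D):
    """For each point, the set of points it dominates: O(n^2)."""
    return {p: [q for q in D if dominates(p, q)] for p in D}

D = [(1, 2), (2, 3), (3, 1), (4, 4)]
for p, dominated in dominance_details(D).items():
    print(p, "dominates", len(dominated), "points:", dominated)
```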
Yasuhito ASANO Hiroshi IMAI Masashi TOYODA Masaru KITSUREGAWA
In this paper, we present Neighbor Community Finder (NCF, for short), a tool for finding Web communities related to given URLs. While existing link-based methods of finding communities, such as HITS, trawling, and Companion, use algorithms running on a Web graph whose vertices are pages and whose edges are links, NCF uses an algorithm running on an inter-site graph whose vertices are sites and whose edges are global links (links between sites). Since the phrase "Web site" is used ambiguously in daily life and has no unique definition, NCF uses directory-based sites, proposed by the authors, as a model of Web sites. NCF receives URLs in which a user is interested and constructs an inter-site graph containing the neighbor sites of the given URLs, using a method that identifies directory-based sites from URL and link data obtained from the actual Web on demand. Through computational experiments, we show that NCF achieves higher quality than Google's "Similar Pages" service in finding pages related to given URLs on various topics selected from the directories of Yahoo! Japan.
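To make the site-as-vertex idea concrete, here is an illustrative stand-in for directory-based site identification: pages are collapsed into one site vertex when they share a host and leading path directory (e.g., a user directory like /~alice/). The paper's actual identification method is more involved than this sketch:

```python
from urllib.parse import urlparse

def directory_site_key(url, depth=1):
    """Map a URL to a coarse 'site' key: host + first path directories.
    The depth parameter is an assumption of this sketch."""
    u = urlparse(url)
    dirs = [d for d in u.path.split("/") if d][:depth]
    return u.netloc + "/" + "/".join(dirs)

pages = ["http://www.example.ac.jp/~alice/research/",
         "http://www.example.ac.jp/~alice/hobby.html",
         "http://www.example.ac.jp/~bob/index.html"]
print({p: directory_site_key(p) for p in pages})
# pages under ~alice collapse into one site vertex; ~bob is another
```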
Yasuhito ASANO Tsuyoshi ITO Hiroshi IMAI Masashi TOYODA Masaru KITSUREGAWA
Compact encodings of the web graph are required in order to keep the graph in main memory and to perform operations on it efficiently. In this paper, we propose a new compact encoding of the web graph. It is 10% more compact than Link2, used in the Connectivity Server of AltaVista, and 20% more compact than the encoding proposed by Guillaume et al. in 2002, while remaining comparable to the latter in extraction time.
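The abstract does not describe the encoding itself, so as background, the sketch below shows the standard gap-plus-varint trick underlying most compact web-graph formats: sort each adjacency list, store successive differences, and pack each difference into as few bytes as it needs. This is a generic technique, not the paper's specific encoding:

```python
def encode_adjacency(neighbors):
    """Gap + variable-length-byte encoding of a sorted adjacency list."""
    out, prev = bytearray(), 0
    for v in sorted(neighbors):
        gap = v - prev
        prev = v
        while gap >= 0x80:             # 7 payload bits per byte
            out.append((gap & 0x7F) | 0x80)  # MSB set = more bytes follow
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_adjacency(buf):
    vals, cur, shift, prev = [], 0, 0, 0
    for b in buf:
        cur |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            prev += cur                # undo the gap encoding
            vals.append(prev)
            cur, shift = 0, 0
    return vals

adj = [1004, 1007, 1100, 1, 5]
assert decode_adjacency(encode_adjacency(adj)) == sorted(adj)
```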
Masaru KITSUREGAWA Weikang YANG Satoshi HIRANO Masanobu HARADA Minoru NAKAMURA Kazuhiro SUZUKI Takayuki TAMURA Mikio TAKAGI
This paper presents an overview of SDC-I (Super Database Computer I), developed at the University of Tokyo, Japan. The purpose of the project was to build a high-performance SQL server that emphasizes query processing over transaction processing. Relational database systems are increasingly used for heavy decision-support queries, which include many join, aggregation, and order-by operations. At present, high-end mainframes are used for these applications, requiring several hours in some cases. While the system architecture for high-traffic transaction processing is well established, that for ad hoc query processing has not yet been adequately understood. SDC-I proved that a parallel machine can attain significant performance improvements over a conventional sequential machine by exploiting the high degree of parallelism present in relational query processing. A unique bucket-spreading parallel hash join algorithm is employed in SDC, which makes the system very robust in the presence of data skew and allows SDC to attain almost linear performance scalability. SDC adopts a hybrid parallel architecture: globally it is a shared-nothing architecture, that is, modules are connected through a multistage network, but each module itself is a symmetric multiprocessor system. Although most of the hardware elements use commodity microprocessors for a better performance-to-cost ratio, the interconnection network incorporates a special function to support our parallel relational algorithm. For I/O-intensive database processing, the heavy work is data movement over the memory and the network rather than computation, so a dedicated software system was carefully designed for efficient data movement. The implemented prototype consists of two modules; its hardware and software organization is described. A performance monitoring tool was developed to visualize system activities, and it showed that SDC-I works very efficiently.
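The skew robustness of bucket spreading can be sketched in software. In the actual algorithm each bucket is first spread evenly over all nodes during partitioning (with hardware assistance in the network); the sketch below shows only the size-aware bucket-to-node allocation that then restores balance when some keys are very hot:

```python
from collections import Counter
import random

def bucket_spreading(tuples, n_nodes, n_buckets=64):
    """Hash into many more buckets than nodes, then greedily place the
    largest buckets on the least-loaded node (LPT-style balancing)."""
    buckets = Counter()
    for t in tuples:
        buckets[hash(t) % n_buckets] += 1
    load = [0] * n_nodes
    for b, size in buckets.most_common():
        node = load.index(min(load))
        load[node] += size
    return load

# a skewed workload: one hot key dominates
skewed = [random.choice(["hot"] * 50 + list(range(200))) for _ in range(10000)]
print(bucket_spreading(skewed, n_nodes=4))  # per-node loads stay close
```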
Takayuki TAMURA Masato OGUCHI Masaru KITSUREGAWA
We developed a PC cluster system consisting of 100 PCs as a test bed for massively parallel query processing. Each PC employs a 200 MHz Pentium Pro CPU and is connected to the others through an ATM switch. Because query processing applications are insensitive to communication latency and mainly perform integer operations, the ATM-connected PC cluster approach can be considered a reasonable solution for high-performance, low-cost database servers. However, as far as the authors know, there had been no attempt to construct a large-scale PC cluster for database applications. Though we employed commodity components as much as possible, we developed the DBMS itself, because it was a key component for obtaining high performance in parallel query processing and no existing system seemed to meet our demands. On each PC node, a server program acting as a database kernel processes queries in cooperation with the other nodes. The kernel was designed to execute pipelined operators and handle voluminous data efficiently, to achieve high performance on complex decision-support queries. We used the standard TPC-D benchmark on a 100 GB database to verify the feasibility of our approach through comparison with commercial parallel systems. As a whole, our system exhibited high performance competitive with the then-current TPC-D top records, despite not using indices. For some heavy queries in the benchmark, with high selectivity and joinability, our system performed much better. In addition, we applied a transposed file organization to the database for further performance improvement. The transposed file organization vertically partitions the tuples, enabling attribute-by-attribute access to the relations. This resulted in significant performance improvement by reducing the amount of disk I/O and shifting the bottleneck to computation.
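The transposed file organization is what is now commonly called columnar storage; the miniature sketch below shows why it cuts disk I/O for analytical scans. The relation and attribute names are illustrative, not the benchmark schema:

```python
def transpose(relation):
    """Store each attribute contiguously so a query touching one
    column reads only that column instead of whole tuples."""
    names = ["orderkey", "price", "comment"]
    return {name: [t[i] for t in relation] for i, name in enumerate(names)}

rows = [(1, 10.5, "x"), (2, 99.0, "y"), (3, 42.0, "z")]
cols = transpose(rows)
# an aggregate over 'price' now scans a single array
print(sum(cols["price"]) / len(cols["price"]))
```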