Masashi SUGIYAMA Motoaki KAWANABE Gilles BLANCHARD Klaus-Robert MÜLLER
Obtaining the best linear unbiased estimator (BLUE) of noisy signals is a traditional but powerful approach to noise reduction. Explicitly computing the BLUE usually requires the prior knowledge of the noise covariance matrix and the subspace to which the true signal belongs. However, such prior knowledge is often unavailable in reality, which prevents us from applying the BLUE to real-world problems. To cope with this problem, we give a practical procedure for approximating the BLUE without such prior knowledge. Our additional assumption is that the true signal follows a non-Gaussian distribution while the noise is Gaussian.
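For reference, when the noise covariance C and a basis V of the signal subspace are known, the classical BLUE is an oblique projection of the observation. Below is a minimal NumPy sketch of this known-prior case, i.e., the quantity that our procedure approximates; all variable names and values are illustrative:

```python
import numpy as np

def blue(y, V, C):
    """Best linear unbiased estimate of a signal known to lie in span(V),
    observed as y = signal + noise, where the noise has covariance C."""
    Ci = np.linalg.inv(C)
    # Oblique projection onto span(V), weighted by the inverse noise covariance
    W = V @ np.linalg.solve(V.T @ Ci @ V, V.T @ Ci)
    return W @ y

# Toy usage: a signal in a 2-dimensional subspace of R^5
rng = np.random.default_rng(0)
V = rng.standard_normal((5, 2))
x = V @ np.array([1.0, -0.5])            # true signal in span(V)
C = 0.1 * np.eye(5)                      # noise covariance (known here)
y = x + rng.multivariate_normal(np.zeros(5), C)
print(blue(y, V, C))                     # close to x
```

The procedure proposed in the paper replaces the unknown C and V with estimates obtained under the non-Gaussian-signal/Gaussian-noise assumption.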
Kernel logistic regression (KLR) is a powerful and flexible classification algorithm that can also provide the confidence of its class predictions. However, its training--typically carried out by (quasi-)Newton methods--is rather time-consuming. In this paper, we propose an alternative probabilistic classification algorithm called the Least-Squares Probabilistic Classifier (LSPC). KLR models the class-posterior probability by a log-linear combination of kernel functions, and its parameters are learned by (regularized) maximum likelihood. In contrast, LSPC employs a linear combination of kernel functions, and its parameters are learned by regularized least-squares fitting of the true class-posterior probability. Thanks to this linear regularized least-squares formulation, the solution of LSPC can be computed analytically just by solving a regularized system of linear equations in a class-wise manner. Thus LSPC is computationally very efficient and numerically stable. Through experiments, we show that LSPC is faster than KLR in computation time by two orders of magnitude, with comparable classification accuracy.
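The class-wise linear-system structure is easy to see in code. The following is a minimal sketch of the regularized least-squares idea with a Gaussian kernel; the hyperparameters lam and sigma are hypothetical stand-ins for values that would be tuned, and the paper's exact objective is not reproduced:

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lspc_fit(X, y, n_classes, lam=0.1, sigma=1.0):
    # One regularized linear system per class:
    # (K^T K + lam I) alpha_c = K^T [y == c]
    K = gauss_kernel(X, X, sigma)
    A = K.T @ K + lam * np.eye(len(X))
    return [np.linalg.solve(A, K.T @ (y == c).astype(float))
            for c in range(n_classes)]

def lspc_predict_proba(X_test, X, alphas, sigma=1.0):
    K = gauss_kernel(X_test, X, sigma)
    # Clip negative outputs to zero, then normalize across classes
    P = np.maximum(np.stack([K @ a for a in alphas], axis=1), 0.0)
    return P / (P.sum(axis=1, keepdims=True) + 1e-12)
```

Since each class is handled independently by one linear solve, training cost grows linearly in the number of classes, which is the source of the speedup over Newton-type KLR training.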
Identifying the statistical independence of random variables is an important task in statistical data analysis. In this paper, we propose a novel non-parametric independence test based on a least-squares density ratio estimator. Our method, called the least-squares independence test (LSIT), is distribution-free, and thus more flexible than parametric approaches. Furthermore, it is equipped with a model selection procedure based on cross-validation, which is a significant advantage over existing non-parametric approaches that often require manual parameter tuning. The usefulness of the proposed method is shown through numerical experiments.
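A rough sketch of the density-ratio idea behind LSIT: fit the ratio p(x,y) / (p(x)p(y)) by least squares, using observed pairs as numerator samples and randomly re-paired data as denominator samples, then use an SMI-style score as the test statistic. The hyperparameters sigma and lam below are placeholders for values that the cross-validation procedure would select:

```python
import numpy as np

def gauss_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lsit_score(x, y, sigma=1.0, lam=0.1):
    """x: (n, dx) array, y: (n, dy) array.
    Least-squares fit of r(x, y) = p(x, y) / (p(x) p(y))."""
    n = len(x)
    rng = np.random.default_rng(0)
    Z_nu = np.hstack([x, y])                     # samples from p(x, y)
    Z_de = np.hstack([x, y[rng.permutation(n)]]) # approx. samples from p(x)p(y)
    Phi_nu = gauss_kernel(Z_nu, Z_nu, sigma)     # basis centers on the pairs
    Phi_de = gauss_kernel(Z_de, Z_nu, sigma)
    H = Phi_de.T @ Phi_de / n
    h = Phi_nu.mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(n), h)
    return 0.5 * h @ theta - 0.5                 # large value suggests dependence
```

In a test, the score would be compared against its null distribution, e.g., obtained by permuting y many times.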
Jaak SIMM Masashi SUGIYAMA Hirotaka HACHIYA
Reinforcement learning (RL) is a flexible framework for learning a decision rule in an unknown environment. However, a large number of samples are often required for finding a useful decision rule. To mitigate this problem, the concept of transfer learning has been employed to utilize knowledge obtained from similar RL tasks. However, most approaches developed so far are useful only in low-dimensional settings. In this paper, we propose a novel transfer learning idea that targets problems with high-dimensional states. Our idea is to transfer knowledge between state factors (e.g., interacting objects) within a single RL task. This allows the agent to learn the system dynamics of the target RL task with fewer data samples. The effectiveness of the proposed method is demonstrated through experiments.
Makoto YAMADA Masashi SUGIYAMA Gordon WICHERN Jaak SIMM
Estimating the ratio of two probability density functions (a.k.a. the importance) has recently gathered a great deal of attention since importance estimators can be used for solving various machine learning and data mining problems. In this paper, we propose a new importance estimation method using a mixture of probabilistic principal component analyzers. The proposed method is more flexible than existing approaches, and is expected to work well when the target importance function is correlated and rank-deficient. Through experiments, we illustrate the validity of the proposed approach.
Makoto YAMADA Masashi SUGIYAMA
The ratio of two probability densities is called the importance, and its estimation has recently gathered a great deal of attention since the importance can be used for various data processing purposes. In this paper, we propose a new importance estimation method using Gaussian mixture models (GMMs). Our method is an extension of the Kullback-Leibler importance estimation procedure (KLIEP), an importance estimation method using linear or kernel models. An advantage of GMMs is that covariance matrices can also be learned through an expectation-maximization procedure, so the proposed method--which we call the Gaussian mixture KLIEP (GM-KLIEP)--is expected to work well when the true importance function exhibits high correlation. Through experiments, we show the validity of the proposed approach.
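For concreteness, the linear-model KLIEP that GM-KLIEP extends can be sketched as projected gradient ascent on the mean log-importance over numerator samples, subject to the importance averaging to one over denominator samples. This is a simplified illustration with hypothetical step size and kernel width, not the GMM/EM variant proposed here:

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kliep(x_nu, x_de, sigma=1.0, n_iter=500, lr=1e-3):
    """Fit w(x) = sum_l alpha_l k(x, c_l) maximizing the mean log-importance
    on numerator samples, with mean importance 1 on denominator samples."""
    C = x_nu                                  # kernel centers at numerator points
    K_nu = gauss_kernel(x_nu, C, sigma)
    K_de = gauss_kernel(x_de, C, sigma)
    b = K_de.mean(axis=0)
    alpha = np.ones(len(C)) / b.sum()         # feasible start: b @ alpha = 1
    for _ in range(n_iter):
        w_nu = K_nu @ alpha
        alpha += lr * K_nu.T @ (1.0 / w_nu) / len(x_nu)  # gradient ascent step
        alpha = np.maximum(alpha, 0.0)                    # non-negativity
        alpha /= b @ alpha                                # renormalize constraint
    return lambda x: gauss_kernel(x, C, sigma) @ alpha    # estimated importance
```

GM-KLIEP replaces the fixed kernel centers and widths with mixture components whose means and covariance matrices are adapted by EM-style updates.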
Hyunha NAM Hirotaka HACHIYA Masashi SUGIYAMA
Multi-label classification allows a sample to belong to multiple classes simultaneously, which is often the case in real-world applications such as text categorization and image annotation. In multi-label scenarios, taking into account correlations among multiple labels can boost the classification accuracy. However, this makes classifier training more challenging because handling multiple labels induces a high-dimensional optimization problem. In this paper, we propose a scalable multi-label method based on the least-squares probabilistic classifier. Through experiments, we show the usefulness of our proposed method.
Kazuya UEKI Masashi SUGIYAMA Yasuyuki IHARA
In recent years, a great deal of effort has been made to estimate age from face images. It has been reported that age can be accurately estimated under controlled conditions such as frontal faces, no expression, and static lighting. However, it is not straightforward to achieve the same accuracy level in a real-world environment due to considerable variations in camera settings, facial poses, and illumination conditions. In this paper, we apply a recently proposed machine learning technique called covariate shift adaptation to alleviate the lighting-condition change between laboratory and practical environments. Through real-world age estimation experiments, we demonstrate the usefulness of the proposed method.
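The generic recipe behind covariate shift adaptation is importance-weighted training: the loss on each training sample is reweighted by the ratio of the test to the training input density, so that the fit remains consistent when the input distribution changes. A minimal sketch for a linear-in-parameters model; the weights w would themselves be estimated by a density-ratio method, and all names are illustrative:

```python
import numpy as np

def iw_least_squares(Phi, y, w, lam=1e-3):
    """Importance-weighted regularized least squares.
    Phi: (n, d) design matrix, y: (n,) targets,
    w:   (n,) importance weights ~ p_test(x_i) / p_train(x_i)."""
    W = np.diag(w)
    A = Phi.T @ W @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ (w * y))
```

In the age-estimation setting, the weights emphasize laboratory training images whose lighting resembles the practical test environment.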
Masashi SUGIYAMA Hidemitsu OGAWA
In this paper, we consider the problem of active learning and give a necessary and sufficient condition on sample points for the optimal generalization capability. By utilizing the properties of pseudo-orthogonal bases, we clarify the mechanism by which the optimal generalization capability is achieved. We also show that the condition not only provides the optimal generalization capability but also reduces the computation and memory required to calculate learning result functions. Based on the optimality condition, we give methods for designing optimal sample points for trigonometric polynomial models. Finally, the effectiveness of the proposed active learning method is demonstrated through computer simulations.
Masashi SUGIYAMA Keisuke SAKURAI
For obtaining a higher level of generalization capability in supervised learning, model parameters should be optimized, i.e., they should be determined in such a way that the generalization error is minimized. However, since the generalization error is inaccessible in practice, model parameters are usually determined in such a way that an estimate of the generalization error is minimized. A standard procedure for model parameter optimization is to first prepare a finite set of candidate model parameter values, estimate the generalization error for each candidate, and then choose the best one from the candidates. Increasing the number of candidates in this procedure may improve the optimization quality, but it in turn increases the computational cost. In this paper, we give methods for analytically finding the optimal model parameter value from a set of infinitely many candidates. This maximally enhances the optimization quality while keeping the computational cost reasonable.
Zhenghang CUI Issei SATO Masashi SUGIYAMA
With the emergence and thriving development of social networks, huge numbers of short texts are being accumulated and need to be processed. Inferring the latent topics of collected short texts is an essential task for understanding their hidden structure and predicting new content. The biterm topic model (BTM) was recently proposed for short texts to overcome the sparseness of document-level word co-occurrences by directly modeling the generation process of word pairs. Stochastic inference algorithms based on collapsed Gibbs sampling (CGS) and collapsed variational inference have been proposed for BTM. However, they either incur high computational complexity or rely on very crude estimates that do not preserve sufficient statistics. In this work, we develop a stochastic divergence minimization (SDM) inference algorithm for BTM to achieve better predictive likelihood in a scalable way. Experiments show that SDM-BTM trained on 30% of the data outperforms the best existing algorithm trained on the full data.
Ning XIE Hirotaka HACHIYA Masashi SUGIYAMA
Oriental ink painting, called Sumi-e, is one of the most distinctive painting styles and has attracted artists around the world. Major challenges in Sumi-e simulation are to abstract complex scene information and reproduce smooth and natural brush strokes. To automatically generate such strokes, we propose to model the brush as a reinforcement learning agent, and let the agent learn the desired brush trajectories by maximizing the sum of rewards in the policy search framework. To achieve better performance, we provide an elaborate design of actions, states, and rewards specifically tailored for a Sumi-e agent. The effectiveness of our proposed approach is demonstrated through experiments on Sumi-e simulation.
Masashi SUGIYAMA Ichiro TAKEUCHI Taiji SUZUKI Takafumi KANAMORI Hirotaka HACHIYA Daisuke OKANOHARA
Estimating the conditional mean of an input-output relation is the goal of regression. However, regression analysis is not sufficiently informative if the conditional distribution has multi-modality, is highly asymmetric, or contains heteroscedastic noise. In such scenarios, estimating the conditional distribution itself would be more useful. In this paper, we propose a novel method of conditional density estimation that is suitable for multi-dimensional continuous variables. The basic idea of the proposed method is to express the conditional density in terms of a density ratio, which is estimated directly without going through separate density estimation. Experiments using benchmark and robot transition datasets illustrate the usefulness of the proposed approach.
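A compact sketch of the direct ratio-fitting idea with Gaussian kernels, in the spirit of the proposed estimator: since p(y|x) = p(x,y)/p(x), a kernel model of the ratio can be fitted by regularized least squares, where the integrals over y have closed forms. Here sigma and lam are hypothetical values that would be cross-validated, and the naive computation of H below is cubic in the sample size:

```python
import numpy as np

def sq_dists(A, B):
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)

def lscde_fit(X, Y, sigma=1.0, lam=0.1):
    """Fit r(x, y) ~ p(x, y) / p(x) with Gaussian kernels centered on the
    training pairs; the solution is a single regularized linear system."""
    n, dy = Y.shape
    Dx, Dy = sq_dists(X, X), sq_dists(Y, Y)
    Phi = np.exp(-(Dx + Dy) / (2 * sigma ** 2))          # phi_l(x_i, y_i)
    h = Phi.mean(axis=0)
    # H_{lm}: average over i of the integral of phi_l * phi_m over y
    cross = np.exp(-(Dx[:, :, None] + Dx[:, None, :]) / (2 * sigma ** 2))
    H = (cross.mean(axis=0) * (np.pi * sigma ** 2) ** (dy / 2)
         * np.exp(-Dy / (4 * sigma ** 2)))
    theta = np.maximum(np.linalg.solve(H + lam * np.eye(n), h), 0.0)

    def cond_density(x, y):
        phi = np.exp(-(((x - X) ** 2).sum(-1) + ((y - Y) ** 2).sum(-1))
                     / (2 * sigma ** 2))
        # Normalize by the closed-form integral of theta^T phi(x, .) over y
        Z = theta @ ((2 * np.pi * sigma ** 2) ** (dy / 2)
                     * np.exp(-((x - X) ** 2).sum(-1) / (2 * sigma ** 2)))
        return theta @ phi / Z
    return cond_density
```

Because the fitted model is a sum of Gaussians in y, multi-modal and asymmetric conditional shapes come out directly.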
Squared-loss mutual information (SMI) is a robust measure of the statistical dependence between random variables. The sample-based SMI approximator called least-squares mutual information (LSMI) was demonstrated to be useful in performing various machine learning tasks such as dimension reduction, clustering, and causal inference. The original LSMI approximates the pointwise mutual information by using the kernel model, which is a linear combination of kernel basis functions located on paired data samples. Although LSMI was proved to achieve the optimal approximation accuracy asymptotically, its approximation capability is limited when the sample size is small due to an insufficient number of kernel basis functions. Increasing the number of kernel basis functions can mitigate this weakness, but a naive implementation of this idea significantly increases the computation costs. In this article, we show that LSMI with the multiplicative kernel model, which locates kernel basis functions on unpaired data samples so that the number of basis functions grows as the square of the sample size, has the same computational complexity as LSMI with the plain kernel model. We experimentally demonstrate that LSMI with the multiplicative kernel model is more accurate than that with plain kernel models in small-sample cases, with only a mild increase in computation time.
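The reason the multiplicative model costs no more than the plain one is that its design matrix is the row-wise Kronecker product of the two Gram matrices, so products against it factorize and the sample-size-squared columns never need to be formed. A self-contained numerical check with random stand-in Gram matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
Kx = rng.random((n, n))   # stand-in for k_x(x_i, x_l)
Ky = rng.random((n, n))   # stand-in for k_y(y_i, y_m)
v = rng.random(n)

# Naive: materialize the design matrix with n^2 multiplicative basis functions
Phi = np.einsum('il,im->ilm', Kx, Ky).reshape(n, n * n)   # O(n^3) memory
naive = Phi.T @ v

# Factorized: the same quantity as an n-by-n matrix product
fast = (Kx.T @ (v[:, None] * Ky)).reshape(-1)

print(np.allclose(naive, fast))   # True
```

All the matrix-vector products appearing in the LSMI linear system admit this factorization, which is what keeps the complexity at the plain-model level.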
Hiroki ISHIGURO Takashi ISHIDA Masashi SUGIYAMA
It has been demonstrated that large-scale labeled datasets facilitate the success of machine learning. However, collecting labeled data is often very costly and error-prone in practice. To cope with this problem, previous studies have considered the use of a complementary label, which specifies a class that an instance does not belong to and can be collected more easily than ordinary labels. However, complementary labels could also be error-prone and thus mitigating the influence of label noise is an important challenge to make complementary-label learning more useful in practice. In this paper, we derive conditions for the loss function such that the learning algorithm is not affected by noise in complementary labels. Experiments on benchmark datasets with noisy complementary labels demonstrate that the loss functions that satisfy our conditions significantly improve the classification performance.
Mutual information (MI) is a standard measure of statistical dependence of random variables. However, due to the log function and the ratio of probability densities included in MI, it is sensitive to outliers. On the other hand, the L2-distance variant of MI called quadratic MI (QMI) tends to be robust against outliers because QMI is just the integral of the squared difference between the joint density and the product of marginals. In this paper, we propose a kernel least-squares QMI estimator called least-squares QMI (LSQMI) that directly estimates the density difference without estimating each density. A notable advantage of LSQMI is that its solution can be analytically and efficiently computed just by solving a system of linear equations. We then apply LSQMI to dependence-maximization clustering, and demonstrate its usefulness experimentally.
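A minimal sketch in the spirit of LSQMI: fit the density difference f(x,y) = p(x,y) - p(x)p(y) with Gaussian basis functions, where the required integrals have closed forms and the solution is one linear system. Here sigma and lam are hypothetical hyperparameters, and the authors' exact estimator and its clustering application are not reproduced:

```python
import numpy as np

def lsqmi_score(x, y, sigma=1.0, lam=0.1):
    """x: (n, dx), y: (n, dy).  The squared L2 norm of the fitted
    density difference approximates QMI."""
    n = len(x)
    rng = np.random.default_rng(0)
    Z_joint = np.hstack([x, y])                     # pairs from p(x, y)
    Z_prod = np.hstack([x, y[rng.permutation(n)]])  # approx. from p(x)p(y)
    C = Z_joint                                     # Gaussian centers
    d = C.shape[1]

    def psi(Z):
        d2 = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    # H_{lm} = integral of psi_l * psi_m over the whole space (closed form)
    Dc = ((C[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    H = (np.pi * sigma ** 2) ** (d / 2) * np.exp(-Dc / (4 * sigma ** 2))
    h = psi(Z_joint).mean(axis=0) - psi(Z_prod).mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(n), h)
    return 2 * theta @ h - theta @ H @ theta        # estimate of QMI
```

Because no density ratio appears, extreme values of the densities at outliers do not blow up the estimate, which is the robustness property noted above.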
Wittawat JITKRITTUM Hirotaka HACHIYA Masashi SUGIYAMA
Feature selection is a technique to screen out less important features. Many existing supervised feature selection algorithms use redundancy and relevancy as the main criteria to select features. However, feature interaction, potentially a key characteristic in real-world problems, has not received much attention. As an attempt to take feature interaction into account, we propose ℓ1-LSMI, an ℓ1-regularization-based algorithm that maximizes a squared-loss variant of mutual information between selected features and outputs.
Masashi SUGIYAMA Hidemitsu OGAWA
In many practical situations in neural network learning, training examples tend to be supplied one by one. In such situations, incremental learning seems more natural than batch learning in view of how humans learn. In this paper, we propose an incremental learning method for neural networks under the projection learning criterion that provides exactly the same learning result as batch learning. Although projection learning is a linear learning method, achieving this goal is not straightforward since it involves redundant expressions of functions with over-complete bases, which are essentially related to pseudo-biorthogonal bases (or frames). It is theoretically shown that the proposed method is computationally more efficient than batch learning.
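Projection learning involves pseudo-biorthogonal bases beyond plain least squares, but the flavor of "incremental equals batch" can be seen in standard recursive least squares, shown here as a familiar analogue rather than the proposed method; the prior strength lam and the toy data are illustrative:

```python
import numpy as np

def rls_update(theta, P, phi, y):
    """One recursive-least-squares step: after seeing (phi, y), theta equals
    the batch (ridge) least-squares solution over all examples so far."""
    Pphi = P @ phi
    k = Pphi / (1.0 + phi @ Pphi)           # gain vector (Sherman-Morrison)
    theta = theta + k * (y - phi @ theta)   # correct by the prediction error
    P = P - np.outer(k, Pphi)               # update the inverse Gram matrix
    return theta, P

# Usage: start from a weak prior and feed examples one by one
d, lam = 3, 1e-6
theta, P = np.zeros(d), np.eye(d) / lam     # P = (lam * I)^{-1}
rng = np.random.default_rng(0)
for _ in range(100):
    phi = rng.standard_normal(d)
    y = phi @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal()
    theta, P = rls_update(theta, P, phi, y)
print(theta)   # approx. [1, -2, 0.5], matching the batch solution
```

Each update costs only O(d^2), whereas re-solving the batch problem from scratch at every step would cost O(d^3) per example.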
Masashi SUGIYAMA Daisuke IMAIZUMI Hidemitsu OGAWA
Most of the image restoration filters proposed so far include parameters that control the restoration properties. For bringing out the optimal restoration performance, these parameters should be determined so as to minimize a certain error measure such as the mean squared error (MSE) between the restored image and the original image. However, this is generally not possible since the unknown original image itself is required to evaluate the MSE. In this paper, we derive an estimator of the MSE called the subspace information criterion (SIC), and propose determining the parameter values so that SIC is minimized. For any linear filter, SIC gives an unbiased estimate of the expected MSE over the noise; therefore, the proposed method is valid for any linear filter. Computer simulations with the moving-average filter demonstrate that SIC gives a very accurate estimate of the MSE in various situations, and that the proposed procedure indeed finds the parameter values that minimize the MSE.