Stochastic Dual Coordinate Ascent for Learning Sign Constrained Linear Predictors

Yuya TAKADA; Rikuto MOCHIDA; Miya NAKAJIMA; Syun-suke KADOYA; Daisuke SANO; Tsuyoshi KATO

doi:10.1587/transinf.2023EDP7139

Yuya TAKADA, Rikuto MOCHIDA, Miya NAKAJIMA, Syun-suke KADOYA, Daisuke SANO, Tsuyoshi KATO

1. Introduction

Machine learning problems for linear prediction are often formulated as an empirical risk minimization (ERM) problem [9]. Let $\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{n}$ be input vectors in $\mathbb{R}^{d}$, let $\phi_{1},\dots,\phi_{n}:\mathbb{R}\to\mathbb{R}$ be convex loss functions, and let $\lambda$ be a positive regularization constant. The ERM problem discussed in this paper is described as follows:

\[\begin{equation*} \begin{aligned} \text{min}\quad & P(\boldsymbol{w}) \quad \text{wrt } \boldsymbol{w}\in\mathbb{R}^{d}, \\ \text{where }\quad & P(\boldsymbol{w}) := \frac{\lambda}{2}\lVert\boldsymbol{w}\rVert^{2} + \frac{1}{n}\sum_{i=1}^{n}\phi_{i}(\left<\boldsymbol{w},\boldsymbol{x}_{i}\right>). \end{aligned} \tag{1} \end{equation*}\]

Support vector machines (SVM) are recovered if we set the loss functions to the hinge loss $\phi_{i}(s)=\max(0,1-y_{i}s)$ where $y_{i}\in\{\pm 1\}$ are the class labels. Setting the loss functions to the log loss $\phi_{i}(s)=\log(1+\exp(-y_{i}s))$, logistic regression is obtained. With $y_{i}$ continuous labels, setting the square error loss function $\phi_{i}(s)=\frac{1}{2}(y_{i}-s)^{2}$ yields the ridge regression.

Recently, Tajima et al. [29] constrained the signs of the weights $\boldsymbol{w}$ to the linear SVM algorithm, and demonstrated the effectiveness of the sign constraints in the application to a biological sequence classification. The sign constraints are given to some of coefficients in the weight vector $\boldsymbol{w}=\left[w_{1},\dots,w_{d}\right]^\top$. For some pre-defined subset of indices $\mathcal{I}_{\ge}\subseteq [n]$, where $[n]:=\{1,\dots,n\}$, the non-negative constraints $w_{h}\ge 0$ are given for every $h\in\mathcal{I}_{\ge}$, and for another pre-defined subset $\mathcal{I}_{\le}\subseteq ([n]\setminus \mathcal{I}_{\ge})$, the non-positive constraints $w_{h}\le 0$ are given for every $h\in\mathcal{I}_{\le}$.

The sign constraints explicitly avoid violation of the prior knowledge for the directions of correlations between features and class labels. Negative weight coefficients $w_{h}$ are undesired if positive correlation between the $h$th features and the class label is known in advance. Nevertheless, without the sign constraints, a portion of coefficients $w_{h}$ can be negative, which degrades the generalization performance. Similarly, positive weight coefficients are unfavorable if negative correlation to the class label is known in advance. Posing the sign constraints prevent the coefficients from falling into such an unfavorable region.

In this paper, we present new optimization algorithms for the sign-constrained ERM problems. The proposed algorithms solve a dual problem instead of minimizing the primal objective directly, which enables us to use a clear termination criterion which is the difference between the primal objective and the dual objective values. When the difference between the primal objective and the dual objective values is below a threshold, the primal objective gap is ensured to be smaller than the threshold. Tajima et al. employed the Frank-Wolfe algorithm [13] for a slightly different problem in which their algorithm is specialized to the sign-constrained ERM based on the classical non-smooth hinge loss function. The proposed algorithms are based on the stochastic dual coordinate ascent (SDCA) framework [27] to solve the sign-constrained ERM formulated with smooth loss functions, where being smooth means having a Lipschitz-continuous gradient. An attractive property of the proposed algorithms is a theoretical guarantee that ensures the exponential convergence [24] upper-bounding the number of iterations to attain a sufficiently small sub-optimality.

Besides the aforementioned work reported by Tajima et al., a large potential of this kind of prior knowledge suitable to the sign constraints may exist in many applications but may not have been discovered so far. For example, in the domain of water engineering, numerous prior studies, excluding [16], have overlooked this valuable reservoir of knowledge, despite the well-established associations between various water quality metrics and microbiological concentrations. Microbiological water quality datasets often exhibit limited size due to the considerable expenses associated with data collection. It has been observed that typical water quality metrics utilized in previous studies (e.g. [16]) are indeed associated with microbiological concentrations in water. Nevertheless, these associations tend to be weak, which can result in contrasting correlations within a small dataset. Kato et al. [16] reported the effectiveness of incorporating sign constraints in regression tasks. In this study, our algorithm was applied to binary classification tasks for microbiological water quality analysis to demonstrate the power of the sign constraints.

This paper is organized as follows. Related work is discussed in the next section. In Sect. 3, the learning problem with the sign constraints is formulated and its dual problem is described. After the general SDCA framework is introduced in Sect. 4, the implementations of SDCA iterations for the sign constrained learning problem are presented in Sect. 5. The experimental results for runtime comparison and the application to microbiological water quality analysis are reported in Sect. 6, followed by the last section concluding this paper.

Page top

2. Related Work

The sign constraints have been used widely in regression and classification. Readers familiar with machine learning may recognize the sign constrained regression as one of the important components of the non-negative matrix factorization [5], [8], [18], [21], [30]. Besides it, the sign constrained least square estimation is applied to widespread applications including non-negative image restoration [11], [19], [28], [31], face representation [10], [14], microbial analysis [3], pathogenic water quality analysis [16], image super-resolution [6], spectral analysis [33], tomographic imaging [23], and sound source localization [22]. For classification, Tajima et al. [29] developed the sign-constrained support vector machines. Fernandes et al. [7] studied other loss functions in a different formulation, which penalizes the weights violating prior knowledge instead of posing sign constraints. For the square error loss function, computationally stable and fast optimization algorithms are available [2], [17], [20]. For the hinge loss function, Tajima et al. developed a Frank-Wolfe optimization algorithm [13]. Meanwhile, without sign constraints, there are many stable optimization algorithms for generic empirical risk minimization [4], [15], [25], [26], [32]. However, to the best of our knowledge, algorithms for optimizing with generic loss functions under sign constraints have not been studied well so far.

Page top

3. Primal and Dual Problems

The goal of this work is to develop an optimization algorithm for the following constrained ERM problem:

\[\begin{equation*} \begin{aligned} \text{min}\quad & P(\boldsymbol{w}) \quad \text{wrt } \boldsymbol{w}\in \mathbb{R}^{d}, \\ \text{subject to}\quad & \forall h\in\mathcal{I}_{\ge}, \quad w_{h}\ge 0, \\ & \forall h\in\mathcal{I}_{\le}, \quad w_{h}\le 0, \end{aligned} \tag{2} \end{equation*}\]

The index sets $\mathcal{I}_{\ge}$ and $\mathcal{I}_{\le}$ are assumed to be designed so that

\[\begin{equation*} \begin{aligned} \mathcal{I}_{\ge}\cup\mathcal{I}_{\le} \subseteq [d] \quad\text{ and }\quad \mathcal{I}_{\ge}\cap\mathcal{I}_{\le}=\emptyset. \end{aligned} \tag{3} \end{equation*}\]

The remaining index set $\mathcal{I}_{0}:=[d]\setminus(\mathcal{I}_{\ge}\cup\mathcal{I}_{\le})$ may be non-empty. Typically, the index of the bias term is included in $\mathcal{I}_{0}$. As previously discussed in Sect. 1, the sign constraints can be tailored based on prior knowledge. If we have advance knowledge that the $h$th feature $x_h$ in the positive class tends to be larger than in the negative class, we can apply a non-negative constraint to $w_{h}$, and conversely, if the opposite relationship holds.

Hereinafter, we do not assume any non-positive constraints within the algorithmic description, as non-positive constraints can be effectively converted into non-negative constraints. This transformation involves negating the features $x_{h,i}$ for $h\in\mathcal{I}_{\le}$ (i.e., $x_{h,i}\leftarrow -x_{h,i}$), where $x_{h,i}$ represents the $h$th component in the $i$th input vector for training, denoted as $\boldsymbol{x}_{i}\in\mathbb{R}^{d}$. After the learning process, the corresponding weight $w_{h}$ is negated to reestablish weights that satisfy $w_{h}\le 0$.

Denoting the feasible region by $\mathcal{S}$, the constrained problem in (2) can be rewritten as

\[\begin{equation*} \begin{aligned} \text{min}\quad & P(\boldsymbol{w}) \quad \text{wrt } \boldsymbol{w}\in \mathcal{S}\subseteq \mathbb{R}^{d}. \end{aligned} \tag{4} \end{equation*}\]

To express $\mathcal{S}$ simply, we use $\boldsymbol{\sigma}\in\{1,0\}^{d}$ where its $h$th entry is given by $\sigma_{h}=1$ for $h\in\mathcal{I}_{\ge}$ and $\sigma_{h}=0$ for $h\in\mathcal{I}_{0}$. Then, the primal feasible region can be re-expressed as

\[\begin{equation*} \begin{aligned} \mathcal{S} := \{ \boldsymbol{w}\in\mathbb{R}^{d} \,|\, \forall h\in[d],\, \sigma_{h}w_{h}\ge 0 \}. \end{aligned} \tag{5} \end{equation*}\]

The optimization algorithm is based on SDCA framework that maximizes the Fenchel dual of the primal objective function. The Fenchel dual [1], say $D:\mathbb{R}^{n}\to\bar{\mathbb{R}}$, where $\bar{\mathbb{R}}:=\mathbb{R}\cup\{\pm \infty\}$, is expressed as

\[\begin{equation*} \begin{aligned} D(\boldsymbol{\alpha}) := -\frac{1}{2\lambda n^{2}} \left\lVert \boldsymbol{\pi} \left( \sum_{i=1}^{n} \alpha_{i}\boldsymbol{x}_{i} \right) \right\rVert^{2} - \frac{1}{n} \sum_{i=1}^{n} \phi_{i}^{*}(-\alpha_{i}), \end{aligned} \tag{6} \end{equation*}\]

where $\boldsymbol{\alpha}:=\left[\alpha_{1},\dots,\alpha_{n}\right]^\top\in\mathbb{R}^{n}$ is a dual variable vector, $\phi_{i}^{*}:\mathbb{R}\to\bar{\mathbb{R}}$ is the convex conjugate of $\phi_{i}$, and $\boldsymbol{\pi}(\boldsymbol{v}):=\boldsymbol{v}-\max(\boldsymbol{0},-\boldsymbol{\sigma}\odot\boldsymbol{v})$. Therein, the operator $\odot$ represents the Hadamard product. The vector-valued function $\boldsymbol{\pi}:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is the projection operator onto $\mathcal{S}$. If we denote by $\pi_{h}(\boldsymbol{v})$ the $h$th entry of $\boldsymbol{\pi}(\boldsymbol{v})$, it holds that $\pi_{h}(\boldsymbol{v})=0$ if $h\in\mathcal{I}_{\ge}$ and $v_{h}\le 0$; otherwise $\pi_{h}(\boldsymbol{v})=v_{h}$. The derivation of the dual function $D(\boldsymbol{\alpha})$ is given in Appendix A.

Once the maximizer of $D(\boldsymbol{\alpha})$, denoted by $\boldsymbol{\alpha}_{\star}:=\left[\alpha_{1}^{\star},\dots,\alpha_{n}^{\star}\right]^\top$, is found, the optimal solution to the primal problem (4) can be recovered by

\[\begin{equation*} \begin{aligned} \boldsymbol{w}_{\star} = \frac{1}{\lambda n}\boldsymbol{\pi}\left[\sum_{i=1}^{n}\alpha_{i}^{\star}\boldsymbol{x}_{i}\right]. \end{aligned} \tag{7} \end{equation*}\]

The loss function $\phi_{i}$ is assumed to be $1/\gamma$-smooth (i.e. $\nabla\phi_{i}$ is $1/\gamma$-Lipschitz continuous). For example, the log loss is 0.25-smooth. The quadratic hinge loss defined as

\[\begin{equation*} \begin{aligned} \phi_{i}(s) := \frac{1}{2}(\max\{ 0, 1-y_{i}s \})^2 \end{aligned} \tag{8} \end{equation*}\]

and the smoothed hinge loss defined as

\[\begin{equation*} \begin{aligned} \phi_{i}(s) := \begin{cases} \frac{1- 2y_{i} s }{2} &\quad \text{for } s<0, \\ \frac{1}{2} (\max\{ 0, 1 - y_{i} s \})^2 &\quad {\text{for }} s\ge 0 \end{cases} \end{aligned} \tag{9} \end{equation*}\]

are both 1-smooth. The convex conjugates of $(1/\gamma)$-smooth convex functions are a $\gamma$-strongly convex function [12]. That is, $\forall \eta\in[0,1]$,

\[\begin{equation*} \begin{aligned} & \eta \phi_{i}^{*}(-u) + (1-\eta) \phi_{i}^{*}(-\alpha) \\ & \ge \phi_{i}^{*}(-\eta u - (1-\eta)\alpha ) + \frac{\gamma}{2}(u-\alpha)^{2}(1-\eta)\eta. \end{aligned} \tag{10} \end{equation*}\]

The SDCA framework uses the above inequality rearranged as

\[\begin{equation*} \begin{aligned} & \phi_{i}^{*}(-\alpha)-\phi_{i}^{*}(-\alpha -\eta (u - \alpha) ) \\ & \ge (\phi_{i}^{*}(-\alpha) - \phi_{i}^{*}(-u))\eta + \frac{\gamma}{2}(u-\alpha)^{2}(1-\eta)\eta. \end{aligned} \tag{11} \end{equation*}\]

Page top

4. SDCA Framework

SDCA updates only one randomly selected entry in the dual variable vector $\boldsymbol{\alpha}$ at every iteration. Let $i$ be the index of the selected entry in $\boldsymbol{\alpha}$. Denote by $\Delta\alpha$ the difference of the randomly selected entry from the previous value: $\alpha_{i}^{(t)}:=\alpha_{i}^{(t-1)}+\Delta\alpha$. For $i'\in [n]\setminus\{i\}$, the values of the dual variables are unchanged (i.e. $\alpha_{i'}^{(t)}:=\alpha_{i'}^{(t-1)}$). Let

\[\begin{equation*} \begin{aligned} & \bar{\boldsymbol{w}}^{(t)}:=\frac{1}{\lambda n}\sum_{i'=1}^{n}\alpha_{i'}^{(t)}\boldsymbol{x}_{i'}. \quad\text{ and } \\ & \boldsymbol{w}^{(t)}:=\boldsymbol{\pi}\left[\bar{\boldsymbol{w}}^{(t)}\right]. \end{aligned} \tag{12} \end{equation*}\]

Once $\Delta\alpha$ is determined in each iteration, this vector $\bar{\boldsymbol{w}}^{(t)}$ can be updated with $O(d)$ costs, which can be seen by

\[\begin{equation*} \begin{aligned} \bar{\boldsymbol{w}}^{(t)} &= \frac{\alpha_{i}^{(t-1)}+\Delta \alpha}{\lambda n}\boldsymbol{x}_{i} + \frac{1}{\lambda n}\sum_{i'\in[n]\setminus\{i\}}^{n}\alpha_{i'}^{(t-1)}\boldsymbol{x}_{i'} \\ &= \bar{\boldsymbol{w}}^{(t-1)}+\frac{\Delta \alpha}{\lambda n}\boldsymbol{x}_{i}. \end{aligned} \tag{13} \end{equation*}\]

For the simplicity of notation, we here shall drop the superscript $(t-1)$, to denote

\[\begin{equation*} \begin{aligned} \boldsymbol{w} := \boldsymbol{w}^{(t-1)}, \quad \boldsymbol{v}_{0} := \bar{\boldsymbol{w}}^{(t-1)}, \quad \text{and} \quad \boldsymbol{\alpha} := \boldsymbol{\alpha}^{(t-1)}. \end{aligned} \tag{14} \end{equation*}\]

It is ideal to choose the maximizer of the function:

\[\begin{equation*} \begin{aligned} J_{t}^{0}(\Delta\alpha) & := D(\boldsymbol{\alpha}+\Delta\alpha\boldsymbol{e}_{i}) - D(\boldsymbol{\alpha}) \\ &= \frac{\lambda}{2} \left\lVert\boldsymbol{\pi}\left[\boldsymbol{v}_{0}\right]\right\rVert^{2} - \frac{\lambda}{2} \left\lVert\boldsymbol{\pi}\left[\boldsymbol{v}_{0}+\frac{\Delta\alpha}{\lambda n}\boldsymbol{x}_{i}\right]\right\rVert^{2} \\ & \quad + \frac{1}{n} \left( \phi_{i}^{*}(-\alpha) - \phi_{i}^{*}(-\alpha-\Delta\alpha) \right) \end{aligned} \tag{15} \end{equation*}\]

where $\boldsymbol{e}_{i}$ is the unit vector with $i$th entry one. Since $\Delta\alpha$ is still in the argument of $\phi_{i}^{*}$, finding the optimal $\Delta\alpha$ is complicated in general. To obtain a closed-form update rule, the range of $\Delta\alpha$ is restricted such that

\[\begin{equation*} \begin{aligned} \eta := -\frac{\Delta\alpha}{\alpha_{i}+\nabla\phi(\left<\boldsymbol{w},\boldsymbol{x}_{i}\right>)} \in [0,1] \end{aligned} \tag{16} \end{equation*}\]

if $\alpha_{i}+\nabla\phi(\left<\boldsymbol{w},\boldsymbol{x}_{i}\right>)\ne 0$; otherwise $\Delta\alpha:=0$. Hereinafter, we discuss only the non-trivial case of $u:=-\nabla\phi(\left<\boldsymbol{w},\boldsymbol{x}_{i}\right>)\ne \alpha_{i}$, where $\alpha_{i}$ is the $i$th entry in $\boldsymbol{\alpha}^{(t-1)}$. Then, $\Delta\alpha=q\eta$ where $q:=u-\alpha_{i}$. Let

\[\begin{equation*} \begin{aligned} \boldsymbol{v}_{q} := \frac{q}{\lambda n}\boldsymbol{x}_{i}. \end{aligned} \tag{17} \end{equation*}\]

By applying the inequality (11), $J_{t}^{0}(q\eta)$ is bounded from below as

\[\begin{equation*} \begin{aligned} J_{t}^{0}(q\eta) &= \frac{\lambda}{2} \left\lVert\boldsymbol{\pi}\left[\boldsymbol{v}_{0}\right]\right\rVert^{2} - \frac{\lambda}{2} \left\lVert\boldsymbol{\pi}\left[\boldsymbol{v}_{0}+\eta\boldsymbol{v}_{q}\right]\right\rVert^{2} \\ &\quad + \frac{1}{n} \left( \phi^{*}(-\alpha_{i}) - \phi^{*}(-\alpha_{i}-q\eta) \right) \\ &\ge \frac{\lambda}{2} \left\lVert\boldsymbol{\pi}\left[\boldsymbol{v}_{0}\right]\right\rVert^{2} - \frac{\lambda}{2} \left\lVert\boldsymbol{\pi}\left[\boldsymbol{v}_{0}+\eta\boldsymbol{v}_{q}\right]\right\rVert^{2} \\ & \quad + a_{\text{offs}}\eta^{2} + b_{\text{offs}} \eta =: J_{t}^{1}(\eta) \end{aligned} \tag{18} \end{equation*}\]

where

\[\begin{equation*} \begin{aligned} & a_{\text{offs}} := - \frac{q^{2}\gamma}{2n}, \quad \text{ and} \\ & b_{\text{offs}} := \frac{\phi^{*}(-\alpha) - \phi^{*}(-u) + 0.5q^{2}\gamma}{n}. \end{aligned} \tag{19} \end{equation*}\]

The lower bound $J_{t}^{1}$ is more amenable than $J_{t}^{0}$ because no loss function appears in $J_{t}^{1}$ any more. The SDCA for learning under sign constraints is summarized in Algorithm 1. The exponential convergence of SDCA is still guaranteed even if $J_{t}^{1}$ is maximized instead of $J_{t}^{0}$ [27].

Theorem 1: Let $R:=\max_{i\in[n]}\lVert\boldsymbol{x}_{i}\rVert$, $h_{\text{P}}^{(t)}:=P(\boldsymbol{w}^{(t)})-P(\boldsymbol{w}_{\star})$ and $h_{\text{D}}^{(t)}:=D(\boldsymbol{\alpha}_{\star})-D(\boldsymbol{\alpha}^{(t)})$. For any $\epsilon_{\text{P}}>0$, it holds that $\mathbb{E}\left[h_{\text{P}}^{(t)}\right]\le \epsilon_{\text{P}}$ if Algorithm 1 is run for

\[\begin{equation*} \begin{aligned} t \ge \frac{\lambda n \gamma + R^{2}}{\lambda \gamma} \log\left( \frac{h_{\text{D}}^{(0)}}{\epsilon_{\text{P}}} \frac{\lambda n \gamma + R^{2}}{\lambda \gamma} \right). \end{aligned} \tag{20} \end{equation*}\]

The proof of this theorem is given in Appendix B. The bound of the vanilla SDCA algorithm is essentially same as the above bound. In the original bound presented in [27], $h_{D}^{(0)}$ is replaced to $1$ by assuming that $\boldsymbol{\alpha}^{(0)}=\boldsymbol{0}$ and $\phi_{i}(0)\le 1$. In [27], an idea for using the hot-starting is discussed. This idea can also be applied to the proposed algorithm. In this case, the initial dual variables may be non-zero with high probability.

In the next section, how to implement Line 5 in Algorithm 1 shall be discussed.

Page top

5. Implementations for SDCA Iteration

In this section, an algorithm for finding the maximizer of $J_{t}^{1}(\eta)$ is presented. A key ingredient found in this study is the fact that $J_{t}^{1}$ is a piecewise concave quadratic function. This finding enabled us to develop efficient algorithms for the update rule. Below, an explicit form of the piecewise quadratic function shall be presented (Sect. 5.1), followed by descriptions of two algorithms to find the maximizer of $J_{t}^{1}(\eta)$ (Sects. 5.2 and 5.3).

5.1 Piecewise Quadratic Form

Denote by $v_{h,0}$ and $v_{h,q}$ the $h$th entries of $\boldsymbol{v}_{0}$ and $\boldsymbol{v}_{q}$, respectively. Let $\mathcal{I}_{0}:=\{h\in[d]\,|\,\sigma_{h}=0\}$. Define $\boldsymbol{ heta} := \left[ \theta_{1},\dots,\theta_{d_{t}},\theta_{d_{t}+1}\right]^\top$ such that $0=\theta_{1}<\dots<\theta_{d_{t}+1}=1$ where $\theta_{1},\dots,\theta_{d_{t}},\theta_{d_{t}+1}$ are the elements of a set $\Theta\subset\mathbb{R}$ such as $\text{Card}[\Theta]=d_{t}+1$ defined as

\[\begin{equation*} \begin{aligned} \Theta := & \{0,1\}\cup \big\{ \theta\in(0,1) \\ & \,\big|\, \exists h\in\mathcal{I}_{\ge} \text{ s.t.$\ $} v_{h,0} = -\theta v_{h,q} \ne 0 \big\}. \end{aligned} \tag{21} \end{equation*}\]

The element $\theta_{k}$ for $k\in\{2,\dots,d_{t}\}$ is the position at which for some $h\in\mathcal{I}_{\ge}$ the affine function $\eta\mapsto v_{h,0}+\eta v_{h,q}$ crosses the horizontal axis. Figure 1 shows a numerical example of the affine functions where $d=3$, $\mathcal{I}_{\ge} = \{ 1,2,3\}$, $\boldsymbol{v}_{0} = \left[ 0.5, 0.75, -0.5 \right]^\top$, and $\boldsymbol{v}_{q} = \left[ 0.5, -1, 1 \right]^\top$. From the definition of $\Theta$, we have $d_{t}=3$, $\theta_{1}=0$, $\theta_{2}=0.5$, $\theta_{3}=0.75$, and $\theta_{4}=1$. It is observed that the affine function $\eta\mapsto v_{3,0}+\eta v_{3,q}$ crosses the horizontal axis at $\eta=\theta_{2}$, and the affine function $\eta\mapsto v_{2,0}+\eta v_{2,q}$ crosses the horizontal axis at $\eta=\theta_{3}$.

Fig. 1 Example of affine functions $\eta\mapsto v_{h,0}+\eta v_{h,q}$.

Let us define index sets, for $k\in[d_{t}]$,

\[\begin{equation*} \begin{aligned} \mathcal{H}_{k} := \mathcal{I}_{0}\cup \big\{ & h\in\mathcal{I}_{\ge} \,\big|\, \\ & 2 v_{h,0} + (\theta_{k}+\theta_{k+1})v_{h,q} > 0 \big\}. \end{aligned} \tag{22} \end{equation*}\]

In the case of the example depicted in Fig. 1, the index sets are $\mathcal{H}_{1}=\{1,2\}$, $\mathcal{H}_{2}=\{1,2,3\}$, and $\mathcal{H}_{3}=\{1,3\}$, from the definition (22). For $h\in\mathcal{H}_{k}\cap\mathcal{I}_{\ge}$, the affine functions $\eta\mapsto v_{h,0}+\eta v_{h,q}$ are over the horizontal axis. Namely, it holds that

\[\begin{equation*} \begin{aligned} \forall\eta\in(\theta_{k},\theta_{k+1}), \quad \forall h\in\mathcal{H}_{k}, \quad v_{h,0}+\eta v_{h,q} > 0, \end{aligned} \tag{23} \end{equation*}\]

which leads to $\forall\eta\in(\theta_{k},\theta_{k+1})$,

\[\begin{equation*}\begin{aligned}[\boldsymbol{\pi}(\boldsymbol{v}_{0}+\eta\boldsymbol{v}_{q})]_{h} = \begin{cases} v_{h,0}+\eta v_{h,q} & \quad\text{ for }h \in \mathcal{H}_{k}, \\ 0 & \quad\text{ for }h \notin \mathcal{H}_{k} \end{cases} \end{aligned} \tag{24} \end{equation*}\]

where $[\boldsymbol{\pi}(\boldsymbol{v}_{0}+\eta\boldsymbol{v}_{q})]_{h}$ is the $h$th entry in the $d$-dimenstional vector $\boldsymbol{\pi}(\boldsymbol{v}_{0}+\eta\boldsymbol{v}_{q})$. The vector $\boldsymbol{ heta}$ and the sets $\mathcal{H}_{k}$ for $k\in[d_{t}]$ result in a piecewise quadratic expression for the function $J^{1}_{t}$:

\[\begin{equation*} \begin{aligned} \forall \eta\in[\theta_{k},\theta_{k+1}], \quad J^{1}_{t}(\eta) = a_{k}\eta^{2} + b_{k}\eta \end{aligned} \tag{25} \end{equation*}\]

where $a_{k}$ and $b_{k}$ are given by

\[\begin{equation*} \begin{aligned} & a_{k} = a_{\text{offs}} - \frac{\lambda}{2} \sum_{h\in\mathcal{H}_{k}} v_{h,q}^{2}, \quad \text{and} \\ & b_{k} = b_{\text{offs}} -\lambda \sum_{h\in\mathcal{H}_{k}} v_{h,q}v_{h,0}. \end{aligned} \tag{26} \end{equation*}\]

5.2 $O(d^{2})$ Implementation

Due to the concavity and the differentiability of $J_{t}^{1}$, one of the maximizers of $J_{t}^{1}(\eta)$, denoted by $\eta_{\star}$, can be found as follows.

If $\nabla J_{t}^{1}(0) = b_{1} \le 0$, then $\eta_{\star}=0$;
if $\nabla J_{t}^{1}(1) = 2a_{d_{t}+1} + b_{d_{t}+1} \ge 0$, then $\eta_{\star}=1$;
otherwise, there exists $k_{\star}\in [d_{t}]$ such that the interval $[\theta_{k_{\star}},\theta_{k_{\star}+1}]$ contains a maximizer $\eta_{\star} = -0.5 b_{k_{\star}}/a_{k_{\star}}$.

The interval index $k_{\star}$ in the third case (i.e. $2a_{d_{t}+1} + b_{d_{t}+1} < 0 < b_{1}$) can be found by checking every interval, because it holds that $\nabla J_{t}^{1}(\theta_{k_{\star}}) \ge 0 \ge \nabla J_{t}^{1}(\theta_{k_{\star}+1})$ due to the differentiability of $J_{t}^{1}$. Combining this discussion and the aforementioned observations, each iteration of SDCA can be implemented as follows.

1. Pick $i\in[n]$ at random; $\hskip36mm O(1)$.

2. Compute $\boldsymbol{v}_{0}$ and $\boldsymbol{v}_{q}$; $\hskip41mm O(d)$.

3. Determine $\Theta$; $\hskip52.75mm O(d)$.

4. Sort the elements in $\Theta$; $\hskip36.75mm O(d\log d)$.

5. Compute $\mathcal{H}_{k}$ for $k\in[d_{t}]$; $\hskip31.75mm O(d^{2})$.

6. Compute coefficients $(a_{k},b_{k})$ for $k\in[d_{t}]$; $\quad O(d^{2})$.

7. Find the maximizer $\eta_{\star}$; $\hskip37.5mm O(d)$.

8. $\Delta\alpha = q\eta_{\star}$; $\hskip55.25mm O(1)$.

9. Compute $\bar{\boldsymbol{w}}^{(t)}$ by (13); $\hskip38mm O(d)$.

This implementation enables each iteration to run within $O(d^{2})$ computational cost. The most heavy steps in this implementation are the step computing index sets $\mathcal{H}_{k}$ (i.e. Step 5) and the step computing coefficients $a_{k},b_{k}$ (i.e. Step 6), both of which pays $O(d^{2})$ cost. These time complexities are derived as follows. Observe that the number of pieces of the piecewise quadratic function is bounded as $d_{t}\le \text{Card}(\mathcal{I}_{\ge})+2 \le d+2 = O(d)$. For $k\in[d_{t}]$, each $\mathcal{H}_{k}$ is computed with $O(d)$ time since $\mathcal{H}_{k}\subseteq [d]$. Hence, it is proved that the time complexity of Step 5 is $O(d^{2})$. Since $\text{Card}(\mathcal{H}_{k})=O(d)$, computation of $2 d_{t} (=O(d))$ coefficients, $a_{1},b_{1},\dots,a_{d_{t}},b_{d_{t}}$, using (26) consumes $O(d^{2})$ cost in total.

Meanwhile, we found another algorithm that cut down the time complexity to a linear cost if ignoring the logarithmic term. The linear-time algorithm shall be presented below.

5.3 $O(d\log d)$ Implementation

Here, another algorithm that exactly maximizes $J_{t}^{1}(\eta)$ with respect to $\eta\in[0,1]$ is presented. The theoretical time complexity of the algorithm given in Sect. 5.2 is $O(d^{2})$, whereas the time complexity of the algorithm presented below is reduced to $O(d\log d)$. Defining

\[\begin{equation*} \begin{aligned} \mathcal{H}_{k,\text{in}} := \mathcal{H}_{k}\setminus\mathcal{H}_{k-1} \quad \text{ and } \quad \mathcal{H}_{k,\text{out}} := \mathcal{H}_{k-1}\setminus\mathcal{H}_{k} \end{aligned} \tag{27} \end{equation*}\]

allows us to recursively express the coefficients of the piecewise quadratic functions as $\forall k\ge 2$,

\[\begin{equation*} \begin{aligned} & a_{k} := a_{k-1} - \frac{\lambda}{2} \sum_{h\in\mathcal{H}_{k,\text{out}}} v_{h,q}^{2} + \frac{\lambda}{2} \sum_{h\in\mathcal{H}_{k,\text{in}}} v_{h,q}^{2}, \\ & b_{k} := b_{k-1} - \lambda \sum_{h\in\mathcal{H}_{k,\text{out}}} v_{h,q}v_{h,0} + \lambda \sum_{h\in\mathcal{H}_{k,\text{in}}} v_{h,q}v_{h,0}, \quad \end{aligned} \tag{28} \end{equation*}\]

To use (28) to compute $a_{k}$ and $b_{k}$, the sets $\mathcal{H}_{k,\text{in}}$ and $\mathcal{H}_{k,\text{out}}$ as well as $\mathcal{H}_{1}$ are required beforehand. The set $\mathcal{H}_{1}$ can be obtained within $O(d)$ by checking whether one of the following three conditions is satisfied:

\[\begin{equation*} \begin{aligned} & \text{(i) } h\in\mathcal{I}_{0}; \quad \text{(ii) } v_{h,0} > 0; \\ & \text{(iii) } v_{h,0} = 0 \quad\text{ and }\quad v_{h,q} > 0. \end{aligned} \tag{29} \end{equation*}\]

Namely, if $h\in[d]$ satisfies one of the three above conditions, then $h\in\mathcal{H}_{1}$; otherwise $h\notin \mathcal{H}_{1}$. We now discuss how to compute $\mathcal{H}_{k,\text{in}}$ and $\mathcal{H}_{k,\text{out}}$. To this end, we first compute $\tilde{\theta}^{\circ}_{h}:=-\frac{v_{h,0}}{v_{h,q}}$ for all $h\in\mathcal{I}_{\ge}$. The $(d_{t}-1)$ end points $\theta_{2},\dots,\theta_{d_{t}}$ are then obtained by sorting the values of $\tilde{\theta}^{\circ}_{h}$, eliminating the values outside the open interval $(0,1)$, and excluding duplicate values. During the process for computing $\boldsymbol{ heta}$, the sets $\mathcal{H}_{k,\text{in}}$ and $\mathcal{H}_{k,\text{out}}$ for $k\in\{2,\dots,d_{t}\}$ can be computed simultaneously as

\[\begin{equation*} \begin{aligned} \mathcal{H}_{k,\text{in}} &= \left\{ h \in\mathcal{I}_{\ge} \,\middle|\, \theta_{k} = \tilde{\theta}^{\circ}_{h}, \, v_{h,q} > 0 \right\}, \quad\text{and} \\ \mathcal{H}_{k,\text{out}} &= \left\{ h \in\mathcal{I}_{\ge} \,\middle|\, \theta_{k} = \tilde{\theta}^{\circ}_{h}, \, v_{h,q} < 0 \right\}. \end{aligned} \tag{30} \end{equation*}\]

From these discussions, the $O(d^{2})$ implementation given in Sect. 5.2 can be modified as follows.

1. Pick $i\in[n]$ at random; $\hskip39mm O(1)$.

2. Compute $\boldsymbol{v}_{0}$ and $\boldsymbol{v}_{q}$; $\hskip44mm O(d)$.

3. Compute $\mathcal{H}_{1}$; $\hskip55.5mm O(d)$.

4. Compute $\tilde{\theta}^{\circ}_{h}$ for $h\in\mathcal{I}_{\ge}$; $\hskip36.5mm O(d)$.

5. Compute $(\mathcal{H}_{k,\text{in}},\mathcal{H}_{k,\text{in}})$ and $\theta_{k}$ for $k\in[d_{t}]$; $\quad O(d\log d)$.

6. Compute coefficients $(a_{k},b_{k})$ for $k\in[d_{t}]$; $\hskip7.5mm O(d)$.

7. Find the maximizer $\eta_{\star}$; $\hskip40.5mm O(d)$.

8. $\Delta\alpha = q\eta_{\star}$; $\hskip58.25mm O(1)$.

9. Compute $\bar{\boldsymbol{w}}^{(t)}$ by (13); $\hskip41mm O(d)$.

Step 5 takes $O(d\log d)$ cost for sorting $\tilde{\theta}^{\circ}_{h}$ because the number of values to be sorted is $\text{Card}(\mathcal{I}_{\ge})=O(d)$. The computational cost for Line 6 is $O(d)$ since the relationship

\[\begin{equation*} \begin{aligned} \bigcup_{k=2}^{d_{t}} \mathcal{H}_{k,\text{in}} \subseteq \mathcal{I}_{+} \subseteq [d] \quad\text{ and }\quad \bigcup_{k=2}^{d_{t}} \mathcal{H}_{k,\text{out}} \subseteq \mathcal{I}_{+} \subseteq [d] \end{aligned} \tag{31} \end{equation*}\]

leads to the fact that an upper bound of the number of the total terms in (28) for all $k\in\{2,\dots,d_{t}\}$ is $4d$. Thus, it can be shown that each iteration of SDCA can be done within $O(d\log d)$ computation.

Page top

6. Experiments

6.1 Pattern Recognition Performance

We conducted experiments to demonstrate the effects of the sign constraints on the pattern recognition performance. For a pattern recognition task, we selected the microbiological water quality analysis. We used four water quality datasets named Sapporo, NY top, NY bottom, and Indian. The dataset Sapporo was provided in the supplement of [16], and contained $n_{\text{tot}}:=177$ examples, each consisting of a target variate E.coli and nine feature variates WT, pH, EC, DO, SS, BOD, TN, TP, and FR, where the abbreviations are referred to [16]. The task was to predict whether E.coli exceeds 500 MPN or not. Then, 88 positive examples and 89 negative examples were obtained. NY top, NY bottom and Indian were provided by kaggle.com. The three datasets contain 534, 523, and 896 examples, respectively. Each example has a target variate FC and five feature variates. The positive and negative class variates, respectively, were given to FC over and below the median, to pose a binary classification problem. The sign constraints were imposed as described in Table 1. Three loss functions, the smoothed hinge loss, the quadratic hinge loss, and the log loss, were examined. For each loss function, the conventional learning and the sign-constrained learning were performed. Then, six linear predictors were obtained in total. Accuracy (i.e. the sum of true positives and true negatives over the size of testing dataset) was used for the performance criterion. The number of training examples, $n$, was varied from 5 to 100. The $n$ examples were picked at random from each dataset in a stratified manner. The $n$ examples were fed to the six learning methods to get six predictors. The remaining $(n_{\text{tot}}-n)$ examples were used for testing. This procedure was repeated 200 times.

Table 1 Features and sign constraints for four datasets. Check-marked features are contained in the corresponding dataset. Sign constraints were given as described in the rightmost column where ‘$\ge 0$’ and ‘$\le 0$’ means the non-negative and non-positive constraints, respectively.

The averages of the 200 accuracies obtained for the four water quality datasets were plotted against the size of the training dataset, say $n$, in Fig. 2, Fig. 3, Fig. 4, and Fig. 5, respectively. For all four datasets and all three loss functions, the sign constraints improved the prediction performance. In particular, the improvement was more significant when training examples were fewer. Sign constraints represent a sort of the domain-specific prior knowledge, and explicitly prevent the learning from violating the prior knowledge. Without sign constraints, when the sample size is small, the signs of weights in linear predictors may often be flipped from the true signs of correlations between the features and the class label. The improvement of the generalization performance must be the effect of the sign constraints that avoided the inversion of the weight signs.

Fig. 2 Prediction performance on Sapporo dataset. The solid curves indicate the accuracies of predictors optimized under sign constraints, and the dashed curves are those the accuracies when sign constraints are not employed. The markers of the squares, the upward-pointing triangles, and the downward-pointing triangles represents the smoothed hinge loss, the quadratic hinge loss, and the log loss, respectively.

Fig. 3 Prediction performance on NY top dataset.

Fig. 4 Prediction performance on NY bottom dataset.

Fig. 5 Prediction performance on Indian dataset.

6.2 Runtimes of $O(d\log d)$ and $O(d^{2})$ Algorithms

The two algorithms, presented in Sects. 5.2 and 5.3, were implemented with Cython 3.0.0a11 and run on a Linux machine equipped with Core i7-12700K and two 16GB DDR4 SDRAM. Feature vectors were generated with uniform distribution $U(-1,1)$ and normalized. Binary class labels were generated at random with equal probabilities. The size of training examples was fixed to $n=500$. The number of features was varied from $d=10^{3}$ to $10^{5}$.

Figure 6 shows the runtimes consumed in one epoch. In conflict with the theoretical analysis, the two algorithms seemed to have the same time complexity, and the $O(d^{2})$ algorithm always ran faster than the $O(d\log d)$ algorithm. To analyze why the inconsistency between the theory and the actual runtime happened, the numbers of pieces in the piecewise quadratic functions $J_{t}^{1}(\eta)$, say $d_{t}$, were counted, where we set $d_{t}=0$ after the convergence ($P(\boldsymbol{w}(\boldsymbol{\alpha}^{(t)}))-D(\boldsymbol{\alpha}^{(t)})<10^{-6}$). The average numbers within each epoch were reported in Table 2. It was observed that the average numbers of pieces were around 1% of the number of features at the first epoch, and were less than 2.5 after the first epoch. It suggested that the actual number of pieces was much smaller than the number of features. Nevertheless, in our theoretical analysis, we used $d_{t}=O(d)$ which is based on a loose bound $d_{t}\le\text{Card}(\mathcal{I}_{\ge})\le d$, resulting in the theoretical time complexity $O(d^{2})$ for the implementation presented in Sect. 5.2. The difference of the upper bound from the actual number of pieces yielded the inconsistency between the theory and the actual runtime.

Fig. 6 Runtime for one epoch.

Table 2 Average number of pieces in the piecewise quadratic function $J_{t}^{1}:[0,1]\to\mathbb{R}$.

6.3 Convergence Performance

To assess the rapid convergence of the proposed SDCA algorithm, we conducted convergence analysis on three distinct datasets: Magic04, Segment, and Waveform. We then compared the performance of our algorithm with projected stochastic gradient descent (PSGD), updating the solution using the equation $\boldsymbol{w}^{(t+1)}:=\boldsymbol{\pi}(\boldsymbol{w}^{(t)}-\eta\nabla P(\boldsymbol{w}^{(t)}))$, where $\eta$ represents the step size.

The dataset characteristics are as follows: Magic04 contains 19,024 examples with 10 features, Segment has 2,310 examples with 19 features, and Waveform comprises 5,000 examples with 21 features. We experimented with various step sizes for PSGD, specifically $\eta=10^{0}$, $\eta=10^{-1}$, $\eta=10^{-2}$, and $\eta=10^{-3}$. The regularization constant was set to $\lambda=1/n$. In our optimization process, we employed the log-loss function $\phi_{i}$. We imposed non-negative constraints on half of the randomly selected features and non-positive constraints on the remaining half. Objective errors were monitored at each iteration. It is worth noting that assessing the true primal objective error $P(\boldsymbol{w}_{\star})$ is often impractical due to its unknown nature. To approximate this error, we executed the proposed algorithm and evaluated the dual objective value at the 1000th iteration, denoted as $T'$. While $D(\boldsymbol{\alpha}^{(T')})$ may slightly underestimate the true minimal primal objective value, we considered the quantity $P(\boldsymbol{w})-D(\boldsymbol{\alpha}^{(T')})$ for assessing the primal objective error. In all the experiments we report, we consistently observed that $P(\boldsymbol{w}^{(T')})-D(\boldsymbol{\alpha}^{(T')})$ remained below $10^{-7}$ when using the proposed algorithm, ensuring that the absolute difference between the true and approximate primal objective errors was bounded by $10^{-7}$.

In Fig. 7, we present three panels that illustrate the evolution of primal objective errors throughout multiple epochs for the datasets Magic04, Segment, and Waveform. The proposed algorithm achieved an accuracy of $10^{-5}$ after approximately 1.9, 2.7, and 3.7 epochs for the respective datasets. In contrast, PSGD did not reach this level of accuracy when using step sizes of $\eta=10^{0}$ and $\eta=10^{-3}$ for any of the datasets, due to step sizes being excessively large and small, respectively. With a step size of $\eta=10^{-2}$, PSGD eventually attained the $10^{-5}$ accuracy level, but only after a considerable number of epochs ― 429, 306, and 536 for the three datasets, respectively. These numbers of epochs were over 100 times greater than those needed by the proposed algorithm. Moreover, with a step size of $\eta=10^{-1}$, PSGD reached the $10^{-5}$ accuracy level after 79 and 109 epochs for Magic04 and Segment, respectively. However, for Waveform, PSGD did not achieve this accuracy level even after more than $10^{3}$ epochs. These findings strongly support the conclusion that the proposed algorithm converges to the optimal solution significantly faster than PSGD.

Fig. 7 Convergence behaviors on three datasets. Black solid curve: proposed algorithm, dashed curves: PSGD. Red, blue, green and yellow dash curves are obtained with step sizes $10^{0}$, $10^{-1}$, $10^{-2}$, $10^{-3}$, respectively.

In our implementation, the average time per epoch for the proposed algorithm across the three datasets is 0.488, 0.0556, and 0.146 seconds, respectively. In comparison, PSGD averages 0.141, 0.0231, and 0.0428 seconds per epoch. While the per-epoch computation time of our algorithm exceeds that of PSGD, it compensates by requiring significantly fewer epochs to achieve an accurate solution. This efficiency results in a reduced overall time to reach an accurate solution compared to PSGD.

Page top

7. Conclusions

In this paper, new algorithms for ERM under the sign constraints were presented. Tajima et al. developed the Frank-Wolfe optimization algorithm for learning SVM under sign constraints. The algorithm developed in this study extends the class of ERM problems so that an arbitrary smooth and convex loss function can be employed. The optimization algorithm is based on the SDCA framework, which inherits a favorable property that guarantees the exponential convergence which is superior to the convergence rate of the Frank-Wolfe algorithm. The effects of the sign constraints on the pattern recognition were demonstrated with simulation experiments on microbiological water quality analysis using real-world data. Actual runtimes of the two SDCA algorithms developed in this study were compared, which suggested that the simpler $O(d^{2})$ algorithm runs fast enough compared to the $O(d\log d)$ algorithm.

We expect that a significant untapped reservoir of pre-existing knowledge, amenable to sign constraints, holds promise for numerous applications but remains yet to be fully explored. Tajima et al. identified a viable application in the realm of bioinformatics [29], while Kato et al. discovered its utility in the field of water engineering [16]. In this paper, we revisited the water engineering problem to exemplify the efficacy of sign constraints. Subsequent research endeavors will involve exploring additional domains amenable to sign constraints.

Page top

Acknowledgments

This work was supported by the Environment Research and Technology Development Fund (5-2006) of the Environmental Restoration and Conservation Agency of Japan and JSPS KAKENHI Grant Number 22K04372.

Page top

References

[1] D.P. Bertsekas, Nonlinear Programming, Athena Scientific, 1999.

[2] M. Bierlaire, Ph.L. Toint, and D. Tuyttens, “On iterative algorithms for linear least squares problems with bound constraints,” Linear Algebra and its Applications, vol.143, pp.111-143, Jan. 1991. doi: 10.1016/0024-3795(91)90009-l.
CrossRef

[3] Y. Cai, H. Gu, and T. Kenney, “Learning microbial community structures with supervised and unsupervised non-negative matrix factorization,” Microbiome, vol.5, no.110, Aug. 2017.
CrossRef

[4] Aaron Defazio, Francis Bach, and Simon Lacoste-julien, “Saga: A fast incremental gradient method with support for non-strongly convex composite objectives,” In Z. Ghahramani, M. Welling, C. Cortes, N.d. Lawrence, and K.q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pp.1646-1654. Curran Associates, Inc., 2014.

[5] C. Ding, T. Li, W. Peng, and H. Park, “Orthogonal nonnegative matrix tri-factorizations for clustering,” Proc. 12th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD06. ACM Press, 2006. doi:10.1145/1150402.1150420.
CrossRef

[6] W. Dong, F. Fu, G. Shi, X. Cao, J. Wu, G. Li, and X. Li, “Hyperspectral image super-resolution via non-negative structured sparse representation,” IEEE Trans. Image Process., vol.25, no.5, pp.2337-2352, May 2016.
CrossRef

[7] K. Fernandes and J.S. Cardoso, “Hypothesis transfer learning based on structural model similarity,” Neural Computing and Applications, vol.31, no.8, pp.3417-3430, Nov. 2017. doi:10.1007/s00521-017-3281-4.
CrossRef

[8] C. Févotte and J. Idier, “Algorithms for nonnegative matrix factorization with the β-divergence,” Neural Computation, vol.23, no.9, pp.2421-2456, Sep. 2011. doi:10.1162/neco_a_00168.
CrossRef

[9] T. Hastie, R. Tibshirani, and J. Friedman, “The Elements of Statistical Learning - Data Mining, Inference, and Prediction, Springer, 2nd edition, 2009.

[10] R. He, W.-S. Zheng, B.-G. Hu, and X.-W. Kong, “Two-stage nonnegative sparse representation for large-scale face recognition,” IEEE Trans. Neural Netw. Learn. Syst., vol.24, no.1, pp.35-46, Jan. 2013.
CrossRef

[11] S. Henrot, S. Moussaoui, C. Soussen, and D. Brie, “Edge-preserving nonnegative hyperspectral image restoration,” 2013 IEEE Int. Conf. Acoustics, Speech and Signal Processing. IEEE, May 2013. doi: 10.1109/icassp.2013.6637926.
CrossRef

[12] J.-B. Hiriart-Urruty, “Fundamentals of Convex Analysis,” Springer, 2001.

[13] M. Jaggi, “Revisiting Frank-Wolfe: Projection-free sparse convex optimization, In Sanjoy Dasgupta and David McAllester, editors, Proc. 30th Int. Conf. Machine Learning, volume 28 of Proc. Machine Learning Research, pp.427-435, Atlanta, Georgia, USA, 17-19 June 2013. PMLR.

[14] Y. Ji, T. Lin, and H. Zha, “Mahalanobis distance based non-negative sparse representation for face recognition,” 2009 Int. Conf. Machine Learning and Applications. IEEE, Dec. 2009. doi:10.1109/icmla.2009.50.
CrossRef

[15] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction, “Advances in Neural Information Processing Systems 26: Proceedings of a meeting held Dec. 5-8, 2013, Lake Tahoe, Nevada, United States., pp.315-323, 2013.

[16] T. Kato, A. Kobayashi, W. Oishi, S.-S. Kadoya, S. Okabe, N. Ohta, M. Amarasiri, and D. Sano, “Sign-constrained linear regression for prediction of microbe concentration based on water quality datasets,” J. Water Health, vol.17, no.3, pp.404-415, June 2019.
CrossRef

[17] D. Kim, S. Sra, and I.S. Dhillon, “Tackling box-constrained optimization via a new projected quasi-newton approach. SIAM Journal on Scientific Computing, vol.32, no.6, pp.3548-3563, Jan. 2010. doi:10.1137/08073812x.
CrossRef

[18] K. Kimura, M. Kudo, and Y. Tanaka, “A column-wise update algorithm for nonnegative matrix factorization in bregman divergence with an orthogonal constraint,” Machine Learning, vol.103, no.2, pp.285-306, March 2016. doi:10.1007/s10994-016-5553-0.
CrossRef

[19] G. Landi and E.L. Piccolomini, “NPTool: a Matlab software for nonnegative image restoration with Newton projection methods,” Numerical Algorithms, vol.62, no.3, pp.487-504, June 2012. doi: 10.1007/s11075-012-9602-x.
CrossRef

[20] C.L. Lawson and R.J. Hanson, “Solving Least Squares Problems,” Society for Industrial and Applied Mathematics, Jan. 1995. doi:10.1137/1.9781611971217.
CrossRef

[21] D.D Lee and H.S. Seung, “Algorithms for non-negative matrix factorization” Advances in neural information processing systems, pp.556-562, 2001.

[22] Yuanqing Lin, D.D. Lee, and L.K. Saul. Nonnegative deconvolution for time of arrival estimation. In 2004 IEEE Int. Conf. Acoustics, Speech, and Signal Processing. IEEE, 2004. doi:10.1109/icassp.2004.1326273.
CrossRef

[23] J. Ma, “Algorithms for non-negatively constrained maximum penalized likelihood reconstruction in tomographic imaging,” Algorithms, 6(1):136-160, March 2013. doi: 10.3390/a6010136.
CrossRef

[24] Y. Nesterov, “Introductory Lectures on Convex Optimization: A Basic Course,” Kluwer Academic Publishers, 2003.
CrossRef

[25] N.L. Roux, M. Schmidt, and F.R. Bach, “A stochastic gradient method with an exponential convergence rate for finite training sets,” In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pp.2663-2671. Curran Associates, Inc., 2012.

[26] M. Schmidt, N. Le Roux, and F. Bach, “Erratum to: Minimizing finite sums with the stochastic average gradient,” Mathematical Programming, vol.162, no.1-2, 83-112, June 2016. doi:10.1007/s10107-016-1030-6.
CrossRef

[27] S. Shalev-Shwartz and T. Zhang, “Stochastic dual coordinate ascent methods for regularized loss, J. Mach. Learn. Res., vol.14, no.1, pp.567-599, Feb. 2013.
CrossRef

[28] A. Shashua and T. Hazan, “Non-negative tensor factorization with applications to statistics and computer vision,” Proc. 22nd Int. Conf. Machine learning - ICML’05, ACM Press, 2005. doi:10.1145/1102351.1102451.
CrossRef

[29] K. Tajima, K. Tsuchida, E.R.R. Zara, N. Ohta, and T. Kato, “Learning sign-constrained support vector machines,” 2020 25th Int. Conf. Pattern Recognition (ICPR). IEEE, Jan. 2021. doi:10.1109/icpr48806.2021.9412786.
CrossRef

[30] J. Wang, F. Tian, H. Yu, C.H. Liu, K. Zhan, and X. Wang, “Diverse non-negative matrix factorization for multiview data representation,” IEEE Trans Cybern, vol.48, no.9, pp.2620-2632 2017.
CrossRef

[31] Y. Wang and S. Ma, “Projected barzilai-borwein method for large-scale nonnegative image restoration,” Inverse Problems in Science and Engineering, vol.15, no.6, pp.559-583, Sep. 2007. doi: 10.1080/17415970600881897.
CrossRef

[32] L. Xiao and T. Zhang, “A proximal stochastic gradient method with progressive variance reduction,” SIAM Journal on Optimization, vol.24, no.4, pp.2057-2075, Jan. 2014. doi:10.1137/140961791.
CrossRef

[33] Q. Zhang, H. Wang, R. Plemmons, and V.P. Pauca, “Spectral unmixing using nonnegative tensor factorization,” Proc. 45th annual southeast regional conference on ACM-SE 45. ACM Press, 2007. doi: 10.1145/1233341.1233449.
CrossRef

Page top

Appendix A: Deriving the Dual Function

Lemma 1: $\forall\boldsymbol{v}:=\left[v_{1},\dots,v_{d}\right]^\top\in\mathbb{R}^{d}$,

\[\begin{equation*} \begin{aligned} & \inf_{\boldsymbol{w}\in\mathcal{S}}\lVert\boldsymbol{w}-\boldsymbol{v}\rVert^{2} - \lVert\boldsymbol{v}\rVert^{2} = -\lVert\boldsymbol{\pi}(\boldsymbol{v})\rVert^{2}. \end{aligned} \tag{A$\cdot $1} \end{equation*}\]

Proof of Lemma 1: Denote by $\mathbb{R}_{\ge}$ the set of non-negative real numbers. We have

\[\begin{equation*} \begin{aligned} \min_{w_{h}\in\mathbb{R}} (w_{h}-v_{h})^{2} - v_{h}^{2} = -v_{h}^{2} \end{aligned} \tag{A$\cdot $2} \end{equation*}\]

for $h\in\mathcal{I}_{0}$, and

\[\begin{equation*} \begin{aligned} \min_{w_{h}\in\mathbb{R}_{\ge}} (w_{h}-v_{h})^{2} - v_{h}^{2} = -(v_h-\max\{0,-v_{h}\})^{2} \end{aligned} \tag{A$\cdot $3} \end{equation*}\]

for $h\in\mathcal{I}_{\ge}$. Hence,

\[\begin{equation*}\begin{aligned}\text{LHS of (A⋅1)} &= \sum_{h\in\mathcal{I}_{0}} \min_{w_{h}\in\mathbb{R}} (w_{h}-v_{h})^{2} - v_{h}^{2} \\& \qquad + \sum_{h\in\mathcal{I}_{\ge}} \min_{w_{h}\in\mathbb{R}_{\ge}} (w_{h}-v_{h})^{2} - v_{h}^{2} \\ &= -\sum_{h\in\mathcal{I}_{0}}v_{h}^{2} - \sum_{h\in\mathcal{I}_{\ge}} (v_h-\max\{0,-v_{h}\})^{2} \\ &= -\lVert\boldsymbol{\pi}(\boldsymbol{v})\rVert^{2} = \text{RHS of (A⋅1)} \end{aligned} \tag{A$\cdot $4} \end{equation*}\]

where LHS and RHS are the abbreviations of left and right hand sides, respectively.$\tag*{$\mathit{q.e.d. of\ Lemma\ 1}$}$

We now derive the dual function in (6). The primal optimization problem (4) is equivalent to the following constrained problem:

\[\begin{equation*} \begin{aligned} \text{min}\quad & \frac{\lambda}{2}\lVert\boldsymbol{w}\rVert^{2} + \frac{1}{n}\sum_{i=1}^{n}\phi_{i}(z_{i}) \\ \text{wrt}\quad & \boldsymbol{w}\in \mathcal{S}, \, \boldsymbol{z}:=\left[z_{1},\dots,z_{n}\right]^\top\in\mathbb{R}^{n} \\ \text{subject to} \quad & \forall i\in[n], \quad z_{i}=\left<\boldsymbol{w},\boldsymbol{x}_{i}\right>. \end{aligned} \tag{A$\cdot $5} \end{equation*}\]

Letting $\bar{\boldsymbol{w}}(\boldsymbol{\alpha}):=(\lambda n)^{-1}\sum_{i=1}^{n}\boldsymbol{x}_{i}\alpha_{i}$, the Lagrangian function is

\[\begin{equation*} \begin{aligned} & L(\boldsymbol{w},\boldsymbol{z},\boldsymbol{\alpha}) \\ &:= \frac{\lambda}{2}\lVert\boldsymbol{w}\rVert^{2} + \frac{1}{n}\sum_{i=1}^{n}\phi_{i}(z_{i}) + \frac{1}{n}\sum_{i=1}^{n}(z_{i}-\left<\boldsymbol{w},\boldsymbol{x}_{i}\right>)\alpha_{i} \\ &= \frac{\lambda}{2}\lVert\boldsymbol{w}\rVert^{2} - \lambda \left<\boldsymbol{w},\bar{\boldsymbol{w}}(\boldsymbol{\alpha})\right> + \frac{1}{n}\sum_{i=1}^{n}\left(\phi_{i}(z_{i}) + \alpha_{i}z_{i}\right) \\ &= \frac{\lambda}{2} \left(\lVert\boldsymbol{w}-\bar{\boldsymbol{w}}(\boldsymbol{\alpha})\rVert^{2}-\lVert\bar{\boldsymbol{w}}(\boldsymbol{\alpha})\rVert^{2}\right) + \frac{1}{n}\sum_{i=1}^{n}\left(\phi_{i}(z_{i}) + \alpha_{i}z_{i}\right). \end{aligned} \tag{A$\cdot $6} \end{equation*}\]

For the convex conjugate of the loss functions $\phi_{i}$, we have

\[\begin{equation*} \begin{aligned} -\phi^{*}_{i}(-\alpha_{i}) &= -\sup_{z_{i}\in\mathbb{R}} \left(-\alpha_{i}z_{i}-\phi_{i}(z_{i})\right) \\ &= \inf_{z_{i}\in\mathbb{R}} \left(\alpha_{i}z_{i}+\phi_{i}(z_{i})\right). \end{aligned} \tag{A$\cdot $7} \end{equation*}\]

The dual function is then obtained as

\[\begin{equation*} \begin{aligned} & D(\boldsymbol{\alpha}) = \inf_{\boldsymbol{w}\in\mathcal{S}}\inf_{\boldsymbol{z}\in\mathbb{R}^{n}} L(\boldsymbol{w},\boldsymbol{z},\boldsymbol{\alpha}) \\ &= \frac{\lambda}{2} \inf_{\boldsymbol{w}\in\mathcal{S}} \left(\lVert\boldsymbol{w}-\bar{\boldsymbol{w}}(\boldsymbol{\alpha})\rVert^{2}-\lVert\bar{\boldsymbol{w}}(\boldsymbol{\alpha})\rVert^{2}\right) \\ &\qquad + \frac{1}{n}\sum_{i=1}^{n} \inf_{z_{i}\in\mathbb{R}} \left(\phi_{i}(z_{i}) + \alpha_{i}z_{i}\right) \\ &= - \frac{\lambda}{2} \lVert\boldsymbol{\pi}(\bar{\boldsymbol{w}}(\boldsymbol{\alpha}))\rVert^{2} - \frac{1}{n}\sum_{i=1}^{n} \phi^{*}_{i}(-\alpha_{i}) \\ & = -\frac{1}{2\lambda n^{2}} \left\lVert \boldsymbol{\pi} \left( \sum_{i=1}^{n} \alpha_{i}\boldsymbol{x}_{i} \right) \right\rVert^{2} - \frac{1}{n} \sum_{i=1}^{n} \phi_{i}^{*}(-\alpha_{i}), \end{aligned} \tag{A$\cdot $8} \end{equation*}\]

where the third equality follows from Lemma 1 and (A$\cdot$ 7).

Page top

Appendix B: Proof of Theorem 1

The proof of Theorem 1 given here mainly follows the proof technique used for the vanilla SDCA in [27]. The following lemmas are important ingredients.

Lemma 2: For any $\boldsymbol{v}_{0},\boldsymbol{v}_{q}\in\mathbb{R}^{d}$ and $\eta\in[0,1]$, it holds that

\[\begin{equation*} \begin{aligned} & \lVert\boldsymbol{\pi}[\boldsymbol{v}_{0}]\rVert^{2} - \lVert\boldsymbol{\pi}[\boldsymbol{v}_{0}+\eta\boldsymbol{v}_{q}]\rVert^{2} \\ & \qquad + 2\eta\left<\boldsymbol{\pi}[\boldsymbol{v}_{0}],\boldsymbol{v}_{q}\right> + \eta^{2}\lVert\boldsymbol{v}_{q}\rVert^{2} \ge 0. \end{aligned} \tag{A$\cdot $9} \end{equation*}\]

Lemma 3: Let $F_{i}^{(t-1)} := \phi_{i}(z_{i}^{(t-1)}) + \phi^{*}_{i}(-\alpha_{i}^{(t-1)}) + z_{i}^{(t-1)}\alpha_{i}^{(t-1)}.$ It then holds that

\[\begin{equation*} \begin{aligned} h_{\text{P}}^{(t-1)} + h_{\text{D}}^{(t-1)} \le \frac{1}{n}\sum_{j=1}^{n}F_{j}^{(t-1)}. \end{aligned} \tag{A$\cdot $10} \end{equation*}\]

Proof of Lemma 2: It suffices to show that $\forall h\in[n]$,

\[\begin{equation*} \begin{aligned} & \pi_{h}[v_{h,0}]^{2} - \pi_{h}[v_{h,0}+\eta v_{h,q}]^{2} \\ & \qquad + 2\eta\pi[v_{h,0}]v_{h,q} + \eta^{2}v_{h,q}^{2} \ge 0. \end{aligned} \tag{A$\cdot $11} \end{equation*}\]

where $\pi_{h}[v_{h}] := v_{h} - \max\{0,-\sigma_{h}v_{h}\}$ for $v_{h}\in\mathbb{R}$, because the sum of LHS of (A$\cdot$ 11) over $h\in[n]$ is equal to LHS of (A$\cdot$ 9).

For $h\in\mathcal{I}_{0}$ and $h\in\mathcal{I}_{\ge}$ such that $v_{h,0}+\eta v_{h,q}\ge 0$ and $v_{h,0}\ge 0$, LHS of (A$\cdot$ 11) can be rearranged as

\[\begin{equation*} \begin{aligned} v_{h,0}^{2} - (v_{h,0}+\eta v_{h,q})^{2} + 2\eta v_{h,0}v_{h,q} + \eta^{2}v_{h,q}^{2} = 0. \end{aligned} \tag{A$\cdot $12} \end{equation*}\]

For $h\in\mathcal{I}_{\ge}$ such that $v_{h,0}+\eta v_{h,q}\le 0$ and $v_{h,0}\ge 0$, LHS of (A$\cdot$ 11) can be rearranged as

\[\begin{equation*} \begin{aligned} & v_{h,0}^{2} - 0 + 2\eta v_{h,0}v_{h,q} + \eta^{2}v_{h,q}^{2} = (v_{h,0}+\eta v_{h,q})^{2} \ge 0. \end{aligned} \tag{A$\cdot $13} \end{equation*}\]

For $h\in\mathcal{I}_{\ge}$ such that $v_{h,0}+\eta v_{h,q}\ge 0$ and $v_{h,0}\le 0$, LHS of (A$\cdot$ 11) can be rearranged as

\[\begin{equation*} \begin{aligned} 0 - (v_{h,0}+\eta v_{h,q})^{2} + 0 + \eta^{2}v_{h,q}^{2} \ge 0. \end{aligned} \tag{A$\cdot $14} \end{equation*}\]

since $0 \le v_{h,0} + \eta v_{h,q}\le \eta v_{h,q}$.$\tag*{$\mathit{q.e.d. of\ Lemma\ 2}$}$

Proof of Lemma 3: Observe that $\forall \boldsymbol{v}:=\left[v_{1},\dots,v_{d}\right]^\top \in \mathbb{R}^{d}$,

\[\begin{equation*} \begin{aligned} \lVert\boldsymbol{\pi}[\boldsymbol{v}]\rVert^{2} & = \sum_{h\in\mathcal{I}_{0}}v_{h}^{2} + \sum_{h\in\mathcal{I}_{\ge}}\pi_{h}[v_{h}]^{2} \\ &\le \sum_{h\in\mathcal{I}_{0}}v_{h}^{2} + \sum_{h\in\mathcal{I}_{\ge}}v_{h}^{2} = \lVert\boldsymbol{v}\rVert^{2} \end{aligned} \tag{A$\cdot $15} \end{equation*}\]

and

\[\begin{equation*} \begin{aligned} \lVert\bar{\boldsymbol{w}}^{(t-1)}\rVert^{2} &= \left< \bar{\boldsymbol{w}}^{(t-1)}, \frac{1}{\lambda n} \sum_{j=1}^{n} \boldsymbol{x}_{j}\alpha_{j}^{(t-1)} \right> \\ &= \frac{1}{\lambda n} \sum_{j=1}^{n} z_{j}^{(t-1)}\alpha_{j}^{(t-1)}. \end{aligned} \tag{A$\cdot $16} \end{equation*}\]

Then, these inequalities lead to

\[\begin{align} & h_{\text{P}}^{(t-1)} + h_{\text{D}}^{(t-1)} = P(\boldsymbol{\pi}[\boldsymbol{\alpha}^{(t-1)}]) - D(\boldsymbol{\alpha}^{(t-1)}) \notag \\ &= \lambda\lVert\boldsymbol{\pi}[\bar{\boldsymbol{w}}^{(t-1)}]\rVert^{2} + \frac{1}{n} \sum_{j=1}^{n} \phi_{j}(z_{j}^{(t-1)}) + \phi^{*}_{j}(-\alpha_{j}^{(t-1)}) \notag \\ & \le \lambda\lVert\bar{\boldsymbol{w}}^{(t-1)}\rVert^{2} + \frac{1}{n} \sum_{j=1}^{n} \phi_{j}(z_{j}^{(t-1)}) + \phi^{*}_{j}(-\alpha_{j}^{(t-1)}) \notag \\ & = \frac{1}{n} \sum_{j=1}^{n} F_{j}^{(t-1)}. \tag{A$\cdot $17} \end{align}\]

$\tag*{$\mathit{q.e.d. of\ Lemma\ 3}$}$We are now ready to prove Theorem 1. Let $z_{i}^{(t-1)}:=\left<\boldsymbol{w}^{(t-1)},\boldsymbol{x}_{i}\right>$, $u_{i}^{(t-1)}:=-\nabla\phi_{i}(z_{i}^{(t-1)})$, $q_{i}^{(t-1)}:=u_{i}^{(t-1)}-\alpha_{i}^{(t-1)}$, and $\bar{\beta}:=\lambda \gamma/(\lambda n \gamma+R^{2})$. Then,

\[\begin{equation*} \begin{aligned} & h_{\text{D}}^{(t-1)}-h_{\text{D}}^{(t)} \ge J^{1}_{t}(\eta_{t}) \ge J^{1}_{t}(n\bar{\beta}) \\ & \ge - 2\lambda n \bar{\beta} \left<\boldsymbol{w}^{(t-1)},\frac{\boldsymbol{x}_{i}q_{i}^{(t-1)}}{\lambda n}\right> - \lambda n^{2}\bar{\beta}^{2}\left\lVert\frac{\boldsymbol{x}_{i}q_{i}^{(t-1)}}{\lambda n}\right\rVert^{2} \\ &\qquad + a_{\text{offs}}n^{2}\bar{\beta}^{2} + b_{\text{offs}}n \bar{\beta} \\ & = \bar{\beta}F_{i}^{(t-1)} + \frac{\gamma \bar{\beta} (q_{i}^{(t-1)})^{2}}{2} {\left( 1- \frac{(\lambda n \gamma + \lVert\boldsymbol{x}_{i}\rVert^{2})\bar{\beta}}{\lambda \gamma}\right)} \\ & \ge \bar{\beta}F_{i}^{(t-1)} + \frac{\gamma \bar{\beta} (q_{i}^{(t-1)})^{2}}{2} {\left( 1- \frac{(\lambda n \gamma + R^{2})\bar{\beta}}{\lambda \gamma}\right)} = \bar{\beta}F_{i}^{(t-1)} \end{aligned} \tag{A$\cdot $18} \end{equation*}\]

where we have used Lemma 2 to get the third inequality. Taking the expectation with respect to the randomness for selection of $i\in[n]$ at $t$th iteration, we obtain

\[\begin{equation*} \begin{aligned} & h_{\text{D}}^{(t-1)}-\mathbb{E}\left[h_{\text{D}}^{(t)}\right] \ge \frac{\bar{\beta}}{n}\sum_{i=1}^{n}F_{i}^{(t-1)} \\ & \ge \bar{\beta}\cdot\left(h_{\text{P}}^{(t-1)} + h_{\text{D}}^{(t-1)} \right) \ge \bar{\beta}\cdot \max\left\{ h_{\text{P}}^{(t-1)}, h_{\text{D}}^{(t-1)} \right\} \end{aligned} \tag{A$\cdot $19} \end{equation*}\]

where the second inequality follows from Lemma 3. This leads to the bound of the expected primal objective error with respect to randomness at previous iterations:

\[\begin{equation*} \begin{aligned} \mathbb{E}[h_{\text{P}}^{(t)}] &\le \bar{\beta}^{-1} \mathbb{E}\left[ h_{\text{D}}^{(t)}-h_{\text{D}}^{(t+1)} \right] \le \bar{\beta}^{-1} \mathbb{E}\left[ h_{\text{D}}^{(t)} \right] \\ & \le \bar{\beta}^{-1} \mathbb{E}\left[ h_{\text{D}}^{(t-1)} \right] (1-\bar{\beta}) \\ &\le \bar{\beta}^{-1} h_{\text{D}}^{(0)} \cdot(1-\bar{\beta})^{t} \le \bar{\beta}^{-1} h_{\text{D}}^{(0)} \exp\left(-\bar{\beta}t\right). \end{aligned} \tag{A$\cdot $20} \end{equation*}\]

Hence, it holds that $\mathbb{E}[h_{\text{P}}^{(t)}]\le\epsilon_{\text{P}}$ conditioned on $\bar{\beta}^{-1} h_{\text{D}}^{(0)} \exp\left(-\bar{\beta}t\right) \le \epsilon_{\text{P}}$. This condition can be rearranged as

\[\begin{equation*} \begin{aligned} t \ge \frac{1}{\bar{\beta}} \log\left( \frac{h_{\text{D}}^{(0)}}{\epsilon_{\text{P}}\bar{\beta}} \right). \end{aligned} \tag{A$\cdot $21} \end{equation*}\]

$\tag*{$\mathit{q.e.d. of\ Theorem\ 1}$}$

Page top

Authors

Yuya TAKADA
Gunma University

received his B.E. degree from Gunma University, Japan in 2022. He is now a master’s student at the Graduate School of Science & Technology, Gunma University. His research interests include machine learning and data science. He is a member of IPSJ.

Rikuto MOCHIDA
Gunma University

received his B.E. degree from Gunma University, Japan in 2023. He is now a master’s student at the Graduate School of Science & Technology, Gunma University. His research interests include machine learning and data science. He is a member of IPSJ.

Miya NAKAJIMA
Gunma University

Syun-suke KADOYA
The University of Tokyo

received his B.E., and M.E. degrees from Hokkaido University, Sapporo, Japan in 2016 and 2018, respectively. He received his PhD degree from Tohoku University, Sendai, Japan in 2021. He is now a post-doc at the University of Tokyo. He is currently interested in time series analysis, infectious disease modeling, virus population genetics and bioinformatics. He is a member of JSCE and JSWE.

Daisuke SANO
Tohoku University

received his B.E., M.E., and PhD degrees from Tohoku University, Sendai, Japan, in 1998, 2000, and 2003, respectively. He took a Post-Doctoral Fellowship at Tohoku University (2003-2007) and University of Barcelona, Spain (2007-2009), and then, he got a tenure position (Associate Professor) at Hokkaido University, Sapporo, Japan, since 2009 and was managing a team of water and public health study. He moved back to Tohoku University in 2017 and is responsible for the research and supervision of international and domestic projects in the field of water and public health. He is currently interested in the statistical modeling of enteric pathogens disinfection/removal, antibiotic resistance and water, and the impact of chemical contaminants on ecosystems. He is a member of JSCE, JSWE, IWA, ASM and ACE.

Tsuyoshi KATO
Gunma University

received his B.E., M.E., and PhD degrees from Tohoku University, Sendai, Japan, in 1998, 2000, and 2003, respectively. From 2003 to 2005, he was a postdoctoral fellow at the National Institute of Advanced Industrial Science Technology (AIST) in Tokyo. From 2005 to 2008, he was an assistant professor at the University of Tokyo. He is now a full professor at the Faculty of Informatics, Gunma University. His current scientific interests include pattern recognition, non-destructive testing, water engineering and bioinformatics. He is a member of IEICE JSNDI, JSWE, and IPSJ.