Author Search Result

[Author] Keiichi TOKUDA (33 hits)

Showing 1-20 of 33 results

  • A 16 kb/s Wideband CELP-Based Speech Coder Using Mel-Generalized Cepstral Analysis

    Kazuhito KOISHIDA  Gou HIRABAYASHI  Keiichi TOKUDA  Takao KOBAYASHI  

     
    PAPER-Speech and Hearing

      Vol:
    E83-D No:4
      Page(s):
    876-883

    We propose a wideband CELP-type speech coder at 16 kb/s based on a mel-generalized cepstral (MGC) analysis technique. MGC analysis makes it possible to obtain a more accurate representation of spectral zeros than linear predictive (LP) analysis and to take a perceptual frequency scale into account. A major advantage of the proposed coder is that the benefits of the MGC representation of speech spectra can be incorporated into the CELP coding process. Subjective tests show that the proposed coder at 16 kb/s achieves a significant improvement in performance over a conventional 16 kb/s CELP coder under the same coding framework and bit allocation. Moreover, the proposed coder is found to outperform the ITU-T G.722 standard at 64 kb/s.
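
    As a rough illustration of the analysis stage, the following sketch computes MGC coefficients for a single frame, assuming the pysptk package; the frame length, order, and alpha/gamma settings are illustrative, not the coder's actual configuration.

    ```python
    # A minimal sketch of mel-generalized cepstral (MGC) analysis of one
    # speech frame, assuming the pysptk package; order, alpha, and gamma
    # are illustrative values, not the coder's settings.
    import numpy as np
    import pysptk

    frame = np.random.randn(400)               # stand-in for a 25 ms frame at 16 kHz
    windowed = frame * np.hamming(len(frame))

    # gamma = -0.5 gives a pole-zero spectral model; alpha warps the
    # frequency axis toward a perceptual (mel-like) scale.
    mgc = pysptk.mgcep(windowed, order=24, alpha=0.42, gamma=-0.5)
    print(mgc.shape)                           # (25,) = order + 1 coefficients
    ```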

  • An Extension of Separable Lattice 2-D HMMs for Rotational Data Variations

    Akira TAMAMORI  Yoshihiko NANKAKU  Keiichi TOKUDA  

     
    PAPER-Pattern Recognition

      Vol:
    E95-D No:8
      Page(s):
    2074-2083

    This paper proposes a new generative model that can deal with rotational data variations by extending Separable Lattice 2-D HMMs (SL2D-HMMs). In image recognition, geometric variations such as size, location, and rotation degrade performance, so appropriate normalization processes for such variations are required. SL2D-HMMs can perform elastic matching in both the horizontal and vertical directions, which makes it possible to model invariance to size and location. To deal with rotational variations, we introduce additional HMM states that represent shifts of the state alignments among the observation lines in a particular direction. Face recognition experiments show that the proposed method significantly improves performance on data with rotational variations.

  • Integration of Spectral Feature Extraction and Modeling for HMM-Based Speech Synthesis

    Kazuhiro NAKAMURA  Kei HASHIMOTO  Yoshihiko NANKAKU  Keiichi TOKUDA  

     
    PAPER-HMM-based Speech Synthesis

      Vol:
    E97-D No:6
      Page(s):
    1438-1448

    This paper proposes a novel approach for integrating spectral feature extraction and acoustic modeling in hidden Markov model (HMM) based speech synthesis. The statistical modeling process of speech waveforms is typically divided into two component modules: the frame-by-frame feature extraction module and the acoustic modeling module. In the feature extraction module, the statistical mel-cepstral analysis technique has been used, and the objective function is the likelihood of mel-cepstral coefficients for given speech waveforms. In the acoustic modeling module, the objective function is the likelihood of model parameters for given mel-cepstral coefficients. Improving the performance of each component module is important for achieving higher-quality synthesized speech. However, the final objective of speech synthesis systems is to generate natural speech waveforms from given texts, and improving each component module does not always improve the quality of synthesized speech. Ideally, therefore, all objective functions should be optimized based on an integrated criterion that well represents the subjective speech quality of human perception. In this paper, we propose an approach that models speech waveforms directly and optimizes this final objective function. Experimental results show that the proposed method outperformed conventional methods in both objective and subjective measures.

  • A Bayesian Framework Using Multiple Model Structures for Speech Recognition

    Sayaka SHIOTA  Kei HASHIMOTO  Yoshihiko NANKAKU  Keiichi TOKUDA  

     
    PAPER-Speech and Hearing

      Vol:
    E96-D No:4
      Page(s):
    939-948

    This paper proposes an acoustic modeling technique for speech recognition based on a Bayesian framework using multiple model structures. The aim of the Bayesian approach is to obtain good predictions of observations by marginalizing all variables related to the generative process. Although the effectiveness of marginalizing model parameters was recently reported in speech recognition, most of these systems use only “one” model structure, e.g., a fixed HMM topology, number of states and mixtures, type of state output distribution, and parameter tying structure. However, this is insufficient to represent the true model distribution, because such a model family usually does not include the true distribution in practical cases. One solution to this problem is to use multiple model structures. Although several approaches using multiple model structures have been proposed, a consistent integration of multiple model structures based on the Bayesian approach has not been seen in speech recognition. This paper focuses on integrating multiple phonetic decision trees in HMM-based acoustic modeling within the Bayesian framework. The proposed method is derived from a new marginal likelihood function that includes the model structures as a latent variable in addition to the HMM state sequences and model parameters, and the posterior distributions of these latent variables are obtained using the variational Bayesian method. Furthermore, to improve the optimization, the deterministic annealing EM (DAEM) algorithm is applied to the training process. The proposed method effectively utilizes multiple model structures, especially in the early stage of training, and this leads to better predictive distributions and improved recognition performance.
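
    The core idea of averaging over model structures can be shown in miniature. In the sketch below, Gaussian mixtures of different sizes stand in for different decision-tree structures, and BIC is a crude stand-in for the variational lower bound on each structure's marginal likelihood; all names and settings are illustrative.

    ```python
    # A minimal sketch of Bayesian averaging over multiple model structures,
    # assuming scikit-learn; GMMs of different sizes stand in for different
    # structures, and BIC approximates each structure's log evidence.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    train = rng.normal(size=(300, 2))
    test = rng.normal(size=(50, 2))

    structures = [GaussianMixture(n_components=k, random_state=0).fit(train)
                  for k in (1, 2, 4)]

    log_evid = np.array([-0.5 * m.bic(train) for m in structures])
    w = np.exp(log_evid - log_evid.max())
    w /= w.sum()                          # approximate structure posteriors

    # structure-averaged predictive density of the test data
    pred = sum(wi * np.exp(m.score_samples(test))
               for wi, m in zip(w, structures))
    print(np.log(pred).sum())             # averaged predictive log-likelihood
    ```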

  • Continuous Speech Recognition Based on General Factor Dependent Acoustic Models

    Hiroyuki SUZUKI  Heiga ZEN  Yoshihiko NANKAKU  Chiyomi MIYAJIMA  Keiichi TOKUDA  Tadashi KITAMURA  

     
    PAPER-Feature Extraction and Acoustic Modeling

      Vol:
    E88-D No:3
      Page(s):
    410-417

    This paper describes continuous speech recognition that incorporates additional complementary information, e.g., voice characteristics, speaking styles, linguistic information, and noise environment, into HMM-based acoustic modeling. Speech recognition systems have commonly used context-dependent HMMs, i.e., triphones, together with tree-based context clustering. Several attempts have been made in recent years to utilize not only phonetic contexts but also such additional complementary information through context (factor) dependent HMMs. However, when the additional factors of the test data are unobserved, a method for obtaining factor labels is required before decoding. In this paper, we propose a model integration technique based on general factor dependent HMMs for decoding. The integrated HMMs can be used by a conventional decoder as standard triphone HMMs with Gaussian mixture densities. Moreover, by using the results of context clustering, the proposed method can determine an optimal number of mixture components for each state depending on the degree of influence of the additional factors. Phoneme recognition experiments using voice characteristic labels show significant improvements with a small number of model parameters, and a 19.3% error reduction was obtained in noisy-environment experiments.

  • A Reordering Model Using a Source-Side Parse-Tree for Statistical Machine Translation

    Kei HASHIMOTO  Hirofumi YAMAMOTO  Hideo OKUMA  Eiichiro SUMITA  Keiichi TOKUDA  

     
    PAPER-Machine Translation

      Vol:
    E92-D No:12
      Page(s):
    2386-2393

    This paper presents a reordering model using a source-side parse-tree for phrase-based statistical machine translation. The proposed model is an extension of IST-ITG (imposing source tree on inversion transduction grammar) constraints. In the proposed method, the target-side word order is obtained by rotating nodes of the source-side parse-tree. We modeled the node rotation, monotone or swap, using word alignments based on a training parallel corpus and source-side parse-trees. The model efficiently suppresses erroneous target word orderings, especially global orderings. Furthermore, the proposed method conducts a probabilistic evaluation of target word reorderings. In English-to-Japanese and English-to-Chinese translation experiments, the proposed method resulted in a 0.49-point improvement (29.31 to 29.80) and a 0.33-point improvement (18.60 to 18.93) in word BLEU-4 compared with IST-ITG constraints, respectively. This indicates the validity of the proposed reordering model.
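
    A toy sketch of the node-rotation idea follows: each binary node of a source-side parse tree either keeps its children in order (monotone) or swaps them, with each choice scored by a per-label probability. The tree, labels, and probabilities are invented for illustration; the actual model estimates such probabilities from word alignments on a parallel corpus.

    ```python
    # A toy sketch of monotone/swap node rotation on a source parse tree;
    # the tree and P_SWAP values are illustrative, not learned parameters.
    from typing import Union

    Tree = Union[str, tuple]      # a leaf word or (label, left, right)

    P_SWAP = {"S": 0.1, "VP": 0.8, "NP": 0.2}   # illustrative swap probabilities

    def reorder(t: Tree) -> list:
        """Greedily reorder: swap children when P(swap | label) > 0.5."""
        if isinstance(t, str):
            return [t]
        label, left, right = t
        l, r = reorder(left), reorder(right)
        return r + l if P_SWAP.get(label, 0.0) > 0.5 else l + r

    tree = ("S", "I", ("VP", "ate", ("NP", "an", "apple")))
    print(reorder(tree))          # ['I', 'an', 'apple', 'ate'] -- SOV-like order
    ```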

  • A Training Method of Average Voice Model for HMM-Based Speech Synthesis

    Junichi YAMAGISHI  Masatsune TAMURA  Takashi MASUKO  Keiichi TOKUDA  Takao KOBAYASHI  

     
    PAPER

      Vol:
    E86-A No:8
      Page(s):
    1956-1963

    This paper describes a new training method for the average voice model used in speech synthesis, in which an arbitrary speaker's voice is generated based on speaker adaptation. When the amount of training data is limited, the distributions of the average voice model are often biased depending on speaker and/or gender, which degrades the quality of synthetic speech. In the proposed method, to reduce the influence of speaker dependence, we incorporate a context clustering technique called shared decision tree context clustering, together with speaker adaptive training, into the training procedure of the average voice model. Results of subjective tests show that the average voice model trained using the proposed method generates more natural-sounding speech than the conventional average voice model. Moreover, the voice characteristics and prosodic features of synthetic speech generated from the adapted model are shown to be closer to those of the target speaker than with the conventional method.

  • A Context Clustering Technique for Average Voice Models

    Junichi YAMAGISHI  Masatsune TAMURA  Takashi MASUKO  Keiichi TOKUDA  Takao KOBAYASHI  

     
    PAPER-Speech Synthesis and Prosody

      Vol:
    E86-D No:3
      Page(s):
    534-542

    This paper describes a new context clustering technique for average voice models, i.e., sets of speaker-independent speech synthesis units. In this technique, we first train speaker-dependent models using a multi-speaker speech database, and then construct a decision tree common to these speaker-dependent models for context clustering. When a node of the decision tree is split, only context-related questions that are applicable to all speaker-dependent models are adopted. As a result, every node of the decision tree always has training data from all speakers. After construction of the decision tree, all speaker-dependent models are clustered using the common tree, and a speaker-independent model, i.e., an average voice model, is obtained by combining the speaker-dependent models. Results of subjective tests show that average voice models trained using the proposed technique generate more natural-sounding speech than conventional average voice models.

  • Lip Location Normalized Training for Visual Speech Recognition

    Oscar VANEGAS  Keiichi TOKUDA  Tadashi KITAMURA  

     
    PAPER-Speech and Hearing

      Vol:
    E83-D No:11
      Page(s):
    1969-1977

    This paper describes a method of normalizing lip position to improve the performance of a visual-information-based speech recognition system. Two types of information are useful in speech recognition: the speech signal itself and visual information from the lips in motion. This paper addresses problems caused by using images of the lips in motion, such as the effect of variation in lip location. The proposed lip location normalization method is based on a search algorithm for the lip position in which the location normalization is integrated into model training. Speaker-independent isolated word recognition experiments were carried out on the Tulips1 and M2VTS databases. On the ten-digit word recognition task of the M2VTS database, the method achieved a recognition rate of 74.5% and an error reduction of 35.7%.

  • Adaptive AR Spectral Estimation Based on Wavelet Decomposition of the Linear Prediction Error

    Fernando Gil V. RESENDE Jr.  Keiichi TOKUDA  Mineo KANEKO  

     
    PAPER-Digital Signal Processing

      Vol:
    E79-A No:5
      Page(s):
    665-673

    A new adaptive AR spectral estimation method is proposed. While conventional least-squares methods use a single windowing function to analyze the linear prediction error, the proposed method uses a different window for each frequency band of the linear prediction error to define the cost function to be minimized. With this approach, since time and frequency resolutions can be traded off across the frequency spectrum, the precision of the estimates is improved. In this paper, a wavelet-like time-frequency resolution grid is used so that low-frequency components of the linear prediction error are analyzed through long windows and high-frequency components through short ones. To solve the optimization problem for the new cost function, special properties of the correlation matrix are used to derive an RLS algorithm of order M², where M is the number of parameters of the AR model. Computer simulations comparing the performance of the conventional RLS and proposed methods are shown. In particular, the wavelet-based spectral estimation method gives fine frequency resolution at low frequencies and sharp time resolution at high frequencies, whereas conventional methods can obtain only one of these characteristics.
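
    For reference, the sketch below implements a conventional exponentially-windowed RLS AR estimator of the kind the paper takes as its baseline, with per-sample cost on the order of M²; the test signal and forgetting factor are illustrative.

    ```python
    # A minimal sketch of conventional exponentially-windowed RLS for AR
    # coefficient estimation; signal, order, and forgetting factor are
    # illustrative choices.
    import numpy as np

    rng = np.random.default_rng(0)
    M, lam = 2, 0.99                          # AR order, forgetting factor

    # synthesize a stable AR(2) process: x_t = 1.5 x_{t-1} - 0.7 x_{t-2} + e_t
    x = np.zeros(2000)
    for t in range(2, len(x)):
        x[t] = 1.5 * x[t - 1] - 0.7 * x[t - 2] + rng.normal()

    a = np.zeros(M)                           # AR coefficient estimates
    P = np.eye(M) * 1e3                       # inverse correlation matrix
    for t in range(M, len(x)):
        phi = x[t - M:t][::-1]                # regressor [x_{t-1}, ..., x_{t-M}]
        k = P @ phi / (lam + phi @ P @ phi)   # gain vector
        e = x[t] - a @ phi                    # a priori prediction error
        a = a + k * e
        P = (P - np.outer(k, phi @ P)) / lam

    print(a.round(3))                         # approaches [1.5, -0.7]
    ```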

  • Mixture Density Models Based on Mel-Cepstral Representation of Gaussian Process

    Toru TAKAHASHI  Keiichi TOKUDA  Takao KOBAYASHI  Tadashi KITAMURA  

     
    PAPER

      Vol:
    E86-A No:8
      Page(s):
    1971-1978

    This paper defines a new kind of mixture density model for modeling a quasi-stationary Gaussian process based on a mel-cepstral representation. The conventional AR mixture density model can be applied to modeling a quasi-stationary Gaussian AR process, but it cannot model spectral zeros. In contrast, the proposed model is based on a frequency-warped exponential (EX) model. Accordingly, it can represent spectral poles and zeros with equal weight, and the model spectrum has high resolution at low frequencies. The parameter estimation algorithm for the proposed model is also derived based on an EM algorithm. Experimental results show that the proposed model performs better than the AR mixture density model for modeling a frequency-warped EX process.

  • A Bayesian Approach to Image Recognition Based on Separable Lattice Hidden Markov Models

    Kei SAWADA  Akira TAMAMORI  Kei HASHIMOTO  Yoshihiko NANKAKU  Keiichi TOKUDA  

     
    PAPER-Pattern Recognition

      Publicized:
    2016/09/05
      Vol:
    E99-D No:12
      Page(s):
    3119-3131

    This paper proposes a Bayesian approach to image recognition based on separable lattice hidden Markov models (SL-HMMs). Geometric variations of the object to be recognized, e.g., size, location, and rotation, are an essential problem in image recognition. SL-HMMs, which were proposed to reduce the effect of geometric variations, can perform elastic matching both horizontally and vertically. This makes it possible to model not only invariance to the size and location of the object but also nonlinear warping in both dimensions. The maximum likelihood (ML) method has been used to train SL-HMMs. However, in some image recognition tasks it is difficult to acquire sufficient training data, and the ML method suffers from over-fitting when training data is insufficient. This study aims to estimate SL-HMMs accurately using the maximum a posteriori (MAP) and variational Bayesian (VB) methods. The MAP and VB methods can utilize prior distributions representing useful prior information, and the VB method is expected to obtain high generalization ability through marginalization of model parameters. Furthermore, to overcome the local maximum problem in the MAP and VB methods, the deterministic annealing expectation maximization algorithm is applied when training SL-HMMs. Face recognition experiments on the XM2VTS database indicated that the proposed method offers significantly improved image recognition performance. Additionally, comparative experimental results showed that the proposed method was more robust to geometric variations than convolutional neural networks.

  • Deterministic Annealing EM Algorithm in Acoustic Modeling for Speaker and Speech Recognition

    Yohei ITAYA  Heiga ZEN  Yoshihiko NANKAKU  Chiyomi MIYAJIMA  Keiichi TOKUDA  Tadashi KITAMURA  

     
    PAPER-Feature Extraction and Acoustic Modeling

      Vol:
    E88-D No:3
      Page(s):
    425-431

    This paper investigates the effectiveness of the deterministic annealing EM (DAEM) algorithm in acoustic modeling for speaker and speech recognition. Although the EM algorithm has been widely used to approximate ML estimates, it suffers from dependence on initialization. To relax this problem, the DAEM algorithm was proposed, and its effectiveness has been confirmed on small artificial tasks. In this paper, we apply the DAEM algorithm to practical speech recognition tasks: speaker recognition based on GMMs and continuous speech recognition based on HMMs. Experimental results show that the DAEM algorithm can improve recognition performance compared to the standard EM algorithm with conventional initialization, especially in flat-start training for continuous speech recognition.
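
    A minimal sketch of the DAEM idea for a one-dimensional Gaussian mixture follows: the E-step posterior is tempered by an inverse temperature β annealed toward 1, which smooths the likelihood surface early in training. The schedule and model sizes are illustrative, not those of the paper.

    ```python
    # A minimal sketch of deterministic annealing EM (DAEM) for a 1-D GMM;
    # the annealing schedule and mixture size are illustrative.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 300)])

    K = 2
    pi, mu, sigma = np.full(K, 1.0 / K), rng.normal(0, 1, K), np.ones(K)

    for beta in [0.2, 0.4, 0.6, 0.8, 1.0]:    # inverse-temperature schedule
        for _ in range(20):                   # EM iterations per temperature
            # E-step: tempered responsibilities
            logp = beta * (np.log(pi) + norm.logpdf(x[:, None], mu, sigma))
            logp -= logp.max(axis=1, keepdims=True)
            gamma = np.exp(logp)
            gamma /= gamma.sum(axis=1, keepdims=True)
            # M-step: standard weighted updates
            nk = gamma.sum(axis=0)
            pi = nk / len(x)
            mu = (gamma * x[:, None]).sum(axis=0) / nk
            sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

    print(mu.round(2))                        # approaches the true means (-2, 3)
    ```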

  • Applying Sparse KPCA for Feature Extraction in Speech Recognition

    Amaro LIMA  Heiga ZEN  Yoshihiko NANKAKU  Keiichi TOKUDA  Tadashi KITAMURA  Fernando G. RESENDE  

     
    PAPER-Feature Extraction and Acoustic Modeling

      Vol:
    E88-D No:3
      Page(s):
    401-409

    This paper analyzes the applicability of sparse kernel principal component analysis (SKPCA) for feature extraction in speech recognition and proposes an approach that makes SKPCA tractable for the large amounts of training data usual in speech recognition systems. Although kernel principal component analysis (KPCA) has proved to be an effective technique for speech recognition, it requires a reduction of the training data when the amount is excessively large. This reduction is important to avoid an infeasible or extremely high computational burden in the feature representation step for both training and test data. The standard approach is to randomly choose frames from the original data set, which does not necessarily provide a good statistical representation of it. To solve this problem, a likelihood-related re-estimation procedure was applied to the KPCA framework, creating SKPCA, which nevertheless remains intractable for large training databases. The proposed approach clusters the training data and applies an SKPCA-like data reduction to each cluster, generating reduced data clusters. These reduced clusters are merged and reduced recursively until a single cluster is obtained, making SKPCA tractable for large amounts of training data. Experimental results show that SKPCA with the proposed approach outperforms KPCA with the standard sparse solution using randomly chosen frames, as well as standard feature extraction techniques.
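
    A minimal sketch of the recursive reduce-and-merge idea follows, with k-means selection of representative frames standing in for the SKPCA-based reduction; cluster counts and sizes are illustrative.

    ```python
    # A minimal sketch of recursive cluster-reduce-merge data reduction;
    # nearest-centroid selection stands in for the SKPCA-based reduction,
    # and all sizes are illustrative.
    import numpy as np
    from sklearn.cluster import KMeans

    def reduce_set(X: np.ndarray, keep: int) -> np.ndarray:
        """Keep at most `keep` frames of X, those nearest k-means centroids."""
        keep = min(keep, len(X))
        km = KMeans(n_clusters=keep, n_init=10, random_state=0).fit(X)
        d = np.linalg.norm(X[:, None] - km.cluster_centers_, axis=2)
        return X[np.unique(d.argmin(axis=0))]

    def recursive_reduction(X, n_clusters=4, keep=50):
        while n_clusters > 1:
            labels = KMeans(n_clusters=n_clusters, n_init=10,
                            random_state=0).fit_predict(X)
            X = np.vstack([reduce_set(X[labels == c], keep)
                           for c in range(n_clusters)])
            n_clusters //= 2
        return reduce_set(X, keep)            # one final reduction

    X = np.random.default_rng(0).normal(size=(2000, 13))
    print(recursive_reduction(X).shape)       # reduced representative frame set
    ```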

  • Parameter Sharing in Mixture of Factor Analyzers for Speaker Identification

    Hiroyoshi YAMAMOTO  Yoshihiko NANKAKU  Chiyomi MIYAJIMA  Keiichi TOKUDA  Tadashi KITAMURA  

     
    PAPER-Feature Extraction and Acoustic Modeling

      Vol:
    E88-D No:3
      Page(s):
    418-424

    This paper investigates parameter tying structures for a mixture of factor analyzers (MFA) and discriminative training of MFA for speaker identification. The parameters of the factor loading matrices or the diagonal matrices are shared across different mixture components of the MFA. Minimum classification error (MCE) training is then applied to the MFA parameters to enhance their discriminative ability. Results of a text-independent speaker identification experiment show that MFA outperforms the conventional Gaussian mixture model (GMM) with diagonal or full covariance matrices and achieves its best performance when the diagonal matrices are shared, giving a relative gain of 26% over the GMM with diagonal covariance matrices. The improvement is especially significant under sparse training data conditions. Recognition performance is further improved by MCE training, with an additional 3% error reduction.

  • The Nitech-NAIST HMM-Based Speech Synthesis System for the Blizzard Challenge 2006

    Heiga ZEN  Tomoki TODA  Keiichi TOKUDA  

     
    PAPER-Speech and Hearing

      Vol:
    E91-D No:6
      Page(s):
    1764-1773

    We describe a statistical parametric speech synthesis system developed by a joint group from the Nagoya Institute of Technology (Nitech) and the Nara Institute of Science and Technology (NAIST) for the Blizzard Challenge 2006, an annual open evaluation of text-to-speech synthesis systems. To improve our 2005 system (Nitech-HTS 2005), we investigated new features such as mel-generalized cepstrum-based line spectral pairs (MGC-LSPs), the maximum likelihood linear transform (MLLT), and a full-covariance global variance (GV) probability density function (pdf). A combination of mel-cepstral coefficients, MLLT, and the full-covariance GV pdf scored highest in subjective listening tests, and the 2006 system performed significantly better than the 2005 system. The Blizzard Challenge 2006 evaluations show that Nitech-NAIST-HTS 2006 is competitive even when working with relatively large speech databases.

  • A Hidden Semi-Markov Model-Based Speech Synthesis System

    Heiga ZEN  Keiichi TOKUDA  Takashi MASUKO  Takao KOBAYASHI  Tadashi KITAMURA  

     
    PAPER-Speech and Hearing

      Vol:
    E90-D No:5
      Page(s):
    825-834

    A statistical speech synthesis system based on the hidden Markov model (HMM) was recently proposed. In this system, spectrum, excitation, and duration of speech are modeled simultaneously by context-dependent HMMs, and speech parameter vector sequences are generated from the HMMs themselves. This system defines a speech synthesis problem in a generative model framework and solves it based on the maximum likelihood (ML) criterion. However, there is an inconsistency: although state duration probability density functions (PDFs) are explicitly used in the synthesis part of the system, they have not been incorporated into its training part. This inconsistency can make the synthesized speech sound less natural. In this paper, we propose a statistical speech synthesis system based on a hidden semi-Markov model (HSMM), which can be viewed as an HMM with explicit state duration PDFs. The use of HSMMs can solve the above inconsistency because we can incorporate the state duration PDFs explicitly into both the synthesis and the training parts of the system. Subjective listening test results show that use of HSMMs improves the reported naturalness of synthesized speech.
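
    The modeling difference can be seen in a small sketch: an HMM self-transition probability implies a geometric state-duration distribution, whereas an HSMM attaches an explicit duration PDF (here a Gaussian) to each state. The parameters below are illustrative.

    ```python
    # A minimal sketch contrasting HMM (implicit, geometric) and HSMM
    # (explicit, here Gaussian) state durations; parameters are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)

    a_ii = 0.9                                   # HMM self-transition probability
    hmm_dur = rng.geometric(1.0 - a_ii, 10000)   # implied geometric durations

    mean_d, sd_d = 10.0, 2.0                     # explicit HSMM duration PDF
    hsmm_dur = np.maximum(1, np.round(rng.normal(mean_d, sd_d, 10000)))

    # both have mean occupancy ~10 frames, but the shapes differ sharply:
    # the geometric distribution's mode is a single frame.
    print(hmm_dur.mean(), hsmm_dur.mean())
    ```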

  • Vector Quantization of Speech Spectral Parameters Using Statistics of Static and Dynamic Features

    Kazuhito KOISHIDA  Keiichi TOKUDA  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER-Speech and Hearing

      Vol:
    E84-D No:10
      Page(s):
    1427-1434

    This paper proposes a vector quantization scheme that makes it possible to take the dynamics of input vectors into account. In the proposed scheme, a linear transformation is applied to consecutive input vectors, and the resulting vector is quantized with a distortion measure defined by the statistics. At the decoder side, the output vector sequence is determined from the statistics associated with the transmitted indices in such a way that the likelihood is maximized. To solve this maximization problem, a computationally efficient algorithm is derived. The performance of the proposed method is evaluated in LSP parameter quantization. The LSP trajectories and the corresponding spectra are found to change quite smoothly in the proposed method, and its use results in a significant improvement in subjective quality.
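
    A sketch of the generic maximum-likelihood solution underlying this kind of decoder-side determination, assuming per-frame means and precisions for static and delta features: the static sequence maximizing the likelihood is c = (WᵀU⁻¹W)⁻¹WᵀU⁻¹μ. The sizes, statistics, and delta window below are illustrative, not the paper's efficient algorithm.

    ```python
    # A minimal sketch of ML trajectory recovery from static + delta
    # statistics (one static dimension); all statistics are illustrative.
    import numpy as np

    T = 10                                   # number of frames
    mu = np.concatenate([np.sin(np.linspace(0, 3, T)),   # static means
                         np.zeros(T)])                   # delta means (zero for brevity)
    u_inv = np.diag(np.concatenate([np.full(T, 1.0),     # static precisions
                                    np.full(T, 4.0)]))   # delta precisions

    # W maps the static sequence to stacked [static; delta] features,
    # with delta c_t = 0.5 * (c_{t+1} - c_{t-1}).
    W = np.zeros((2 * T, T))
    W[:T] = np.eye(T)
    for t in range(T):
        if t > 0:
            W[T + t, t - 1] = -0.5
        if t < T - 1:
            W[T + t, t + 1] = 0.5

    c = np.linalg.solve(W.T @ u_inv @ W, W.T @ u_inv @ mu)
    print(c.round(2))                        # smooth trajectory near the static means
    ```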

  • State Duration Modeling for HMM-Based Speech Synthesis

    Heiga ZEN  Takashi MASUKO  Keiichi TOKUDA  Takayoshi YOSHIMURA  Takao KOBAYASHI  Tadashi KITAMURA  

     
    LETTER-Speech and Hearing

      Vol:
    E90-D No:3
      Page(s):
    692-693

    This paper describes explicit modeling of the state duration probability density function in HMM-based speech synthesis. We redefine, in a statistically correct manner, the probability of staying in a state for a given time interval, which is used to obtain the state duration PDF, and demonstrate improvements in the durations of synthesized speech.

  • On the Use of Kernel PCA for Feature Extraction in Speech Recognition

    Amaro LIMA  Heiga ZEN  Yoshihiko NANKAKU  Chiyomi MIYAJIMA  Keiichi TOKUDA  Tadashi KITAMURA  

     
    PAPER-Speech and Hearing

      Vol:
    E87-D No:12
      Page(s):
    2802-2811

    This paper describes an approach to feature extraction in speech recognition systems using kernel principal component analysis (KPCA). This approach represents speech features as the projections, onto the principal components, of mel-cepstral coefficients mapped into a feature space via a non-linear mapping. The non-linear mapping is performed implicitly using the kernel trick, which avoids mapping the input space into the feature space explicitly and makes the mapping computationally feasible. It is shown that applying dynamic (Δ) and acceleration (ΔΔ) coefficients, before and/or after the KPCA feature extraction procedure, is essential for obtaining higher classification performance. This approach obtained better results than the standard technique.
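
    A minimal sketch of a KPCA front-end of this kind, assuming scikit-learn's KernelPCA; the random stand-in frames, kernel width, and the simple gradient-based deltas are illustrative.

    ```python
    # A minimal sketch of KPCA feature extraction over mel-cepstral frames,
    # assuming scikit-learn; frames and kernel width are illustrative.
    import numpy as np
    from sklearn.decomposition import KernelPCA

    rng = np.random.default_rng(0)
    frames = rng.normal(size=(500, 13))      # stand-in for 13 mel-cepstra per frame

    kpca = KernelPCA(n_components=13, kernel="rbf", gamma=0.05)
    z = kpca.fit_transform(frames)           # nonlinear principal components

    # Delta coefficients may be appended after projection (or computed on the
    # cepstra before it, as the paper compares); np.gradient is a simple proxy.
    delta = np.gradient(z, axis=0)
    features = np.hstack([z, delta])
    print(features.shape)                    # (500, 26)
    ```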
