Author Search Result

[Author] Takao KOBAYASHI (23 hits)

Showing 1-20 of 23 hits

  • FOREWORD Open Access

    Takao KOBAYASHI  

     
    FOREWORD

      Vol:
    E93-D No:9
      Page(s):
    2347-2347
  • Multi-Space Probability Distribution HMM

    Keiichi TOKUDA  Takashi MASUKO  Noboru MIYAZAKI  Takao KOBAYASHI  

     
    INVITED PAPER-Pattern Recognition

      Vol:
    E85-D No:3
      Page(s):
    455-464

    This paper proposes a new kind of hidden Markov model (HMM) based on multi-space probability distribution, and derives a parameter estimation algorithm for the extended HMM. HMMs are widely used statistical models for characterizing sequences of speech spectra and have been successfully applied to speech recognition systems. HMMs are categorized into discrete HMMs and continuous HMMs, which model sequences of discrete symbols and continuous vectors, respectively. However, neither conventional discrete nor continuous HMMs can be applied to observation sequences that consist of both continuous values and discrete symbols; F0 pattern modeling of speech is a good illustration. The proposed HMM includes the discrete HMM and the continuous HMM as special cases and, furthermore, can model sequences consisting of observation vectors with variable dimensionality and discrete symbols.
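
    As a rough illustration of the idea (a sketch, not code from the paper), the multi-space output probability sums, over the spaces an observation belongs to, the space weight times a Gaussian density of the continuous part, with the density of a zero-dimensional space (a discrete symbol such as "unvoiced") defined as 1. All names and numbers below are illustrative:

```python
import math

def msd_output_prob(obs, spaces):
    """Multi-space distribution output probability (sketch).

    obs: (space_indices, value) -- the observation specifies which
         spaces it belongs to and its continuous value (None if those
         spaces are zero-dimensional, e.g. an 'unvoiced' symbol).
    spaces: list of dicts with weight 'w'; continuous (1-D) spaces
         also carry a univariate Gaussian ('mu', 'var').
    """
    idx, x = obs
    total = 0.0
    for g in idx:
        sp = spaces[g]
        if 'mu' in sp:  # continuous space: Gaussian density of x
            d = x - sp['mu']
            total += sp['w'] * math.exp(-0.5 * d * d / sp['var']) \
                / math.sqrt(2 * math.pi * sp['var'])
        else:           # zero-dimensional space: density defined as 1
            total += sp['w']
    return total

# Voiced/unvoiced F0 example: space 0 models voiced log-F0, space 1 is unvoiced.
spaces = [{'w': 0.7, 'mu': 5.0, 'var': 0.04}, {'w': 0.3}]
p_unvoiced = msd_output_prob(([1], None), spaces)   # -> 0.3
p_voiced = msd_output_prob(([0], 5.0), spaces)
```

    This shows how the same model covers both kinds of observation: a discrete symbol contributes only its space weight, while a continuous value contributes a weighted density.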

  • Text-Independent Speaker Identification Using Gaussian Mixture Models Based on Multi-Space Probability Distribution

    Chiyomi MIYAJIMA  Yosuke HATTORI  Keiichi TOKUDA  Takashi MASUKO  Takao KOBAYASHI  Tadashi KITAMURA  

     
    PAPER

      Vol:
    E84-D No:7
      Page(s):
    847-855

    This paper presents a new approach to modeling speech spectra and pitch for text-independent speaker identification using Gaussian mixture models based on multi-space probability distribution (MSD-GMM). The MSD-GMM allows us to model continuous pitch values of voiced frames and discrete symbols for unvoiced frames in a unified framework. Spectral and pitch features are jointly modeled by a two-stream MSD-GMM. We derive maximum likelihood (ML) estimation formulae and a minimum classification error (MCE) training procedure for the MSD-GMM parameters. The MSD-GMM speaker models are evaluated on text-independent speaker identification tasks. The experimental results show that the MSD-GMM can efficiently model the spectral and pitch features of each speaker and outperforms conventional speaker models. The results also demonstrate the utility of MCE training of the MSD-GMM parameters and the model's robustness to inter-session variability.

  • A 16 kb/s Wideband CELP-Based Speech Coder Using Mel-Generalized Cepstral Analysis

    Kazuhito KOISHIDA  Gou HIRABAYASHI  Keiichi TOKUDA  Takao KOBAYASHI  

     
    PAPER-Speech and Hearing

      Vol:
    E83-D No:4
      Page(s):
    876-883

    We propose a 16 kb/s wideband CELP-type speech coder based on a mel-generalized cepstral (MGC) analysis technique. MGC analysis makes it possible to obtain a more accurate representation of spectral zeros than linear predictive (LP) analysis and to take a perceptual frequency scale into account. A major advantage of the proposed coder is that the benefits of the MGC representation of speech spectra can be incorporated into the CELP coding process. Subjective tests show that the proposed coder at 16 kb/s achieves a significant improvement in performance over a conventional 16 kb/s CELP coder under the same coding framework and bit allocation. Moreover, the proposed coder is found to outperform the ITU-T G.722 standard at 64 kb/s.
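
    The frequency warping underlying mel-generalized cepstral analysis can be sketched as the phase response of a first-order all-pass filter. The warping constant alpha = 0.42, a common choice for 16 kHz sampling, is an assumption here, not a value stated in the abstract:

```python
import math

def warped_frequency(omega, alpha=0.42):
    """First-order all-pass frequency warping (sketch).

    MGC analysis evaluates the spectral model on a warped frequency
    axis via the all-pass substitution
        z~^-1 = (z^-1 - alpha) / (1 - alpha * z^-1);
    the phase below is the resulting warped frequency, which
    approximates the mel scale for alpha around 0.42 at 16 kHz.
    """
    return omega + 2 * math.atan(
        alpha * math.sin(omega) / (1 - alpha * math.cos(omega)))

# Low frequencies are stretched, high frequencies compressed:
print(warped_frequency(math.pi / 2))  # > pi/2 for positive alpha
```

    The endpoints are fixed (0 maps to 0, pi maps to pi), so the warping only redistributes frequency resolution, giving the perceptually motivated emphasis on low frequencies mentioned above.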

  • Generalized Cepstral Modeling of Degraded Speech and Its Application to Speech Enhancement

    Toshio KANNO  Takao KOBAYASHI  Satoshi IMAI  

     
    PAPER-Speech and Acoustic Signal Processing

      Vol:
    E76-A No:8
      Page(s):
    1300-1307

    This paper proposes a technique for estimating speech parameters in noisy environments. The technique uses a spectral model represented by the generalized cepstrum and estimates the generalized cepstral coefficients from speech that has been degraded by additive background noise. Parameter estimation is based on a maximum a posteriori (MAP) estimation procedure. An iterative approach originally formulated for all-pole modeling is applied to generalized cepstral modeling. The generalized cepstral coefficients are obtained by an iterative procedure that consists of unbiased estimation of the log spectrum and noncausal Wiener filtering. Since the generalized cepstral model includes the all-pole model as a special case, the technique can be viewed as a generalization of all-pole modeling based on MAP estimation. The proposed technique is applied to speech enhancement, and several experimental results are shown.

  • Human Walking Motion Synthesis with Desired Pace and Stride Length Based on HSMM

    Naotake NIWASE  Junichi YAMAGISHI  Takao KOBAYASHI  

     
    PAPER

      Vol:
    E88-D No:11
      Page(s):
    2492-2499

    This paper presents a new technique for automatically synthesizing human walking motion. In the technique, a set of fundamental motion units called motion primitives is defined, and each primitive is modeled statistically from motion capture data using a hidden semi-Markov model (HSMM), which is a hidden Markov model (HMM) with explicit state duration probability distributions. The mean parameter of the probability distribution function of the HSMM is assumed to be given by a function of factors that control the walking pace and stride length, and a training algorithm, called factor adaptive training, is derived based on the EM algorithm. A parameter generation algorithm from motion primitive HSMMs with given control factors is also described. Experimental results for generating walking motion with varied walking pace and stride length are presented. The results show that the proposed technique can generate smooth and realistic motion not included in the motion capture data, without the need for smoothing or interpolation.
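
    A minimal sketch of why explicit duration distributions matter for generation: each state's most likely duration (here simply the rounded duration mean, with illustrative values) determines how many frames of that state's output mean are emitted, rather than relying on HMM self-transition probabilities:

```python
def hsmm_generate(state_means, dur_means):
    """Generate a parameter sequence from an HSMM (sketch).

    Unlike a standard HMM, each state carries an explicit duration
    distribution; here we take each distribution's (rounded) mean as
    the most likely duration and emit the state's output mean for
    that many frames. All values are illustrative, and real motion
    primitives would use multivariate joint-angle features.
    """
    seq = []
    for mu, dmu in zip(state_means, dur_means):
        seq.extend([mu] * max(1, round(dmu)))
    return seq

print(hsmm_generate([0.1, 0.5], [2.0, 3.0]))  # [0.1, 0.1, 0.5, 0.5, 0.5]
```

    In the paper's factor-adaptive setting, both the output means and the duration means would additionally be functions of the pace and stride-length factors.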

  • HMM-Based Voice Conversion Using Quantized F0 Context

    Takashi NOSE  Yuhei OTA  Takao KOBAYASHI  

     
    PAPER-Voice Conversion

      Vol:
    E93-D No:9
      Page(s):
    2483-2490

    We propose a segment-based voice conversion technique using hidden Markov model (HMM)-based speech synthesis with nonparallel training data. In the proposed technique, phoneme information with durations and a quantized F0 contour are extracted from the input speech of a source speaker and transmitted to a synthesis part. In the synthesis part, the quantized F0 symbols are used as prosodic context. A phonetically and prosodically context-dependent label sequence is generated from the transmitted phonemes and F0 symbols. Then, converted speech is generated from the label sequence with durations using the target speaker's pre-trained context-dependent HMMs. In the model training, the models of the source and target speakers can be trained separately; hence there is no need to prepare parallel speech data of the source and target speakers. Objective and subjective experimental results show that segment-based voice conversion with phonetic and prosodic contexts works effectively even when parallel speech data are not available.
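
    The F0 quantization step might be sketched as follows; the number of levels, the log-F0 range, and the symbol names are illustrative assumptions, not the paper's settings:

```python
def quantize_f0(logf0_segments, n_levels=8, lo=4.0, hi=6.0):
    """Quantize per-segment mean log-F0 into discrete symbols (sketch).

    Each symbol can then be attached to the phoneme label as prosodic
    context; None marks an unvoiced segment. The range [lo, hi] in
    log-Hz and n_levels are illustrative choices.
    """
    step = (hi - lo) / n_levels
    symbols = []
    for f in logf0_segments:
        if f is None:                      # unvoiced: no F0 level
            symbols.append('uv')
        else:                              # clamp into [0, n_levels-1]
            level = min(n_levels - 1, max(0, int((f - lo) / step)))
            symbols.append('q%d' % level)
    return symbols

print(quantize_f0([4.1, 5.3, None, 5.9]))  # ['q0', 'q5', 'uv', 'q7']
```

    Because only phoneme identities, durations, and these coarse symbols cross from the source to the target side, no frame-aligned parallel data is needed.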

  • A Training Method of Average Voice Model for HMM-Based Speech Synthesis

    Junichi YAMAGISHI  Masatsune TAMURA  Takashi MASUKO  Keiichi TOKUDA  Takao KOBAYASHI  

     
    PAPER

      Vol:
    E86-A No:8
      Page(s):
    1956-1963

    This paper describes a new training method for the average voice model used in speech synthesis, in which an arbitrary speaker's voice is generated based on speaker adaptation. When the amount of training data is limited, the distributions of an average voice model often exhibit bias depending on speaker and/or gender, which degrades the quality of synthetic speech. In the proposed method, to reduce the influence of speaker dependence, we incorporate a context clustering technique called shared-decision-tree context clustering and speaker adaptive training into the training procedure of the average voice model. Results of subjective tests show that the average voice model trained using the proposed method generates more natural-sounding speech than the conventional average voice model. Moreover, the voice characteristics and prosodic features of synthetic speech generated from the adapted model using the proposed method are closer to those of the target speaker than with the conventional method.

  • A Technique for Estimating Intensity of Emotional Expressions and Speaking Styles in Speech Based on Multiple-Regression HSMM

    Takashi NOSE  Takao KOBAYASHI  

     
    PAPER-Speech and Hearing

      Vol:
    E93-D No:1
      Page(s):
    116-124

    In this paper, we propose a technique for estimating the degree, or intensity, of emotional expressions and speaking styles appearing in speech. The key idea is based on a style control technique for speech synthesis using a multiple-regression hidden semi-Markov model (MRHSMM), and the proposed technique can be viewed as the inverse of style control. In the proposed technique, the acoustic features of spectrum, power, fundamental frequency, and duration are simultaneously modeled using the MRHSMM. We derive an algorithm, based on a maximum likelihood criterion, for estimating the explanatory variables of the MRHSMM, each of which represents the degree or intensity of the emotional expressions and speaking styles appearing in the acoustic features of speech. We show experimental results demonstrating the ability of the proposed technique using two types of speech data: simulated emotional speech and spontaneous speech with different speaking styles. The estimated values are found to correlate with human perception.
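
    A one-dimensional analogue of the estimation problem (an assumption-laden sketch, not the paper's algorithm): if each Gaussian mean is a linear function of a scalar style value, maximizing the likelihood of the observed features over that value is a weighted least-squares problem with a closed-form solution:

```python
def estimate_style(obs, a, b, var):
    """ML estimate of a scalar style value s (sketch).

    A 1-D analogue of the MRHSMM setting: each Gaussian mean is
    mu_i = a[i] * s + b[i]. With fixed variances var[i], maximizing
    the likelihood of observations obs over s gives the weighted
    least-squares solution below. Coefficients are illustrative.
    """
    num = sum(ai * (oi - bi) / vi
              for oi, ai, bi, vi in zip(obs, a, b, var))
    den = sum(ai * ai / vi for ai, vi in zip(a, var))
    return num / den

# Noise-free observations generated with style intensity s = 0.8
# are recovered (up to floating-point error):
a, b, var = [1.0, 2.0, -1.0], [0.0, 1.0, 0.5], [0.1, 0.2, 0.1]
obs = [ai * 0.8 + bi for ai, bi in zip(a, b)]
print(estimate_style(obs, a, b, var))  # ~0.8
```

    In the actual MRHSMM the style value is a vector and the regression covers state output and duration distributions, so the scalar division becomes a small linear system, but the ML principle is the same.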

  • Acoustic Modeling of Speaking Styles and Emotional Expressions in HMM-Based Speech Synthesis

    Junichi YAMAGISHI  Koji ONISHI  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER-Speech Synthesis and Prosody

      Vol:
    E88-D No:3
      Page(s):
    502-509

    This paper describes the modeling of various emotional expressions and speaking styles in synthetic speech using HMM-based speech synthesis. We show two methods for modeling speaking styles and emotional expressions. In the first method, called style-dependent modeling, each speaking style and emotional expression is modeled individually. In the second, called style-mixed modeling, each speaking style and emotional expression is treated as one of the contexts, alongside phonetic, prosodic, and linguistic features, and all speaking styles and emotional expressions are modeled simultaneously by a single acoustic model. We chose four styles of read speech -- neutral, rough, joyful, and sad -- and compared the two modeling methods using these styles. The results of subjective evaluation tests show that both modeling methods have almost the same accuracy and that it is possible to synthesize speech with a speaking style and emotional expression similar to those of the target speech. In a classification test of styles in synthesized speech, more than 80% of the speech samples generated using both models were judged to be similar to the target styles. We also show that the style-mixed modeling method yields fewer output and duration distributions than the style-dependent modeling method.

  • A Rapid Model Adaptation Technique for Emotional Speech Recognition with Style Estimation Based on Multiple-Regression HMM

    Yusuke IJIMA  Takashi NOSE  Makoto TACHIBANA  Takao KOBAYASHI  

     
    PAPER-Speech and Hearing

      Vol:
    E93-D No:1
      Page(s):
    107-115

    In this paper, we propose a rapid model adaptation technique for emotional speech recognition which enables us to extract paralinguistic information as well as linguistic information contained in speech signals. This technique is based on style estimation and style adaptation using a multiple-regression HMM (MRHMM). In the MRHMM, the mean parameters of the output probability density function are controlled by a low-dimensional parameter vector, called a style vector, which corresponds to a set of the explanatory variables of the multiple regression. The recognition process consists of two stages. In the first stage, the style vector that represents the emotional expression category and the intensity of its expressiveness for the input speech is estimated on a sentence-by-sentence basis. Next, the acoustic models are adapted using the estimated style vector, and then standard HMM-based speech recognition is performed in the second stage. We assess the performance of the proposed technique in the recognition of simulated emotional speech uttered by both professional narrators and non-professional speakers.

  • A Context Clustering Technique for Average Voice Models

    Junichi YAMAGISHI  Masatsune TAMURA  Takashi MASUKO  Keiichi TOKUDA  Takao KOBAYASHI  

     
    PAPER-Speech Synthesis and Prosody

      Vol:
    E86-D No:3
      Page(s):
    534-542

    This paper describes a new context clustering technique for average voice models, where an average voice model is a set of speaker-independent speech synthesis units. In the technique, we first train speaker-dependent models using a multi-speaker speech database, and then construct a decision tree common to these speaker-dependent models for context clustering. When a node of the decision tree is split, only context-related questions that are applicable to all speaker-dependent models are adopted. As a result, every node of the decision tree always has training data from all speakers. After construction of the decision tree, all speaker-dependent models are clustered using the common decision tree, and a speaker-independent model, i.e., an average voice model, is obtained by combining the speaker-dependent models. Results of subjective tests show that average voice models trained using the proposed technique can generate more natural-sounding speech than conventional average voice models.
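
    The question-filtering rule can be sketched as follows; the data structures are hypothetical, but the criterion matches the description above: a question is adopted only if both of its branches would receive training data from every speaker:

```python
def applicable_questions(questions, speaker_stats):
    """Filter context questions for shared-decision-tree clustering (sketch).

    questions: question name -> set of context labels answering 'yes'.
    speaker_stats: speaker -> set of context labels that speaker has
    training data for. A question is kept only when, for every
    speaker, both the 'yes' and 'no' branches are non-empty, so every
    node of the common tree keeps data from all speakers.
    """
    kept = []
    for q, matches in questions.items():
        ok = all((contexts & matches) and (contexts - matches)
                 for contexts in speaker_stats.values())
        if ok:
            kept.append(q)
    return kept

speaker_stats = {'spk_A': {'a', 'b'}, 'spk_B': {'b', 'c'}}
questions = {'is_a': {'a'}, 'is_b': {'b'}}
print(applicable_questions(questions, speaker_stats))  # ['is_b']
```

    Here 'is_a' is rejected because speaker spk_B has no contexts answering 'yes', which would leave one branch without that speaker's data.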

  • A Small-Chip-Area Transceiver IC for Bluetooth Featuring a Digital Channel-Selection Filter

    Masaru KOKUBO  Masaaki SHIDA  Takashi OSHIMA  Yoshiyuki SHIBAHARA  Tatsuji MATSUURA  Kazuhiko KAWAI  Takefumi ENDO  Katsumi OSAKI  Hiroki SONODA  Katsumi YAMAMOTO  Masaharu MATSUOKA  Takao KOBAYASHI  Takaaki HEMMI  Junya KUDOH  Hirokazu MIYAGAWA  Hiroto UTSUNOMIYA  Yoshiyuki EZUMI  Kunio TAKAYASU  Jun SUZUKI  Shinya AIZAWA  Mikihiko MOTOKI  Yoshiyuki ABE  Takao KUROSAWA  Satoru OOKAWARA  

     
    PAPER

      Vol:
    E87-C No:6
      Page(s):
    878-887

    We have proposed a new low-IF transceiver architecture that simultaneously achieves both a small chip area and good minimum input sensitivity. The distinctive point of the receiver architecture is that the complicated high-order analog channel-selection filter is replaced with the combination of a simple low-order analog filter and a sharp digital band-pass filter. We also proposed a high-speed-convergence AGC (automatic gain controller) and a demodulation block to realize the proposed digital architecture. For the transmitter, we further reduce the chip area by applying a new form of direct modulation to the VCO. Since conventional VCO direct modulation tends to suffer from variation of the modulation index with frequency, we developed a new compensation technique that minimizes this variation, and designed a low-phase-noise VCO with a new biasing method to achieve a large PSRR (power-supply rejection ratio) for the oscillation frequency. The test chip was fabricated in a 0.35-µm BiCMOS process. The chip size was 3 × 3 mm²; this very small area was made possible by the proposed transceiver architecture. The transceiver also achieved a good minimum input sensitivity of -85 dBm and showed interference performance that satisfied the requirements of the Bluetooth standard.

  • Mixture Density Models Based on Mel-Cepstral Representation of Gaussian Process

    Toru TAKAHASHI  Keiichi TOKUDA  Takao KOBAYASHI  Tadashi KITAMURA  

     
    PAPER

      Vol:
    E86-A No:8
      Page(s):
    1971-1978

    This paper defines a new kind of mixture density model for modeling a quasi-stationary Gaussian process based on a mel-cepstral representation. The conventional AR mixture density model can be applied to modeling a quasi-stationary Gaussian AR process, but it cannot model spectral zeros. In contrast, the proposed model is based on a frequency-warped exponential (EX) model. Accordingly, it can represent spectral poles and zeros with equal weights, and the model spectrum has high resolution at low frequencies. A parameter estimation algorithm for the proposed model is also derived based on the EM algorithm. Experimental results show that the proposed model outperforms the AR mixture density model for modeling a frequency-warped EX process.

  • A Style Adaptation Technique for Speech Synthesis Using HSMM and Suprasegmental Features

    Makoto TACHIBANA  Junichi YAMAGISHI  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER-Speech Synthesis

      Vol:
    E89-D No:3
      Page(s):
    1092-1099

    This paper proposes a technique for synthesizing speech with a desired speaking style and/or emotional expression, based on model adaptation in an HMM-based speech synthesis framework. Speaking styles and emotional expressions are characterized by many segmental and suprasegmental properties of both the spectral and prosodic features, so it is essential to take these into account in model adaptation. The proposed technique, called style adaptation, deals with this issue. First, a maximum likelihood linear regression (MLLR) algorithm based on the hidden semi-Markov model (HSMM) framework is presented to provide mathematically rigorous and robust adaptation of state durations and to adapt both the spectral and prosodic features. Then, a novel tying method for the regression matrices of the MLLR algorithm is presented to allow the incorporation of both segmental and suprasegmental speech features into the style adaptation. The proposed tying method uses regression class trees with contextual information. Results of several subjective tests show that these techniques can perform style adaptation while maintaining the naturalness of the synthetic speech.
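
    The core of an MLLR adaptation step is an affine transform of each Gaussian mean. Below is a minimal sketch (the transform values are illustrative; the real algorithm also estimates the transforms from adaptation data and ties them over regression classes):

```python
def mllr_adapt_mean(mu, A, b):
    """Apply an MLLR-style affine transform to a mean vector (sketch).

    In HSMM-based style adaptation, both state output means and state
    duration means are transformed as mu' = A @ mu + b, with the
    matrices shared (tied) across regression classes. Pure-Python
    matrix-vector product for clarity.
    """
    return [sum(Arow[j] * mu[j] for j in range(len(mu))) + bi
            for Arow, bi in zip(A, b)]

# An identity A with an offset b simply shifts the mean:
print(mllr_adapt_mean([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.5, -0.5]))
# [1.5, 1.5]
```

    Applying the same kind of transform to duration means is what distinguishes the HSMM formulation from standard HMM-based MLLR, which has no explicit duration distributions to adapt.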

  • Speech Synthesis with Various Emotional Expressions and Speaking Styles by Style Interpolation and Morphing

    Makoto TACHIBANA  Junichi YAMAGISHI  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER

      Vol:
    E88-D No:11
      Page(s):
    2484-2491

    This paper describes an approach to generating speech with emotional expressivity and speaking style variability. The approach is based on a speaking style and emotional expression modeling technique for HMM-based speech synthesis. We first model several representative styles, each of which is a speaking style and/or an emotional expression, in an HMM-based speech synthesis framework. Then, to generate synthetic speech with an intermediate style, we synthesize speech from a model obtained by interpolating the representative style models using a model interpolation technique. We assess the style interpolation technique with subjective evaluation tests using four representative styles of read speech -- neutral, joyful, sad, and rough -- and speech synthesized from models obtained by interpolating each pair of style models. The results show that speech synthesized from an interpolated model has a style in between the two representative ones. Moreover, we can control the degree of expressivity for speaking styles or emotions in synthesized speech by changing the interpolation ratio between neutral and another representative style. We also show that we can achieve style morphing in speech synthesis, namely, changing style smoothly from one representative style to another by gradually changing the interpolation ratio.
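
    Interpolating between representative style models can be sketched at the level of mean vectors; covariance interpolation, which the full technique also involves, is omitted here for brevity, and all values are illustrative:

```python
def interpolate_styles(means, ratios):
    """Interpolate mean vectors of representative style models (sketch).

    means: list of per-style mean vectors (one per representative
    style); ratios: interpolation weights summing to 1. The result is
    the weighted combination used as the intermediate-style mean.
    """
    dim = len(means[0])
    return [sum(r * m[d] for r, m in zip(ratios, means))
            for d in range(dim)]

# Halfway between hypothetical 'neutral' and 'joyful' style means:
print(interpolate_styles([[0.0, 1.0], [2.0, 3.0]], [0.5, 0.5]))
# [1.0, 2.0]
```

    Style morphing then amounts to sweeping the ratio pair, e.g. from (1.0, 0.0) to (0.0, 1.0), and resynthesizing at each step.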

  • A Style Control Technique for HMM-Based Expressive Speech Synthesis

    Takashi NOSE  Junichi YAMAGISHI  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER-Speech and Hearing

      Vol:
    E90-D No:9
      Page(s):
    1406-1413

    This paper describes a technique for controlling the degree of expressivity of a desired emotional expression and/or speaking style of synthesized speech in an HMM-based speech synthesis framework. With this technique, multiple emotional expressions and speaking styles of speech are modeled in a single model by using a multiple-regression hidden semi-Markov model (MRHSMM). A set of control parameters, called the style vector, is defined, and each speech synthesis unit is modeled by using the MRHSMM, in which mean parameters of the state output and duration distributions are expressed by multiple-regression of the style vector. In the synthesis stage, the mean parameters of the synthesis units are modified by transforming an arbitrarily given style vector that corresponds to a point in a low-dimensional space, called style space, each of whose coordinates represents a certain specific speaking style or emotion of speech. The results of subjective evaluation tests show that style and its intensity can be controlled by changing the style vector.

  • Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training

    Junichi YAMAGISHI  Takao KOBAYASHI  

     
    PAPER-Speech and Hearing

      Vol:
    E90-D No:2
      Page(s):
    533-543

    In speaker adaptation for speech synthesis, it is desirable to convert both voice characteristics and prosodic features such as F0 and phone duration. For simultaneous adaptation of spectrum, F0 and phone duration within the HMM framework, we need to transform not only the state output distributions corresponding to spectrum and F0 but also the duration distributions corresponding to phone duration. However, it is not straightforward to adapt the state duration because the original HMM does not have explicit duration distributions. Therefore, we utilize the framework of the hidden semi-Markov model (HSMM), which is an HMM having explicit state duration distributions, and we apply an HSMM-based model adaptation algorithm to simultaneously transform both the state output and state duration distributions. Furthermore, we propose an HSMM-based adaptive training algorithm to simultaneously normalize the state output and state duration distributions of the average voice model. We incorporate these techniques into our HSMM-based speech synthesis system, and show their effectiveness from the results of subjective and objective evaluation tests.

  • Vector Quantization of Speech Spectral Parameters Using Statistics of Static and Dynamic Features

    Kazuhito KOISHIDA  Keiichi TOKUDA  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER-Speech and Hearing

      Vol:
    E84-D No:10
      Page(s):
    1427-1434

    This paper proposes a vector quantization scheme which makes it possible to consider the dynamics of input vectors. In the proposed scheme, a linear transformation is applied to the consecutive input vectors and the resulting vector is quantized with a distortion measure defined by the statistics. At the decoder side, the output vector sequence is determined using the statistics associated with the transmitted indices in such a way that a likelihood is maximized. To solve the maximization problem, a computationally efficient algorithm is derived. The performance of the proposed method is evaluated in LSP parameter quantization. It is found that the LSP trajectories and the corresponding spectra change quite smoothly in the proposed method. It is also shown that the use of the proposed method results in a significant improvement of subjective quality.
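
    A simplified one-dimensional version of the decoder-side maximization: given static means/variances and delta means/variances, the most likely output sequence is the solution of a small linear system. This sketch uses scalar features and the delta definition delta_t = c_t - c_{t-1}, which is a simplification of the scheme in the paper:

```python
def generate_trajectory(m, vs, d, vd):
    """ML output sequence from static and delta statistics (sketch).

    Minimizes  sum_t (c_t - m[t])^2 / vs
             + sum_{t>=1} ((c_t - c_{t-1}) - d[t-1])^2 / vd,
    i.e. the negative log-likelihood under independent Gaussians on
    statics and deltas. Builds the (tridiagonal, SPD) normal
    equations A c = r and solves them by Gaussian elimination.
    """
    T = len(m)
    A = [[0.0] * T for _ in range(T)]
    r = [0.0] * T
    for t in range(T):                    # static terms
        A[t][t] += 1.0 / vs
        r[t] += m[t] / vs
    for t in range(1, T):                 # delta terms
        A[t][t] += 1.0 / vd
        A[t - 1][t - 1] += 1.0 / vd
        A[t][t - 1] -= 1.0 / vd
        A[t - 1][t] -= 1.0 / vd
        r[t] += d[t - 1] / vd
        r[t - 1] -= d[t - 1] / vd
    for i in range(T):                    # forward elimination
        for j in range(i + 1, T):
            f = A[j][i] / A[i][i]
            for k in range(i, T):
                A[j][k] -= f * A[i][k]
            r[j] -= f * r[i]
    c = [0.0] * T
    for i in reversed(range(T)):          # back substitution
        c[i] = (r[i] - sum(A[i][k] * c[k]
                           for k in range(i + 1, T))) / A[i][i]
    return c

# When the delta targets agree with the static targets, the statics
# are reproduced; disagreement pulls the trajectory toward smoothness.
print(generate_trajectory([0.0, 1.0, 2.0], 1.0, [1.0, 1.0], 1.0))
```

    Coupling frames through the delta statistics is what makes the decoded LSP trajectories, and hence the spectra, evolve smoothly.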

  • Robust F0 Estimation of Speech Signal Using Harmonicity Measure Based on Instantaneous Frequency

    Dhany ARIFIANTO  Tomohiro TANAKA  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER-Speech and Hearing

      Vol:
    E87-D No:12
      Page(s):
    2812-2820

    Borrowing the notion of instantaneous frequency developed in the context of time-frequency signal analysis, an instantaneous frequency amplitude spectrum (IFAS) is introduced for estimating the fundamental frequency of a speech signal in both noiseless and adverse environments. We define a harmonicity measure as a quantity that indicates the degree of periodic regularity in the IFAS and exhibits a substantial difference between a periodic signal and a noise-like waveform. The harmonicity measure is applied to detecting the existence of a fundamental frequency. We provide experimental examples to demonstrate the general applicability of the harmonicity measure and apply the proposed procedure to continuous Japanese speech signals. The results show that the proposed method outperforms conventional methods both with and without the presence of noise.
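
    The instantaneous frequency of a band can be approximated from the phase difference between successive analytic-signal samples. A minimal sketch, where a single complex exponential stands in for one STFT band (the full IFAS computes this per frequency bin):

```python
import cmath
import math

def instantaneous_frequency(z, fs):
    """Instantaneous frequency (Hz) from a complex signal (sketch).

    Instantaneous frequency is the time derivative of the phase; here
    it is approximated by the phase of z[t+1] * conj(z[t]) scaled by
    fs / (2*pi). For a harmonic of a periodic signal this stays near
    a multiple of F0, which is the regularity a harmonicity measure
    can quantify.
    """
    return [cmath.phase(z[t + 1] * z[t].conjugate()) * fs / (2 * math.pi)
            for t in range(len(z) - 1)]

# A 100 Hz complex exponential sampled at 8 kHz:
fs, f0 = 8000.0, 100.0
z = [cmath.exp(2j * math.pi * f0 * n / fs) for n in range(10)]
print(instantaneous_frequency(z, fs)[0])  # ~100.0
```

    For noise, the per-band instantaneous frequencies fluctuate irregularly instead of locking to harmonic multiples, which is the contrast the harmonicity measure exploits.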

