Masanori TSUJIKAWA Yoshinobu KAJIKAWA
In this paper, we propose a low-complexity and accurate noise suppression method based on an a priori SNR (Speech-to-Noise Ratio) model, for greater robustness against short-term noise fluctuation. The a priori SNR, the ratio of the speech spectrum to the noise spectrum in the spectral domain, represents the difference between speech features and noise features in feature domains such as the mel-cepstral domain and the logarithmic power spectral domain, because logarithmic operations are used for the domain conversions. An a priori SNR model can therefore be expressed simply as the difference between a speech model and a noise model, each represented by a Gaussian mixture model, and can be generated at low computational cost. By using a priori SNRs accurately estimated on the basis of such a model, it is possible to compute accurate noise suppression filter coefficients that take the variance of the noise into account, without a serious increase in computational cost over that of a conventional model-based Wiener filter (MBW). We conducted in-car speech recognition evaluations using the CENSREC-2 database; compared with a conventional MBW, the proposed method reduced the recognition error rate by 9% across all noise environments and, notably, by 11% in audio-noise environments. We also show, through implementation on a digital signal processor, that the proposed method requires only low levels of computational and memory resources.
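As context for the filtering step, the sketch below shows how an a priori SNR estimate is turned into Wiener gains in numpy. It uses the common decision-directed estimator rather than the paper's GMM-based a priori SNR model, whose exact form is not given in the abstract; the smoothing constant `alpha` is an illustrative choice.

```python
import numpy as np

def wiener_gain(noisy_power, noise_power, prev_clean_power, alpha=0.98):
    """Per-frame Wiener gain from an a priori SNR estimate.

    Decision-directed estimator used as a stand-in for the paper's
    GMM-based a priori SNR model (an assumption, not the paper's method).
    """
    eps = 1e-12
    post_snr = noisy_power / np.maximum(noise_power, eps)         # a posteriori SNR
    xi = alpha * prev_clean_power / np.maximum(noise_power, eps) \
        + (1.0 - alpha) * np.maximum(post_snr - 1.0, 0.0)         # a priori SNR
    gain = xi / (1.0 + xi)                                        # Wiener filter coefficient
    return gain, gain**2 * noisy_power                            # gain and clean-power estimate
```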
Chao-Yuan KAO Sangwook PARK Alzahra BADI David K. HAN Hanseok KO
Performance in automatic speech recognition (ASR) degrades dramatically in noisy environments. To alleviate this problem, a variety of deep networks based on convolutional and recurrent neural networks, trained with L1 or L2 losses, have been proposed. In this letter, we propose a new orthogonal gradient penalty (OGP) method for Wasserstein Generative Adversarial Networks (WGAN) applied to denoising and despeeching models. The WGAN integrates a multi-task autoencoder that estimates not only speech features but also noise features from noisy speech. While achieving a 14.1% improvement in the Wasserstein distance convergence rate, the proposed OGP-enhanced features, when tested in ASR, achieve 9.7%, 8.6%, 6.2%, and 4.8% WER improvements over the DDAE, MTAE, R-CED (CNN), and RNN models, respectively.
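For reference, here is a minimal PyTorch sketch of the standard gradient penalty (Gulrajani et al.) used to train a WGAN critic; the paper's orthogonal variant modifies this term, but its exact form is not given in the abstract, so only the baseline penalty is shown. `critic` is assumed to be any network mapping a feature batch to scalar scores.

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    # Standard WGAN-GP term: penalize the critic's gradient norm at points
    # interpolated between real and generated feature batches (shape: B x D).
    eps = torch.rand(real.size(0), 1, device=real.device)
    interp = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(outputs=critic(interp).sum(),
                                inputs=interp, create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```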
We propose a new method for combining multiple acoustic models in Gaussian mixture model (GMM) spaces for robust speech recognition. Even though large vocabulary continuous speech recognition (LVCSR) systems are now widespread, they often make egregious recognition errors resulting from unavoidable mismatches in speaking style or environment between training and real conditions. To handle this problem, a multi-style training approach has conventionally been used to train a large acoustic model on a large speech database covering various speaking styles and environmental noises. In this work, by contrast, we combine multiple sub-models trained for different speaking styles or environmental noises into a large acoustic model by maximizing the log-likelihood of the sub-model states sharing the same phonetic context and position. The combined acoustic model is then used in a new target system, which is robust to variation in speaking style and diverse environmental noise. Experimental results show that the proposed method significantly outperforms the conventional methods in two tasks: non-native English speech recognition for second-language learning systems and noise-robust point-of-interest (POI) recognition for car navigation systems.
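A hedged sketch of the combination idea: two GMM states tied to the same phonetic context and position are pooled into one state. The pooling below simply concatenates components under prior weights; the paper's log-likelihood-maximizing criterion is not reproduced here, and `prior_a` is an illustrative mixing prior.

```python
import numpy as np

def combine_gmm_states(w_a, mu_a, var_a, w_b, mu_b, var_b, prior_a=0.5):
    # Pool the mixture components of two sub-model states that share the
    # same phonetic context and position into one larger GMM state.
    weights = np.concatenate([prior_a * w_a, (1.0 - prior_a) * w_b])
    means = np.vstack([mu_a, mu_b])
    variances = np.vstack([var_a, var_b])
    return weights, means, variances
```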
Shin Jae KANG Kang Hyun LEE Nam Soo KIM
In this letter, we propose a novel supervised pre-training technique for deep neural network (DNN)-hidden Markov model systems to achieve robust speech recognition in adverse environments. In the proposed approach, our aim is to initialize the DNN parameters such that they yield abstract features robust to acoustic environment variations. To achieve this, we first derive the abstract features from an early fine-tuned DNN model trained on a clean speech database. Using the derived abstract features as target values, the standard error back-propagation algorithm with stochastic gradient descent is performed to estimate the initial parameters of the DNN. The performance of the proposed algorithm was evaluated on the Aurora-4 database, and better results were observed compared to a number of conventional pre-training methods.
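A minimal PyTorch sketch of one pre-training step under the stated scheme: the targets are abstract features from a clean-trained reference DNN, and the network being initialized is fitted to them by back-propagation with SGD. The function and variable names are hypothetical.

```python
import torch
import torch.nn as nn

def pretrain_step(student, clean_teacher, noisy_batch, clean_batch, opt):
    # Targets: abstract features from the early fine-tuned, clean-trained DNN.
    with torch.no_grad():
        target = clean_teacher(clean_batch)
    loss = nn.functional.mse_loss(student(noisy_batch), target)
    opt.zero_grad()
    loss.backward()   # standard error back-propagation
    opt.step()        # stochastic gradient descent update
    return loss.item()
```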
Yoshikazu MIYANAGA Wataru TAKAHASHI Shingo YOSHIZAWA
This paper introduces our noise-robust speech communication techniques and describes their implementation on a smart info-media system, i.e., a small robot. Our speech communication system consists of automatic speech detection, recognition, and rejection. With automatic speech detection and recognition, an observed speech waveform can be recognized without a manual trigger. In addition, with speech rejection, the system accepts only registered speech phrases and rejects all other words; in other words, although an arbitrary input speech waveform can be fed into the system and recognized, the system responds only to the registered phrases. The developed noise-robust speech processing can reduce various noises in many environments. In addition to the design of noise-robust speech recognition, we introduce the LSI design of the system: by designing an application-specific IC (ASIC) for speech recognition, we realize low power consumption and real-time processing simultaneously. This paper describes the LSI architecture of the system and its performance in field experiments, where it achieves 85-99% recognition accuracy under 0-20 dB SNR and echo conditions.
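A hypothetical sketch of the rejection logic described above: the best registered phrase is accepted only if its score sufficiently exceeds that of a garbage/filler model. The margin-based rule and all names are assumptions, not the system's actual design.

```python
import numpy as np

def reject_unregistered(phrase_loglikes, garbage_loglike, margin=2.0):
    # Accept the best registered phrase only when it beats the garbage
    # model by `margin` in log-likelihood; otherwise reject the utterance.
    best = int(np.argmax(phrase_loglikes))
    if phrase_loglikes[best] - garbage_loglike >= margin:
        return best    # index of the accepted registered phrase
    return None        # rejected: not a registered phrase
```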
Hilman PARDEDE Koji IWANO Koichi SHINODA
Spectral subtraction (SS) is an additive-noise removal method derived within the extensive statistical framework. In spectral subtraction, speech and noise spectra are assumed to follow Gaussian distributions and to be independent of each other, so noisy speech also follows a Gaussian distribution. The spectral subtraction formula is obtained by maximizing the likelihood of the noisy speech distribution with respect to its variance. However, it is well known that noisy speech observed in real situations often follows a heavy-tailed distribution rather than a Gaussian one. In this paper, we introduce a q-Gaussian distribution from non-extensive statistics to represent the distribution of noisy speech, and derive a new spectral subtraction method based on it. We found that the q-Gaussian distribution fits the noisy speech distribution better than the Gaussian distribution does. Our speech recognition experiments using the Aurora-2 database showed that the proposed method, q-spectral subtraction (q-SS), outperformed the conventional SS method.
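For contrast with q-SS, a minimal numpy sketch of the conventional Gaussian-based spectral subtraction, with the usual spectral floor to avoid negative power; the paper's q-Gaussian formula is not reproduced here, and the floor and over-subtraction factors are illustrative.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, over=1.0, floor=0.02):
    # Conventional power spectral subtraction with spectral flooring.
    clean_power = noisy_mag**2 - over * noise_mag**2
    clean_power = np.maximum(clean_power, floor * noisy_mag**2)
    return np.sqrt(clean_power)
```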
I propose an acoustic model adaptation method using bases constructed through the sparse principal component analysis (SPCA) of acoustic models trained in a clean environment. I perform experiments on adaptation to a new speaker and noise. The SPCA-based method outperforms the PCA-based method in the presence of babble noise.
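A minimal sketch of basis-based adaptation, assuming the bases are rows of a matrix obtained by (S)PCA of clean-trained acoustic model parameters: the observed mean shift is projected onto the bases by least squares and added back to the clean mean. All names are illustrative.

```python
import numpy as np

def adapt_with_bases(clean_mean, bases, observed_shift):
    # bases: (K, D) rows from (S)PCA of clean-trained model parameters.
    # Solve for the basis weights that best explain the observed mean shift.
    w, *_ = np.linalg.lstsq(bases.T, observed_shift, rcond=None)
    return clean_mean + bases.T @ w
```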
Yanqing SUN Yu ZHOU Qingwei ZHAO Yonghong YAN
This paper focuses on the problem of performance degradation in mismatched speech recognition. The F-Ratio analysis method is used to analyze the significance of different frequency bands for speech unit classification, and we find that frequencies around 1 kHz and 3 kHz, which are the upper bounds of the first and second formants for most vowels, should be emphasized more than they are in the Mel-frequency cepstral coefficients (MFCC). This analysis result is further observed to be stable across several typical mismatched situations. Analogous to the Mel-frequency scale, we therefore propose another frequency scale, the F-Ratio scale, to optimize the filter bank design for MFCC features so that each subband carries equal significance for speech unit classification. Under comparable conditions, the modified features yield relative sentence error rate reductions over MFCC of 43.20% for emotion-affected speech recognition, 35.54% and 23.03% for noisy speech recognition at 15 dB and 0 dB SNR (signal-to-noise ratio), respectively, and 64.50% on three years of 863 test data. Applying the F-Ratio analysis to the clean training set of the Aurora2 database demonstrates its robustness across languages, texts, and sampling rates.
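A generic numpy sketch of a triangular filter bank on an arbitrary frequency scale: plugging in the mel warping (2595·log10(1 + f/700)) reproduces the standard MFCC bank, and the proposed F-Ratio scale would substitute its own warping function, which is not reproduced here.

```python
import numpy as np

def triangular_filterbank(warp, inv_warp, n_filters, n_fft, sr):
    # warp: Hz -> scale units; inv_warp: scale units -> Hz.
    lo, hi = warp(0.0), warp(sr / 2.0)
    centers_hz = inv_warp(np.linspace(lo, hi, n_filters + 2))
    bins = np.floor((n_fft + 1) * centers_hz / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb

# Mel example: triangular_filterbank(lambda f: 2595*np.log10(1 + f/700),
#                                    lambda m: 700*(10**(m/2595) - 1),
#                                    23, 512, 8000)
```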
In this paper, we describe a new framework of feature combination in the cepstral domain for multi-input robust speech recognition. Working in the cepstral domain has various advantages over working in the time or hypothesis domain: it is stable, easy to maintain, and less expensive because it does not require precise calibration, and it is easy to configure within a complex speech recognition system. However, it is not straightforward to improve recognition performance by increasing the number of inputs, so we introduce the concept of variance re-scaling to compensate for the negative effect of averaging several input features. Finally, we exploit another advantage of working in the cepstral domain: speech can be modeled using hidden Markov models, and the model can be used as prior knowledge. This approach is formulated as a new algorithm, referred to as Hypothesis-Based Feature Combination. The effectiveness of the various algorithms is evaluated using two sets of speech databases. We also discuss automatic optimization of some parameters in the proposed algorithms.
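A minimal numpy sketch of the variance re-scaling idea: averaging N input feature streams shrinks the variance of the combined features, so each dimension is re-scaled toward a reference standard deviation (e.g., from single-channel training data). The re-scaling rule below is a plain illustration, not the paper's exact formula.

```python
import numpy as np

def combine_with_rescaling(feature_streams, target_std):
    # feature_streams: (n_inputs, frames, dims) cepstral features.
    avg = feature_streams.mean(axis=0)
    mu, sd = avg.mean(axis=0), avg.std(axis=0)
    return mu + (avg - mu) * (target_std / np.maximum(sd, 1e-8))
```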
Youngjoo SUH Hoirin KIM Munchurl KIM
In this letter, we propose a new histogram equalization method to compensate for acoustic mismatches, mainly caused by additive noise and channel distortion, in speech recognition. The proposed method employs an improved test cumulative distribution function (CDF), obtained by more accurately smoothing the conventional order-statistics-based test CDF with window functions, for robust feature compensation. Experiments on the AURORA 2 framework confirmed that the proposed method is effective in compensating speech recognition features, reducing the average relative error by 13.12% over the order-statistics-based conventional histogram equalization method and by 58.02% over mel-cepstral features on the three test sets.
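A minimal numpy sketch of order-statistics histogram equalization for one feature dimension: each test value is mapped through its empirical CDF rank onto the training distribution's quantiles. The paper's contribution, smoothing the test CDF with window functions, is omitted here.

```python
import numpy as np

def histogram_equalize(test_feat, ref_sorted):
    # test_feat: 1-D feature sequence; ref_sorted: sorted training values.
    ranks = np.argsort(np.argsort(test_feat))
    cdf = (ranks + 0.5) / len(test_feat)             # order-statistics test CDF
    idx = np.minimum((cdf * len(ref_sorted)).astype(int), len(ref_sorted) - 1)
    return ref_sorted[idx]                           # map onto reference quantiles
```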
Longbiao WANG Seiichi NAKAGAWA Norihide KITAOKA
In a distant-talking environment, the channel impulse response is longer than the short-term spectral analysis window, so conventional short-term-spectrum-based Cepstral Mean Normalization (CMN) is not effective under these conditions. In this paper, we propose a robust speech recognition method that combines short-term-spectrum-based CMN with a long-term one. We assume that a static speech segment (a vowel, for example) affected by reverberation can be modeled by long-term cepstral analysis, so the effect of long reverberation on a static speech segment may be compensated by long-term-spectrum-based CMN. The cepstral distance between neighboring frames is used to discriminate static speech segments (long-term spectrum) from non-static ones (short-term spectrum), and the cepstra of the static and non-static segments are normalized by the corresponding cepstral means. In a previous study, we proposed an environmentally robust speech recognition method based on Position-Dependent CMN (PDCMN), which compensates for channel distortion depending on speaker position and is more efficient than conventional CMN. In this paper, the concept of combining short-term and long-term-spectrum-based CMN is extended to PDCMN; we call this Variable-Term-spectrum-based PDCMN (VT-PDCMN). Since PDCMN/VT-PDCMN cannot normalize speaker variations, because a position-dependent cepstral mean contains the average speaker characteristics over all speakers, we also combine PDCMN/VT-PDCMN with conventional CMN in this study. We conducted experiments on our proposed method using limited-vocabulary (100-word) distant-talking isolated word recognition in a real environment. The proposed method achieved a relative error reduction rate of 60.9% over conventional short-term-spectrum-based CMN and 30.6% over short-term-spectrum-based PDCMN.
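A minimal numpy sketch of the static/non-static split: frames whose neighboring-frame cepstral distance falls below a threshold are treated as static and normalized by one mean, the rest by another. Here both means come from the utterance itself; PDCMN/VT-PDCMN would substitute position-dependent means. The threshold value is illustrative.

```python
import numpy as np

def variable_term_cmn(cep, thresh=0.5):
    # cep: (frames, dims). Neighboring-frame cepstral distance marks
    # static (reverberation-dominated) vs. non-static segments.
    dist = np.linalg.norm(np.diff(cep, axis=0), axis=1)
    static = np.concatenate([[False], dist < thresh])
    out = cep.copy()
    if static.any():
        out[static] -= cep[static].mean(axis=0)      # long-term-style mean
    if (~static).any():
        out[~static] -= cep[~static].mean(axis=0)    # short-term-style mean
    return out
```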
This letter describes a robust speech recognition system that recognizes fast speech by stretching the length of the utterance in the cepstrum domain. The degree of stretching for an utterance is determined from its rate of speech (ROS), estimated with a maximum likelihood (ML) criterion. The proposed method was evaluated on 10-digit mobile phone numbers, and the simulation results show that the overall error rate was reduced by 17.8% when the proposed method was employed.
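A minimal numpy sketch of the stretching step: cepstral trajectories are linearly interpolated to lengthen a fast utterance by a factor that, in the letter, is chosen from the ROS under an ML criterion (the selection itself is not shown here).

```python
import numpy as np

def stretch_cepstra(cep, factor):
    # cep: (frames, dims); factor > 1 lengthens the utterance.
    n = cep.shape[0]
    src = np.linspace(0.0, n - 1.0, int(round(n * factor)))
    i0 = np.floor(src).astype(int)
    i1 = np.minimum(i0 + 1, n - 1)
    frac = (src - i0)[:, None]
    return (1.0 - frac) * cep[i0] + frac * cep[i1]
```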
Tetsuo KOSAKA Masaharu KATOH Masaki KOHDA
This paper introduces new methods of robust speech recognition using discrete-mixture HMMs (DMHMMs). The aim of this work is to develop robust speech recognition for adverse conditions that contain both stationary and non-stationary noise. In particular, we focus on impulsive noise, a major problem in practical speech recognition systems. We adopt two strategies to solve this problem. In the first, adverse conditions are represented by an acoustic model; this requires a large amount of training data and accurate acoustic models to represent a variety of acoustic environments, and is suitable for recognition in stationary or slowly varying noise conditions. The second is based on the idea of treating corrupted frames with a compensation method to reduce their adverse effect. Since impulsive noise has a wide variety of features and is difficult to model, the second strategy is employed for it. To realize these strategies, we propose two methods, both based on the DMHMM framework, which is a type of discrete HMM (DHMM). The first is a MAP-based estimation method for DMHMM parameters, aimed at improving trainability. The second is a method of compensating the observation probabilities of DMHMMs with a threshold to reduce the adverse effect of outlier values: observation probabilities of impulsive noise tend to be much smaller than those of normal speech, so flooring the observation probability reduces the adverse effect caused by impulsive noise. Experimental evaluations on Japanese LVCSR for read newspaper speech showed that the proposed method achieved an average error rate reduction of 48.5% in impulsive noise conditions. Experimental results in adverse conditions containing both stationary and impulsive noise also showed that the proposed method achieved an average error rate reduction of 28.1%.
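The flooring operation itself is a one-liner; the sketch below shows it in numpy for log-domain observation probabilities. The floor value would be tuned on data and is an assumption here.

```python
import numpy as np

def floor_log_obs_prob(log_b, floor_log=-20.0):
    # Clip extremely small state log-likelihoods so that frames hit by
    # impulsive noise cannot dominate the accumulated path score.
    return np.maximum(log_b, floor_log)
```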
Zhipeng ZHANG Toshiaki SUGIMURA Sadaoki FURUI
This paper proposes the application of tree-structured clustering to noisy speech collected under various SNR conditions, within the framework of piecewise-linear transformation (PLT)-based HMM adaptation to noisy speech. Three clustering methods are described: a one-step clustering method that integrates noise and SNR conditions, and two two-step clustering methods that construct trees for each SNR condition. According to the clustering results, a noisy speech HMM is built for each node of the tree structure. Based on the likelihood maximization criterion, the HMM that best matches the input speech is selected by tracing the tree from top to bottom, and the selected HMM is further adapted by linear transformation. The proposed methods are evaluated by applying them to a Japanese dialogue recognition system. The results confirm that the proposed methods are effective in recognizing both digitally noise-added speech and actual noisy speech produced by a wide range of speakers under various noise conditions. The results also indicate that the one-step clustering method gives better performance than the two-step clustering methods.
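A hedged sketch of the top-down selection, assuming tree nodes expose `children` and `hmm` attributes and a `loglike(hmm, features)` scorer (all hypothetical names): descend while a child model improves the likelihood of the input speech, then return the best node's HMM for linear-transform adaptation.

```python
def select_hmm(root, features, loglike):
    best = root
    while best.children:
        child = max(best.children, key=lambda c: loglike(c.hmm, features))
        if loglike(child.hmm, features) <= loglike(best.hmm, features):
            break                      # no child beats the current node
        best = child
    return best.hmm                    # model passed on to PLT adaptation
```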
Sungyun JUNG Jongmok SON Keunsung BAE
In this paper, we propose a new feature extraction method that combines HMT-based denoising with weighted filter bank analysis for robust speech recognition. The proposed method consists of two stages in cascade: the first is a denoising process based on the wavelet-domain hidden Markov tree (HMT) model, and the second is filter bank analysis with weighting coefficients obtained from the residual noise of the first stage. To evaluate the performance of the proposed method, recognition experiments were carried out for additive white Gaussian and pink noise at signal-to-noise ratios from 25 dB down to 0 dB. The experimental results demonstrate the superiority of the proposed method over conventional ones.
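A minimal numpy sketch of the second stage, with an assumed weighting rule (inverse residual-noise energy per channel); the paper's actual weighting coefficients are derived differently and are not reproduced here.

```python
import numpy as np

def weighted_fbank(power_spec, fbank, residual_noise_power, eps=1e-8):
    # fbank: (n_filters, n_bins). Down-weight channels where the HMT
    # denoising stage left more residual noise energy.
    weights = 1.0 / (1.0 + fbank @ residual_noise_power)
    return weights * np.log(np.maximum(fbank @ power_spec, eps))
```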
Koji IWANO Takahiro SEKI Sadaoki FURUI
This paper proposes a noise robust speech recognition method using prosodic information. In Japanese, the fundamental frequency (F0) contour represents phrase intonation and word accent information; consequently, it conveys information about prosodic phrases and word boundaries. This paper first describes a noise robust F0 extraction method using the Hough transform, which achieves high extraction rates under various noise environments. It then proposes a robust speech recognition method using multi-stream HMMs that model both segmental spectral and F0 contour information. Speaker-independent experiments are conducted using connected digits uttered by 11 male speakers under various noise and SNR conditions. The recognition error rate is reduced in all noise conditions, and the best absolute improvement in digit accuracy is about 4.5%. This improvement is achieved by robust digit boundary detection using the prosodic information.
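In a multi-stream HMM, the per-state score is a weighted combination of stream log-likelihoods; a minimal sketch follows, with illustrative (not the paper's) stream weights.

```python
def multistream_log_prob(log_b_spectral, log_b_f0, w_spec=0.7, w_f0=0.3):
    # Exponent (stream) weights combine the segmental spectral stream and
    # the F0-contour stream into one state log-likelihood.
    return w_spec * log_b_spectral + w_f0 * log_b_f0
```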
Yasunari OBUCHI Nobuo HATAOKA Richard M. STERN
In this paper, we describe a new framework of feature compensation for robust speech recognition that is especially suitable for small devices. We introduce Delta-cepstrum Normalization (DCN), which normalizes not only cepstral coefficients but also their time derivatives. Cepstral Mean Normalization (CMN) and Mean and Variance Normalization (MVN) are fast and efficient environmental adaptation algorithms that have been used widely; they normalize the cepstral coefficients to remove irrelevant information from them, but do not normalize the time-derivative parameters, for which such a reduction of irrelevant information is insufficient. Histogram Equalization (HEQ), in contrast, provides better compensation and can be applied even to the delta and delta-delta cepstra. We investigate various implementations of DCN and show that the best performance is achieved when the normalization of the cepstra and the delta cepstra is mutually interdependent. We evaluate the performance of DCN using speech data recorded on a PDA, and DCN provides significant improvements, giving a 15% relative word error rate reduction over HEQ. We also examine the possibility of combining Vector Taylor Series (VTS) compensation with DCN; although some combinations do not improve the performance of VTS, the best combination gives better performance than VTS alone. Finally, we discuss the advantage of DCN in terms of computation speed.
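A minimal numpy sketch of one DCN variant with interdependent normalization: the static cepstra are mean/variance-normalized first, the deltas are recomputed from the normalized cepstra, and then normalized in turn. This ordering is one of several implementations the paper investigates; the details here are assumptions.

```python
import numpy as np

def delta(x, k=2):
    # Standard regression-based time derivative over +/-k frames.
    pad = np.pad(x, ((k, k), (0, 0)), mode='edge')
    num = sum(i * (pad[k + i:len(x) + k + i] - pad[k - i:len(x) + k - i])
              for i in range(1, k + 1))
    return num / (2 * sum(i * i for i in range(1, k + 1)))

def dcn(cep):
    # cep: (frames, dims). Normalize statics, then deltas of the
    # normalized statics, making the two normalizations interdependent.
    z = (cep - cep.mean(axis=0)) / np.maximum(cep.std(axis=0), 1e-8)
    dz = delta(z)
    dz = (dz - dz.mean(axis=0)) / np.maximum(dz.std(axis=0), 1e-8)
    return np.hstack([z, dz])
```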