Han MA Qiaoling ZHANG Roubing TANG Lu ZHANG Yubo JIA
Recently, robust speech recognition for real-world applications has attracted much attention. This paper proposes a robust speech recognition method based on the teacher-student learning framework for domain adaptation. In particular, the student network is trained with a novel optimization criterion defined on the encoder outputs of the teacher and student networks rather than on the final output posterior probabilities. The aim is to map noisy audio into the same embedding space as clean audio, so that the student network adapts to the noise domain. Comparative experiments demonstrate that the proposed method achieves good robustness against noise.
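As a rough illustration of an embedding-level criterion, the sketch below computes a mean squared error between teacher and student encoder outputs. The abstract does not state the exact form of the criterion, so the squared-error distance, the function name `embedding_distillation_loss`, and the toy vectors are all assumptions.

```python
def embedding_distillation_loss(teacher_frames, student_frames):
    """Mean squared error between teacher and student encoder outputs.

    Each argument is a list of frames; each frame is a list of floats.
    The squared-error form is an assumption, not the paper's stated loss.
    """
    if len(teacher_frames) != len(student_frames):
        raise ValueError("teacher and student need the same number of frames")
    total = 0.0
    count = 0
    for t_vec, s_vec in zip(teacher_frames, student_frames):
        if len(t_vec) != len(s_vec):
            raise ValueError("frame dimensions differ")
        for t, s in zip(t_vec, s_vec):
            total += (t - s) ** 2
            count += 1
    return total / count


# Hypothetical encoder outputs: 2 frames, 3 dimensions each.
clean_embedding = [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]]  # teacher, clean audio
noisy_embedding = [[0.5, -1.0, 1.0], [1.0, 1.0, -0.5]]  # student, noisy audio
loss = embedding_distillation_loss(clean_embedding, noisy_embedding)
```

Minimizing such a loss with respect to the student parameters pulls the noisy-audio embedding toward the clean-audio embedding, which is the behaviour the abstract describes.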
Bei ZHAO Chen CHENG Zhenguo MA Feng YU
Cross-correlation is a general way to estimate time delay of arrival (TDOA), with a computational complexity of O(n log n) using the fast Fourier transform. However, since only one spike is required for time delay estimation, the complexity can be further reduced. Guided by the Chinese Remainder Theorem (CRT), this paper presents a new approach called Co-prime Aliased Sparse FFT (CASFFT), which requires O(n^(1-1/d) log n) multiplications and O(mn) additions, where m is the smooth factor and d is the stage number. By adjusting these parameters, it can achieve a balance between runtime and noise robustness. Furthermore, it has a clear advantage in parallelism and runtime over a large range of signal-to-noise ratio (SNR) conditions. The accuracy and feasibility of this algorithm are analyzed in theory and verified by experiment.
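For orientation, the baseline idea of locating the single cross-correlation spike can be sketched as follows. This is a direct O(n^2) time-domain search, not the FFT-based method and not CASFFT; the function name `estimate_delay` and the test pulse are hypothetical.

```python
def estimate_delay(reference, delayed, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, ref_sample in enumerate(reference):
            j = i + lag
            if 0 <= j < len(delayed):
                score += ref_sample * delayed[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical pulse received by a second sensor 3 samples later.
pulse = [0.0, 1.0, 3.0, -2.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
true_delay = 3
shifted = [0.0] * true_delay + pulse[:-true_delay]
estimated = estimate_delay(pulse, shifted, max_lag=5)
```

Only the position of the largest correlation value is used, which is why a sparse transform that recovers one dominant spike can replace the full computation.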
The sparse Fourier transform (SFT) seeks to recover k non-negligible Fourier coefficients from a k-sparse signal of length N (k ≪ N). A single frequency signal can be recovered via the Chinese remainder theorem (CRT) with sub-sampled discrete Fourier transforms (DFTs). However, when there are multiple non-negligible coefficients, more of them may collide, and multiple stages of sub-sampled DFTs are needed to deal with such collisions. In this paper, we propose a combinatorial aliasing-based SFT (CASFT) algorithm that is robust to noise and greatly reduces the number of stages by iteratively recovering coefficients. First, CASFT detects collisions and recovers coefficients via the CRT in a single stage. These coefficients are then subtracted from each stage, and the process iterates through the other stages. With a computational complexity of O(k log k log^2 N) and sample complexity of O(k log^2 N), CASFT is a novel and efficient SFT algorithm.
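The single-frequency case mentioned above can be sketched in a few lines: sub-sampling a length-N tone down to p samples aliases its frequency bin to (bin mod p), and the CRT combines the residues from co-prime lengths. The signal length, the moduli, and the helper names `dft_peak` and `crt` are illustrative assumptions; collision handling, noise, and the iterative stages of CASFT are not covered.

```python
import cmath


def dft_peak(samples):
    """Return the index of the largest-magnitude bin of a naive DFT."""
    n = len(samples)
    magnitudes = []
    for k in range(n):
        acc = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        magnitudes.append(abs(acc))
    return max(range(n), key=lambda k: magnitudes[k])


def crt(residues, moduli):
    """Solve x = r_i (mod m_i) for pairwise co-prime moduli."""
    x, modulus = 0, 1
    for r, m in zip(residues, moduli):
        inverse = pow(modulus, -1, m)      # modular inverse (Python 3.8+)
        step = ((r - x) * inverse) % m
        x += modulus * step
        modulus *= m
    return x


N = 35                  # signal length, product of the co-prime 5 and 7
true_bin = 23           # hypothetical single non-zero frequency bin
signal = [cmath.exp(2j * cmath.pi * true_bin * t / N) for t in range(N)]

moduli = [5, 7]
residues = []
for p in moduli:
    subsampled = signal[::N // p]            # keep p samples spaced N/p apart
    residues.append(dft_peak(subsampled))    # peak lands at true_bin mod p

recovered_bin = crt(residues, moduli)
```

Here 5 + 7 = 12 samples are enough to identify one bin out of 35, which is the saving that sub-sampled DFTs offer over a full-length transform.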
Ryo AIHARA Ryoichi TAKASHIMA Tetsuya TAKIGUCHI Yasuo ARIKI
This paper presents a voice conversion (VC) technique for noisy environments based on a sparse representation of speech. Sparse representation-based VC using non-negative matrix factorization (NMF) is employed for noise-added spectral conversion between different speakers. In our previous exemplar-based VC method, source exemplars and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The input source signal is represented using the source exemplars and their weights. Then, the converted speech is constructed from the target exemplars and the weights related to the source exemplars. However, this exemplar-based approach needs to hold all training exemplars (frames), and it requires long computation times to obtain the weights of the source exemplars. In this paper, we propose a framework to train the basis matrices of the source and target exemplars so that they have a common weight matrix. By using the basis matrices instead of the exemplars, the VC is performed with lower computation times than with the exemplar-based method. The effectiveness of this method was confirmed in speaker conversion experiments using noise-added speech data, by comparing it with an exemplar-based method and a conventional Gaussian mixture model (GMM)-based method.
Ryoichi TAKASHIMA Tetsuya TAKIGUCHI Yasuo ARIKI
This paper presents a voice conversion (VC) technique for noisy environments, where parallel exemplars are introduced to encode the source speech signal and synthesize the target speech signal. The parallel exemplars (dictionary) consist of the source exemplars and target exemplars, which contain the same texts uttered by the source and target speakers. The input source signal is decomposed into the source exemplars, noise exemplars and their weights (activities). Then, by using the weights of the source exemplars, the converted signal is constructed from the target exemplars. We carried out speaker conversion tasks using clean speech data and noise-added speech data. The effectiveness of this method was confirmed by comparing it with a conventional Gaussian mixture model (GMM)-based method.
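The decompose-then-resynthesize flow described above can be sketched with a tiny dictionary. The weights are estimated here with Euclidean multiplicative updates, which is an assumption: the abstract does not name the cost function or any sparsity constraint. The toy dictionaries and the names `matvec` and `estimate_weights` are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def estimate_weights(dictionary, spectrum, iterations=500, eps=1e-12):
    """Non-negative weights h such that dictionary * h approximates spectrum."""
    dictionary_t = [list(col) for col in zip(*dictionary)]
    weights = [1.0] * len(dictionary_t)
    numerator = matvec(dictionary_t, spectrum)   # constant across iterations
    for _ in range(iterations):
        approximation = matvec(dictionary, weights)
        denominator = matvec(dictionary_t, approximation)
        weights = [w * n / (d + eps)
                   for w, n, d in zip(weights, numerator, denominator)]
    return weights


# Toy dictionaries: rows are frequency bins, columns are exemplars.
source_exemplars = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_exemplars = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
noise_exemplars = [[1.0], [1.0], [0.0]]

# Noisy input frame: 0.5 * source[0] + 2.0 * source[1] + 1.0 * noise[0].
noisy_frame = [1.5, 3.0, 2.5]

# Decompose over the joint source + noise dictionary.
joint = [s + n for s, n in zip(source_exemplars, noise_exemplars)]
weights = estimate_weights(joint, noisy_frame)
source_weights = weights[:2]          # discard the noise activity

# Rebuild the frame from the target exemplars with the source weights.
converted_frame = matvec(target_exemplars, source_weights)
```

Dropping the noise activity before resynthesis is what makes the conversion noise-robust in this sketch: the noise component is explained by its own exemplar and never reaches the target side.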
During the production of speech signals, the vowel onset point (VOP) is an important event that carries useful information for many speech processing tasks, such as consonant-vowel unit recognition and speech endpoint detection. In order to realize accurate automatic detection of VOPs, this paper proposes a reliable method using the energy characteristics of homomorphic filtered spectral peaks. The homomorphic filtering helps to separate the slowly varying vocal tract system characteristics from the rapidly fluctuating excitation characteristics in the cepstral domain. The distinct vocal tract shape related to vowels is obtained, and the peaks in the estimated vocal tract spectrum provide accurate and stable information for VOP detection. Performance of the proposed method is compared with the existing method, which uses the combination of evidence from the excitation source, spectral peaks, and modulation spectrum energies. The detection rate at different time resolutions, the missing rate, and the spurious rate are used for comprehensive evaluation of the performance on continuous speech taken from the TIMIT database. The detection accuracy of the proposed method is 74.14% for ±10 ms resolution and it increases to 96.33% for ±40 ms resolution with 3.67% missing error and 4.14% spurious error, much better than the results obtained by the combined approach at each specified time resolution, especially at the higher resolutions of ±10 to ±30 ms. In the cases of speech corrupted by white noise, pink noise and F-16 noise, the proposed method also shows significant improvement in performance compared with the existing method.
Shang CAI Yeming XIAO Jielin PAN Qingwei ZHAO Yonghong YAN
Mel Frequency Cepstral Coefficients (MFCC) are the most popular acoustic features used in automatic speech recognition (ASR), mainly because the coefficients capture the most useful information of the speech and fit well with the assumptions used in hidden Markov models. As is well known, MFCCs already employ several principles which have known counterparts in the peripheral properties of human hearing: decoupling across frequency, mel-warping of the frequency axis, log-compression of energy, etc. It is natural to introduce more mechanisms of the auditory periphery to improve the noise robustness of MFCC. In this paper, a k-nearest neighbors based frequency masking filter is proposed to reduce the audibility of spectral valleys, which are sensitive to noise. In addition, Moore and Glasberg's critical band equivalent rectangular bandwidth (ERB) expression is utilized to determine the filter bandwidth. Furthermore, a new bandpass infinite impulse response (IIR) filter is proposed to imitate the temporal masking phenomenon of the human auditory system. These three auditory perceptual mechanisms are combined with the standard MFCC algorithm in order to investigate their effects on ASR performance, and a revised MFCC extraction scheme is presented. Recognition performance with the standard MFCC, RASTA perceptual linear prediction (RASTA-PLP) and the proposed feature extraction scheme is evaluated on a medium-vocabulary isolated-word recognition task and a more complex large vocabulary continuous speech recognition (LVCSR) task. Experimental results show that consistent robustness against background noise is achieved on these two tasks, and the proposed method outperforms both the standard MFCC and RASTA-PLP.
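To make the mel warping and the ERB bandwidth concrete, the sketch below evaluates two widely used closed forms. The abstract does not say which variants the paper uses, so both formulas are assumptions: the mel mapping 2595 log10(1 + f/700) and the Glasberg and Moore (1990) form ERB(f) = 24.7 (4.37 f/1000 + 1), with f in Hz.

```python
import math


def hz_to_mel(freq_hz):
    """Common mel-scale mapping (assumed variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def erb_bandwidth_hz(freq_hz):
    """Equivalent rectangular bandwidth in Hz (assumed 1990 form)."""
    return 24.7 * (4.37 * freq_hz / 1000.0 + 1.0)


# Bandwidth a masking filter centred on each frequency would use.
center_frequencies_hz = [250.0, 1000.0, 4000.0]
bandwidths_hz = [erb_bandwidth_hz(f) for f in center_frequencies_hz]
mel_positions = [hz_to_mel(f) for f in center_frequencies_hz]
```

The ERB grows roughly linearly with centre frequency, so a masking filter built this way is narrow at low frequencies and wide at high frequencies, in line with the mel-spaced filterbank.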
Masaki KOBAYASHI Hirofumi YAMADA Michimasa KITAHARA
Complex-valued Associative Memory (CAM) is an advanced model of Hopfield Associative Memory. The CAM is based on multi-state neurons and has high representational ability. Lee proposed gradient descent learning for the CAM to improve the storage capacity. It is based only on the phases of the input signals. In this paper, we propose another type of gradient descent learning based on both the phases and the amplitudes. The proposed learning method improves the noise robustness and accelerates learning.
Satoshi KOBASHIKAWA Satoshi TAKAHASHI
Users require speech recognition systems that offer rapid response and high accuracy concurrently. Speech recognition accuracy is degraded by additive noise, imposed by ambient noise, and convolutional noise, created by space transfer characteristics, especially in distant talking situations. Against each type of noise, existing model adaptation techniques achieve robustness by using HMM-composition and CMN (cepstral mean normalization). Since they need an additive noise sample as well as a user speech sample to generate the models required, they cannot achieve rapid response, though it may be possible to capture just the additive noise in a previous step. In the previous step, the technique proposed herein uses just the additive noise to generate an adapted and normalized model against both types of noise. When the user's speech sample is captured, only online-CMN needs to be performed to start the recognition processing, so the technique offers rapid response. In addition, to cover the unpredictable S/N values possible in real applications, the technique creates several S/N HMMs. Simulations using artificial speech data show that the proposed technique increased the character correct rate by 11.62% compared to CMN.
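As background on why CMN counters convolutional noise: a fixed channel adds a constant offset to every cepstral frame, and subtracting the per-coefficient mean removes it. The sketch below shows the plain batch form over a whole utterance, not the online variant the abstract refers to; the frame values and the channel offset are made up.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all frames of an utterance."""
    num_frames = len(frames)
    dim = len(frames[0])
    mean = [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in frames]


# Hypothetical 2-dimensional cepstral frames of a clean utterance.
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]

# A stationary channel appears as a constant additive offset in the cepstrum.
channel_offset = [0.5, -1.5]
distorted = [[c + o for c, o in zip(frame, channel_offset)] for frame in clean]

normalized_clean = cepstral_mean_normalize(clean)
normalized_distorted = cepstral_mean_normalize(distorted)
```

After normalization the clean and channel-distorted utterances give the same features, so a model trained on normalized data is insensitive to the channel.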
Konstantin MARKOV Tomoko MATSUI Rainer GRUHN Jinsong ZHANG Satoshi NAKAMURA
This paper presents the ATR speech recognition system designed for the DARPA SPINE2 evaluation task. The system is capable of dealing with speech from highly variable, real-world noisy conditions and communication channels. A number of robust techniques are implemented, such as differential spectrum mel-scale cepstrum features, on-line MLLR adaptation, and word-level hypothesis combination, which led to a significant reduction in the word error rate.
Ashraf A. M. KHALAF Kenji NAKAYAMA
A nonlinear time series predictor was proposed, in which a nonlinear sub-predictor (NSP) and a linear sub-predictor (LSP) are combined in a cascade form. This model is called a "hybrid predictor" here. A nonlinearity analysis method for the input time series was also proposed to estimate the network size. We have considered the nonlinear prediction problem as a pattern mapping one. A multi-layer neural network, which consists of sigmoidal hidden neurons and a single linear output neuron, has been employed as the nonlinear sub-predictor. Since the NSP includes nonlinear functions, it can predict the nonlinearity of the input time series. However, the prediction is not complete in some cases. Therefore, the NSP prediction error is further compensated for by employing a linear sub-predictor after the NSP. In this paper, the prediction mechanism and the roles of the NSP and the LSP are theoretically and experimentally analyzed. The role of the NSP is to predict the nonlinear and some part of the linear property of the time series. The LSP works to predict the NSP prediction error. Furthermore, predictability of the hybrid predictor for noisy time series is investigated. The sigmoidal functions used in the NSP can suppress the noise effects by using their saturation regions. Computer simulations, using several kinds of nonlinear time series and other conventional predictor models, are demonstrated. The theoretical analysis of the predictor mechanism is confirmed through these simulations. Furthermore, predictability is improved by slightly expanding or shifting the input potential of the hidden neurons toward the saturation regions in the learning process.
Kazutoshi KOBAYASHI Kazuhiko TERADA Hidetoshi ONODERA Keikichi TAMARU
We propose a real-time low-rate video compression algorithm using fixed-rate multi-stage hierarchical vector quantization. Vector quantization is suitable for mobile computing, since it requires little computation for decoding. The proposed algorithm enables transmission of 10 QCIF frames per second over a low-rate 29.2 kbps mobile channel. A frame is hierarchically divided into sub-blocks. Each frame is compressed at a fixed rate regardless of video activity. For active frames, large sub-blocks for low resolution are mainly transmitted. For inactive frames, smaller sub-blocks for high resolution can be transmitted successively after a motion-compensated frame. We develop a compression system which consists of a host computer and a memory-based processor for the nearest neighbor search in VQ. Our algorithm guarantees real-time decoding on a low-performance CPU.
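The asymmetry between encoding and decoding in vector quantization can be seen in a minimal single-stage sketch: the encoder runs a nearest neighbor search over the codebook, while the decoder only performs a table lookup. This is plain flat VQ, not the multi-stage hierarchical scheme of the paper; the codebook, the 4-pixel blocks, and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(blocks, codebook):
    """Map each block to the index of its nearest codeword (costly step)."""
    return [
        min(range(len(codebook)),
            key=lambda i: squared_distance(block, codebook[i]))
        for block in blocks
    ]


def decode(indices, codebook):
    """Rebuild blocks by table lookup (cheap step)."""
    return [codebook[i] for i in indices]


# Hypothetical codebook of 4-pixel blocks and three input blocks.
codebook = [[0, 0, 0, 0], [255, 255, 255, 255], [0, 255, 0, 255]]
blocks = [[10, 5, 0, 8], [250, 240, 255, 251], [3, 250, 12, 245]]

indices = encode(blocks, codebook)
reconstructed = decode(indices, codebook)
```

Only the indices are transmitted, so with a fixed codebook size and a fixed number of blocks per frame the bit rate is constant, and the receiver needs no arithmetic beyond the lookup.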
Shoji KAJITA Kazuya TAKEDA Fumitada ITAKURA
This paper describes subband-crosscorrelation analysis (SBXCOR) using two input channel signals. SBXCOR is an extension of subband-autocorrelation analysis (SBCOR), a signal processing technique that extracts periodicities associated with the inverse of center frequencies present in speech signals. In addition, to extract more periodicity information associated with the inverse of center frequencies, multi-delay weighting (MDW) processing is applied to SBXCOR. In experiments, the noise robustness of SBXCOR is evaluated using a DTW word recognizer under (1) a simulated acoustic condition with white noise and (2) a real acoustic condition in a soundproof room with human speech-like noise. The results show that, under the simulated acoustic condition, SBXCOR is more robust than the conventional one-channel SBCOR, but less robust than SBCOR extracted from the two-channel-summed signal. Furthermore, by applying MDW processing, the performance of SBXCOR improved by about 2% at SNR 0 dB. The resultant performance of SBXCOR with MDW processing was much better than that of the smoothed group delay spectrum (SGDS) and mel-filterbank cepstral coefficients (MFCC) below SNR 10 dB. The results under the real acoustic condition were almost the same as those under the simulated acoustic condition.