Shin Jae KANG Kang Hyun LEE Nam Soo KIM
In this letter, we propose a novel supervised pre-training technique for deep neural network (DNN)-hidden Markov model systems to achieve robust speech recognition in adverse environments. In the proposed approach, our aim is to initialize the DNN parameters such that they yield abstract features robust to acoustic environment variations. To achieve this, we first derive the abstract features from an early fine-tuned DNN model trained on a clean speech database. Using the derived abstract features as the target values, the standard error back-propagation algorithm with stochastic gradient descent is applied to estimate the initial parameters of the DNN. The performance of the proposed algorithm was evaluated on the Aurora-4 database, and better results were observed compared to a number of conventional pre-training methods.
In this letter, we propose a new approach to estimate the degree of noise masking based on a sophisticated model for the clean speech distribution. This measure, named the noise masking probability (NMP), is incorporated into the feature compensation technique to achieve robust speech recognition in noisy environments. Experimental results show that the proposed approach improves the performance of the baseline recognition system in the presence of various background noises.
Jong Kyu KIM Jung Su KIM Hwan Sik YUN Joon-Hyuk CHANG Nam Soo KIM
This letter presents a novel frame splitting scheme for error-robust audio streaming over packet-switching networks. In our approach to perceptual audio coding, an audio frame is split into several subframes based on the network configuration such that each packet can be decoded independently at the receiver. A subjective comparison category rating (CCR) test shows that our approach enhances the quality of the decoded audio signal in lossy packet-switching network environments.
This letter introduces a pre-rejection technique for wireless channel distorted speech with application to automatic speech recognition (ASR). Based on an analysis of speech signals distorted over a wireless communication channel, we propose a method to reject the channel distorted speech with a small computational load. A number of simulation results show that the pre-rejection algorithm enhances the robustness of speech recognition.
In this paper, we propose new approaches to speech enhancement based on soft decision. In order to enhance the statistical reliability in estimating speech activity, we introduce the concept of a global speech absence probability (GSAP). First, we compute the conventional speech absence probability (SAP) and then modify it according to the newly proposed GSAP. The modification is made in such a way that the SAP takes the same value as the GSAP in the case of speech absence, while it is kept at its original value when speech is present. Moreover, to improve the performance of the SAPs at voice tails (transition periods from speech to silence), we revise the SAPs using a hang-over scheme based on the hidden Markov model (HMM). In addition, we suggest a robust noise update algorithm in which the noise power is estimated not only in the periods of speech absence but also during speech activity, based on soft decision. To improve the SAP determination and noise update routines, we also present a new signal-to-noise ratio (SNR) concept, which is called the predicted SNR in this paper. Furthermore, we demonstrate that the discrete cosine transform (DCT) enhances the accuracy of the SAP estimation. A number of tests show that the proposed method, which is called the speech enhancement based on soft decision (SESD) algorithm, yields better performance than the conventional approaches.
Kisoo KWON Jong Won SHIN Nam Soo KIM
Nonnegative matrix factorization (NMF) is an unsupervised technique to represent nonnegative data as linear combinations of nonnegative bases, which has shown impressive performance for source separation. However, its source separation performance degrades when one signal can also be described well with the bases for the interfering source signals. In this paper, we propose a discriminative NMF (DNMF) algorithm which exploits the reconstruction error for the interfering signals as well as the target signal based on target bases. The objective function for training the bases is constructed so as to yield high reconstruction error for the interfering source signals while guaranteeing low reconstruction error for the target source signals. Experiments show that the proposed method outperformed the standard NMF and another DNMF method in terms of both the perceptual evaluation of speech quality score and signal-to-distortion ratio in various noisy environments.
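For readers unfamiliar with the baseline, the following is a minimal sketch of standard NMF with the well-known multiplicative updates for the squared Euclidean cost. It is not the discriminative variant (DNMF) proposed in the abstract, whose objective function is not given here. The function names, the iteration count, and the small constant `eps` are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H (the reconstruction error)."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))


def nmf(v, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative matrix V (n x m) into W (n x rank) and H (rank x m).

    Uses multiplicative updates, which keep every entry nonnegative.
    n_iter, eps and seed are illustrative choices, not values from the paper.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n_rows)]
    return w, h


if __name__ == "__main__":
    # Toy nonnegative "spectrogram": rows are frequency bins, columns are frames.
    v = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 1.0, 1.0]]
    w, h = nmf(v, rank=2)
    print("reconstruction error:", frobenius_error(v, w, h))
```

In source separation, the columns of W play the role of the bases for one source and H holds their time-varying activations; the degradation described above arises when the bases of one source also reconstruct the other source well.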
Yu Gwang JIN Nam Soo KIM Joon-Hyuk CHANG
In this letter, we propose a novel speech enhancement algorithm based on data-driven residual gain estimation. The entire system consists of two stages. At the first stage, a conventional speech enhancement algorithm enhances the input signal while estimating several signal-to-noise ratio (SNR)-related parameters. At the second stage, the residual gain, which is estimated by a data-driven method, is applied to further enhance the signal. A number of experimental results show that the proposed speech enhancement algorithm outperforms the conventional speech enhancement technique based on soft decision and the data-driven approach using an SNR grid look-up table.
Sung Soo KIM Chang Woo HAN Nam Soo KIM
In this letter, we present useful features accounting for pronunciation prominence and propose a classification technique for prominence detection. A set of phone-specific features is extracted based on a forced alignment of the test pronunciation provided by a speech recognition system. These features are then applied to traditional classifiers such as the support vector machine (SVM), artificial neural network (ANN) and adaptive boosting (AdaBoost) for detecting the place of prominence.
Chang Woo HAN Shin Jae KANG Nam Soo KIM
In this letter, we propose a novel approach to human activity recognition. We present a class of features that are robust to the tilt of the attached sensor module and a state transition model suitable for HMM-based activity recognition. In addition, postprocessing techniques are applied to stabilize the recognition results. The proposed approach shows significant improvements in recognition experiments over a variety of human activity databases.
Jong Won SHIN Joon-Hyuk CHANG Nam Soo KIM
In this letter, we propose a novel approach to speech enhancement, which incorporates a new criterion based on residual noise shaping. In the proposed approach, our goal is to make the residual noise perceptually comfortable instead of making it less audible. A predetermined 'comfort noise' is provided as a target for the spectral shaping. Based on some assumptions, the resulting spectral gain function turns out to be a slight modification of the Wiener filter while requiring very low computational complexity. A subjective listening test shows that the proposed algorithm outperforms the conventional spectral enhancement technique based on soft decision and the noise suppression implemented in the IS-893 Selectable Mode Vocoder.
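As a point of reference for the gain function mentioned above, here is a small sketch of the classical Wiener spectral gain, G = ξ / (1 + ξ), where ξ is the a priori SNR on a linear scale. The abstract does not state how the proposed gain modifies this form, so the code shows only the unmodified Wiener gain; the function names are hypothetical.

```python
def wiener_gain(a_priori_snr):
    """Classical Wiener gain G = xi / (1 + xi) for a linear-scale a priori SNR xi."""
    if a_priori_snr < 0:
        raise ValueError("a priori SNR must be nonnegative")
    return a_priori_snr / (1.0 + a_priori_snr)


def apply_gain(noisy_magnitudes, a_priori_snrs):
    """Scale each noisy spectral magnitude by the Wiener gain of its frequency bin."""
    return [wiener_gain(xi) * mag for mag, xi in zip(noisy_magnitudes, a_priori_snrs)]


if __name__ == "__main__":
    # Two frequency bins with a priori SNRs of 1 (0 dB) and 3 (about 4.8 dB).
    print(apply_gain([2.0, 4.0], [1.0, 3.0]))
```

The gain needs only one addition and one division per frequency bin, which is consistent with the very low computational complexity claimed for a gain of this family.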
Hwan Sik YUN Kiho CHO Nam Soo KIM
Acoustic data transmission is a technique which embeds data in a sound wave imperceptibly and detects it at a receiver. The data are embedded in an original audio signal and transmitted through the air by playing back the data-embedded audio using a loudspeaker. At the receiver, the data are extracted from the received audio signal captured by a microphone. In our previous work, we proposed an acoustic data transmission system designed based on phase modification of the modulated complex lapped transform (MCLT) coefficients. In this paper, we propose the spectral magnitude adjustment (SMA) technique which not only enhances the quality of the data-embedded audio signal but also improves the transmission performance of the system.
Joon-Hyuk CHANG Nam Soo KIM Sanjit K. MITRA
In this letter, we propose an approach to incorporate a statistical model for the voiced/unvoiced (V/UV) speech decision under background noise environments. Our approach consists of splitting the input noisy speech into two separate bands and applying a statistical model for each band. We compute and compare the likelihood ratio (LR) for each band based on the statistical model and estimated noise statistics for the V/UV decision. According to the simulation test, the proposed V/UV decision shows a better performance compared with the selectable mode vocoder (SMV) V/UV decision algorithm, particularly in clean and white noise environments.
Recently, notable improvements in the voice activity detection (VAD) problem have been achieved by adopting several machine learning techniques. Among them, the deep neural network (DNN), which learns the mapping between the noisy speech features and the corresponding voice activity status with its deep hidden structure, has been one of the most popular techniques. In this letter, we propose a novel approach which enhances the robustness of the DNN in mismatched noise conditions with a multi-task learning (MTL) framework. In the proposed algorithm, a feature enhancement task for speech features is jointly trained with the conventional VAD task. The experimental results show that the DNN with the proposed framework outperforms the conventional DNN-based VAD algorithm.
June Sig SUNG Doo Hwa HONG Hyun Woo KOO Nam Soo KIM
In our previous study, we proposed the waveform interpolation (WI) approach to model the excitation signals for hidden Markov model (HMM)-based speech synthesis. This letter presents several techniques to improve excitation modeling within the WI framework. We propose both the time domain and frequency domain zero padding techniques to reduce the spectral distortion inherent in the synthesized excitation signal. Furthermore, we apply non-negative matrix factorization (NMF) to obtain a low-dimensional representation of the excitation signals. From a number of experiments, including a subjective listening test, the proposed method has been found to enhance the performance of the conventional excitation modeling techniques.
Woohyung LIM Chang Woo HAN Nam Soo KIM
In this letter, we propose a novel approach to feature compensation performed in the cepstral domain. Processing in the cepstral domain has the advantage that the spectral correlation among different frequencies is taken into consideration. By introducing a linear approximation with a diagonal covariance assumption, we modify the conventional log-spectral domain feature compensation technique to fit the cepstral domain. The proposed approach shows significant improvements in the AURORA2 speech recognition task.
Chang Woo HAN Shin Jae KANG Nam Soo KIM
In this letter, we propose a novel approach to estimate three different kinds of phone mismatch penalty matrices for two-stage keyword spotting. When the output of a phone recognizer is given, detection of a specific keyword is carried out through text matching with the phone sequences provided by the specified keyword using the proposed phone mismatch penalty matrices. The penalty matrices associated with substitution, insertion and deletion errors are estimated from the training data through deliberate error generation. The proposed approach has shown a significant improvement in a Korean continuous speech recognition task.
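The text-matching step described above can be illustrated with a weighted edit distance, in which substitution, insertion and deletion each draw their cost from a penalty table. This is only a sketch of the general technique: the abstract does not specify the matching algorithm or how the penalties are estimated, and the penalty values, phone symbols, and function names below are hypothetical.

```python
def weighted_edit_distance(recognized, keyword, sub_cost, ins_cost, del_cost):
    """Minimum total penalty for aligning a recognized phone sequence to a keyword.

    sub_cost(k, r): penalty when keyword phone k is recognized as phone r.
    ins_cost(r):    penalty for an extra recognized phone r.
    del_cost(k):    penalty for a keyword phone k missing from the recognition.
    """
    n, m = len(recognized), len(keyword)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + ins_cost(recognized[i - 1])
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + del_cost(keyword[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + sub_cost(keyword[j - 1], recognized[i - 1]),
                d[i - 1][j] + ins_cost(recognized[i - 1]),
                d[i][j - 1] + del_cost(keyword[j - 1]),
            )
    return d[n][m]


# Hypothetical penalty tables; in the paper these would be estimated from data.
CONFUSABLE = {("p", "b"): 0.2, ("b", "p"): 0.2}


def example_sub_cost(keyword_phone, recognized_phone):
    """Zero for a match, a small penalty for a confusable pair, 1.0 otherwise."""
    if keyword_phone == recognized_phone:
        return 0.0
    return CONFUSABLE.get((keyword_phone, recognized_phone), 1.0)


def example_ins_cost(recognized_phone):
    """Flat insertion penalty (hypothetical value)."""
    return 0.7


def example_del_cost(keyword_phone):
    """Flat deletion penalty (hypothetical value)."""
    return 0.8


if __name__ == "__main__":
    score = weighted_edit_distance(
        ["k", "a", "b"], ["k", "a", "p"],
        example_sub_cost, example_ins_cost, example_del_cost,
    )
    print("mismatch penalty:", score)
```

A keyword would be declared detected when the accumulated penalty falls below some decision threshold, so phone-dependent penalties let frequent recognizer confusions cost less than rare ones.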
Doo Hwa HONG June Sig SUNG Kyung Hwan OH Nam Soo KIM
Decision tree-based clustering and parameter estimation are essential steps in the training part of an HMM-based speech synthesis system. These two steps are usually performed based on the maximum likelihood (ML) criterion. However, one of the drawbacks of the ML criterion is that it is sensitive to outliers, which usually result in quality degradation of the synthesized speech. In this letter, we propose an approach to detect and remove outliers for HMM-based speech synthesis. Experimental results show that the proposed approach can improve the quality of the synthetic speech, particularly when the available training speech database is insufficient.
In this paper, we propose a novel target acoustic signal detection approach which is based on non-negative matrix factorization (NMF). Target basis vectors are trained from the target signal database through NMF, and input vectors are projected onto the subspace spanned by these target basis vectors. By analyzing the distribution of the time-varying normalized projection error, the optimal threshold can be calculated to detect the target signal intervals within the entire input signal. Experimental results show that the proposed algorithm can detect the target signal successfully under various signal environments.
In this letter, we propose a coding mode selection method for the AMR-WB+ audio coder based on a decision tree. In order to reduce computation while maintaining good performance, a decision tree classifier is adopted with the closed-loop mode selection results as the target classification labels. The size of the decision tree is controlled by pruning, so the proposed method does not increase the memory requirement significantly. Through an evaluation test on a database covering both speech and music materials, the proposed method is found to achieve a much better mode selection accuracy compared with the open-loop mode selection module in the AMR-WB+.
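The projection-and-threshold idea can be sketched as follows, assuming the target basis vectors are already trained. Each input vector is approximated by a nonnegative combination of the bases (found here with multiplicative updates on the activations only), and the normalized residual is compared with a threshold. The abstract derives its threshold from the distribution of the error; the fixed threshold, the toy bases, and all function names below are assumptions for illustration.

```python
import math


def project_nonneg(x, bases, n_iter=200, eps=1e-12):
    """Find activations h >= 0 that approximately minimize ||x - sum_i h[i] * bases[i]||."""
    k = len(bases)
    h = [1.0] * k
    # Precompute B^T x and B^T B, where the basis vectors are the columns of B.
    btx = [sum(b * xv for b, xv in zip(basis, x)) for basis in bases]
    btb = [[sum(p * q for p, q in zip(bi, bj)) for bj in bases] for bi in bases]
    for _ in range(n_iter):
        # Multiplicative update: h <- h * (B^T x) / (B^T B h), element-wise.
        h = [h[i] * btx[i] / (sum(btb[i][j] * h[j] for j in range(k)) + eps)
             for i in range(k)]
    return h


def normalized_projection_error(x, bases):
    """Return ||x - x_hat|| / ||x||, where x_hat is the nonnegative projection of x."""
    h = project_nonneg(x, bases)
    x_hat = [sum(h[i] * bases[i][d] for i in range(len(bases))) for d in range(len(x))]
    residual = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
    norm = math.sqrt(sum(a * a for a in x))
    return residual / (norm + 1e-12)


def is_target(x, bases, threshold=0.5):
    """Flag a frame as target when its error is below a (hypothetical) threshold."""
    return normalized_projection_error(x, bases) < threshold


if __name__ == "__main__":
    bases = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # toy target basis vectors
    print(is_target([2.0, 3.0, 0.0], bases))       # frame lying in the target subspace
    print(is_target([0.0, 0.0, 5.0], bases))       # frame outside the target subspace
```

Frames that the target bases reconstruct well yield a small normalized error, while frames dominated by other sources leave most of their energy in the residual.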
Joon-Hyuk CHANG Dong Seok JEONG Nam Soo KIM Sangki KANG
In this letter, we propose an improved global soft decision for noisy speech enhancement. An investigation of statistical model-based speech enhancement reveals that the global soft decision has a fundamental drawback at the tail regions of speech signals. For that reason, we propose a new solution based on a smoothed likelihood ratio for the global soft decision. The performance of the proposed method is evaluated by subjective tests under various environments and shows better results compared with our previous work.