Byonghwa LEE Kwangki KIM Minsoo HAHN
In interactive audio services, users can render audio objects freely to suit their preferences, and the spatial audio object coding (SAOC) scheme performs fairly well in terms of both bitrate and audio quality. However, perceptible audio quality degradation can occur when an object is suppressed or played alone. To address this, the SAOC scheme with Two-Step Coding (SAOC-TSC) was proposed. However, its side information bitrate is double that of the original SAOC because of the bitrate needed for the residual coding used to enhance the audio quality. In this paper, an efficient residual coding method for the SAOC-TSC is proposed to reduce the side information bitrate without degrading audio quality or increasing complexity.
Seungkwon BEACK Seung H. NAM Minsoo HAHN
We present a new speech enhancement algorithm for a car environment with two microphones. The car audio signals and other background noises are the target noises to be suppressed. Our algorithm is composed of two main parts: the spatial and the temporal processes. Multi-channel blind deconvolution (MBD) is applied to the spatial process, while a Kalman filter with a second-order high-pass filter is applied to the temporal one. For fast convergence, the MBD is newly expressed in the frequency domain with a normalization matrix. The final performance, evaluated on speech severely corrupted by car noise, shows that our algorithm produces noticeably enhanced speech.
Youngjoo SUH Hoirin KIM Minsoo HAHN Yongju LEE
In this letter, a new segment-level speech/nonspeech classification method based on the Poisson polling technique is proposed. The proposed method makes two modifications to the baseline Poisson polling method to further improve the classification accuracy. The first is to employ Poisson mixture models to represent more accurately the various segmental patterns of the observed frequencies for frame-level input features. The second is soft counting-based frequency estimation, which improves the reliability of the observed frequencies. The effectiveness of the proposed method is confirmed by experimental results showing a maximum error reduction of 39% compared to the segmentally accumulated log-likelihood ratio-based method.
Kwangki KIM Minsoo HAHN Jinsul KIM
MPEG spatial audio object coding (SAOC) is a new audio coding standard that efficiently represents various audio objects as a down-mix signal and spatial parameters. Through the down-mix signal, MPEG SAOC is backward compatible with existing playback systems. If a mastering signal is used instead of the down-mix signal to provide CD-like sound quality, the output signal decoded with the mastering signal may easily be degraded because of the difference between the down-mix and the mastering signals. To use the mastering signal successfully in MPEG SAOC, the difference between the two signals should be eliminated. As a simple approach, we propose mastering signal processing using the mastering down-mix gain (MDG), which is similar to the arbitrary down-mix gain of MPEG Surround. We also propose enhanced mastering signal processing using the MDG bias in order to reduce the quantization errors of the MDG. Experimental results show that the proposed schemes can improve the sound quality of the output signal decoded with the mastering signal. In particular, the enhanced method performs better than the simple method in terms of both quantization errors and sound quality.
Jinsul KIM Hyunwoo LEE Won RYU Byungsun LEE Minsoo HAHN
In this letter, we propose a shared adaptive packet loss concealment scheme for the high quality guaranteed Internet telephony service which connects multiple users. In order to recover packet loss efficiently in the all-IP based convergence environment, we provide a robust signal recovery scheme which is based on the shared adaptive both-side information utilization. This scheme is provided according to the average magnitude variation across the frames and the pitch period replication on the 1-port gateway (G/W) system. The simulated performance demonstrates that the proposed scheme has the advantages of low processing times and high recovery rates in the all-IP based ubiquitous environment.
Ji-Yeoun LEE Sangbae JEONG Minsoo HAHN
Combination of mutually complementary features is necessary to cope with various changes in pattern classification between normal and pathological voices. This paper proposes a method to improve pathological/normal voice classification performance by combining heterogeneous features. Different combinations of auditory-based and higher-order features are investigated. Their performances are measured by Gaussian mixture models (GMMs), linear discriminant analysis (LDA), and a classification and regression tree (CART) method. The proposed classification method by using the CART analysis is shown to be an effective method for pathological voice detection, with a 92.7% classification performance rate. This is a noticeable improvement of 54.32% compared to the MFCC-based GMM algorithm in terms of error reduction.
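The two figures quoted above can be related by simple arithmetic. The following is a minimal sketch, assuming that "error reduction" means the relative reduction in classification error rate; the abstract does not state the baseline accuracy, so the value computed here is only what the two quoted numbers would imply under that assumption. The function names are hypothetical.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Relative reduction in error rate (percent), given accuracies in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


def implied_baseline_accuracy(new_acc, reduction):
    """Baseline accuracy (percent) implied by a new accuracy and a
    relative error reduction (percent). Assumption: 'error reduction'
    is measured relative to the baseline error rate."""
    new_err = 100.0 - new_acc
    baseline_err = new_err / (1.0 - reduction / 100.0)
    return 100.0 - baseline_err


# Figures quoted in the abstract: 92.7% accuracy, 54.32% error reduction.
baseline = implied_baseline_accuracy(92.7, 54.32)
print(round(baseline, 2))
print(round(relative_error_reduction(baseline, 92.7), 2))
```

Under this reading, the MFCC-based GMM baseline would sit at roughly 84% accuracy.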
Ji-Yeoun LEE Sangbae JEONG Hong-Shik CHOI Minsoo HAHN
This work proposes new features to improve pathological voice quality classification performance: the means, the variances, and the perturbations of higher-order statistics (HOS) such as the skewness and the kurtosis. The HOS-based features show meaningful differences among normal, grade 1, grade 2, and grade 3 voices classified on the GRBAS scale. The jitter, the shimmer, the harmonic-to-noise ratio (HNR), and the variance of the short-time energy are used as the conventional features. Performance is measured by the classification and regression tree (CART) method. Specifically, the CART-based method that uses both the conventional features and the HOS-based ones proves effective for pathological voice quality measurement, with a classification accuracy of 87.8%.
Heesik YANG Sangbae JEONG Minsoo HAHN
In this paper, we propose a new variable-bit-rate speech coder based on the waveform interpolation concept. After the coder extracts all parameters, it measures, for each parameter, the distortion between the current value and the predicted value, which is estimated by extrapolation from the past two parameters. A parameter is not transmitted unless the distortion exceeds a preset threshold. At the decoder side, a non-transmitted parameter is reconstructed by extrapolation from the past two parameters used to synthesize signals. In this way, we can reduce the total bit rate by 26% while keeping the speech quality degradation below 0.1 in perceptual evaluation of speech quality (PESQ) score.
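The transmit-or-extrapolate decision described above can be sketched as follows. This is an illustration only, under several assumptions not given in the abstract: parameters are treated as scalars, the predictor is a linear extrapolation from the past two values, and the distortion is a squared error. All function names are hypothetical.

```python
def extrapolate(prev2, prev1):
    """Linear extrapolation from the past two values (assumed predictor)."""
    return 2.0 * prev1 - prev2


def encode(params, threshold):
    """Return one (sent, value) pair per frame; value is None when not sent."""
    history = []  # values exactly as the decoder will reconstruct them
    stream = []
    for x in params:
        if len(history) < 2:
            # Not enough history to predict: always transmit.
            stream.append((True, x))
            history.append(x)
            continue
        pred = extrapolate(history[-2], history[-1])
        if (x - pred) ** 2 > threshold:  # assumed distortion: squared error
            stream.append((True, x))
            history.append(x)
        else:
            stream.append((False, None))
            history.append(pred)  # keep encoder state in step with decoder
    return stream


def decode(stream):
    """Rebuild the parameter track, extrapolating every skipped frame."""
    out = []
    for sent, value in stream:
        out.append(value if sent else extrapolate(out[-2], out[-1]))
    return out


track = [1.0, 2.0, 3.0, 4.0, 10.0, 16.0]
stream = encode(track, threshold=0.25)
print([sent for sent, _ in stream])
print(decode(stream))
```

The encoder predicts from the reconstructed history rather than from the original values, so that both sides extrapolate from identical data.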
Seungho HAN Jungpyo HONG Sangbae JEONG Minsoo HAHN
An efficient noise reduction algorithm is proposed to improve speech recognition performance for human-machine interfaces. In the algorithm, a probabilistic adaptation mode controller (AMC) is designed and applied to the generalized sidelobe canceller (GSC). To detect target speech intervals, the proposed AMC calculates the inter-channel correlation and estimates the speech absence probability (SAP). The adaptation mode of the adaptive filter in the GSC is then decided based on the SAP. Experimental results show that the proposed algorithm significantly improves speech recognition performance and signal-to-noise ratios in real noisy environments.
Sang-Jin KIM Jong-Jin KIM Minsoo HAHN
The development and evaluation of a hidden Markov model (HMM)-based Korean speech synthesis system are described. Statistical HMMs for Korean speech units are trained with a hand-labeled speech database that includes contextual information about phoneme, morpheme, word phrase, utterance, and break strength. The developed system produces speech with fairly good prosody. The synthesized speech is evaluated and compared with that of our corpus-based unit-concatenating Korean text-to-speech system. The two systems were trained with the same manually labeled speech database.
Keunseok CHO Sangbae JEONG Minsoo HAHN
This paper proposes a new algorithm to encode the spectral envelope for G.729.1 more accurately. It applies the normalized least-mean-square (NLMS) algorithm to each subband energy of the modified discrete cosine transform (MDCT) in the time-domain alias cancellation (TDAC) of G.729.1. By utilizing the estimation error of subband energies by means of NLMS, allocated bit reduction for spectral envelope coding is achieved. The saved bits are then reused to improve the spectral envelope estimation and thus enhance the sound quality. Experimental results confirm that the proposed algorithm improves the sound quality under both clean and packet loss conditions.
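For readers unfamiliar with NLMS, the following minimal sketch shows a generic NLMS predictor run over a sequence of values, such as one subband's energy across frames. It is not the G.729.1 procedure itself: the filter order, step size, and the way the prediction error would feed bit allocation are assumptions or left out, and the function name is hypothetical.

```python
def nlms_predict(signal, order=2, mu=0.5, eps=1e-8):
    """Predict each sample from the previous `order` samples with NLMS.

    Returns the list of prediction errors, one per predicted sample.
    """
    w = [0.0] * order
    errors = []
    for n in range(order, len(signal)):
        x = signal[n - order:n][::-1]  # most recent sample first
        y = sum(wi * xi for wi, xi in zip(w, x))  # prediction
        e = signal[n] - y  # prediction (estimation) error
        norm = eps + sum(xi * xi for xi in x)  # input power normalization
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        errors.append(e)
    return errors


# Hypothetical subband energy track that stays constant across frames.
energies = [10.0] * 50
errors = nlms_predict(energies)
print(abs(errors[0]), abs(errors[-1]))
```

When the energy evolves smoothly, the prediction error becomes small, which is the property a coder could exploit to spend fewer bits on the envelope.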
Jinsul KIM Hyunwoo LEE Won RYU Seungho HAN Minsoo HAHN
This letter focuses on improving current noise reduction methods to address the critical speech distortion problem, providing robust noise reduction for speech enhancement over IP networks. For robust noise reduction with packet loss recovery, we propose a novel optimized Wiener filtering technique that uses the estimated signal-to-noise ratio (SNR) together with a packet loss recovery method, applied as post-filtering over IP networks. Simulation results demonstrate that, across the packet loss and SNR conditions considered, the proposed scheme provides better noise reduction and recovery rates than other methods.
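As background, the textbook Wiener gain computed from an estimated SNR is sketched below. This is the standard form G = SNR / (1 + SNR), not the optimized variant proposed in the letter, whose details the abstract does not give. The function names and the per-bin SNR inputs are hypothetical.

```python
def db_to_linear(snr_db):
    """Convert an SNR in decibels to a linear power ratio."""
    return 10.0 ** (snr_db / 10.0)


def wiener_gain(snr_linear):
    """Textbook Wiener gain for a given linear SNR estimate."""
    return snr_linear / (1.0 + snr_linear)


def apply_gain(magnitudes, snr_db_per_bin):
    """Scale each spectral magnitude by the Wiener gain of its bin."""
    return [m * wiener_gain(db_to_linear(s))
            for m, s in zip(magnitudes, snr_db_per_bin)]


# Hypothetical magnitudes and per-bin SNR estimates in dB.
print(apply_gain([1.0, 1.0, 1.0], [-10.0, 0.0, 20.0]))
```

Bins with low estimated SNR are attenuated strongly, while bins with high estimated SNR pass almost unchanged.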
Sangbae JEONG Hoirin KIM Minsoo HAHN
In this paper, we propose an algorithm that reduces the response time of HMM-based speech recognizers by selecting promising HMM states with single Gaussians. During recognition, HMM state likelihoods are first evaluated with the corresponding single Gaussians; the likelihoods from the original full Gaussians are then computed, and substituted, only for the HMM states with relatively large likelihoods. This significantly reduces the pattern-matching time without any noticeable loss in recognition rate. In addition, we cluster the single Gaussians into groups by measuring the distance between Gaussians, which further reduces the extra memory required. In our 10,000-word Korean POI (point-of-interest) recognition task, the proposed algorithm reduces the response time by 35.57% compared with the baseline system, at the cost of a 10% degradation of the WER.
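The two-pass likelihood evaluation can be sketched as follows. This is a minimal illustration under stated assumptions: the "full Gaussians" are read as each state's original diagonal-covariance mixture, "relatively large" is implemented as a beam below the best single-Gaussian score, and the clustering step is omitted. The data layout and all names are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def log_gmm(x, components):
    """Log density of a mixture given as (weight, mean, var) tuples."""
    logs = [math.log(w) + log_gauss(x, m, v) for w, m, v in components]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))


def state_scores(x, states, beam):
    """Score every state cheaply, then rescore only the promising ones."""
    coarse = [log_gauss(x, *s["single"]) for s in states]
    best = max(coarse)
    scores = []
    for s, c in zip(states, coarse):
        if c >= best - beam:  # assumed selection rule: within a beam of the best
            scores.append(log_gmm(x, s["full"]))  # replace with full score
        else:
            scores.append(c)  # keep the cheap single-Gaussian score
    return scores


states = [
    {"single": ([0.0], [1.0]),
     "full": [(0.5, [-1.0], [0.5]), (0.5, [1.0], [0.5])]},
    {"single": ([10.0], [1.0]),
     "full": [(0.5, [9.0], [0.5]), (0.5, [11.0], [0.5])]},
]
print(state_scores([0.2], states, beam=5.0))
```

Only the first state falls inside the beam here, so the second state never incurs the cost of the full mixture evaluation.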
This letter describes a two-band excitation model for HMM-based speech synthesis. The HMM-based speech synthesis system generates speech from the HMM training data of the spectral and excitation parameters. Synthesized speech has a typical quality of "vocoded sound" mostly because of the simple excitation model with the voiced/unvoiced selection. In this letter, two-band excitation based on the harmonic plus noise speech model is proposed for generating the mixed excitation source. With this model, we can generate the mixed excitation more accurately and reduce the memory for the trained excitation data as well.
Heesik YANG Sangbae JEONG Minsoo HAHN
In our previous study, a distortion measure based variable bit rate (DM-VBR) scheme for waveform interpolation (WI) coders was proposed. In this paper, a repetition method is proposed to estimate non-transmitted parameters in place of the extrapolation method. To further reduce the bit rate of the slowly evolving waveform (SEW), the dimensions of the past parameters, which differ from those of the current parameters, are converted to match the dimension of the current ones. Distortions between interpolated sub-frames and original sub-frames are measured to reduce the number of SEW parameters. The usefulness of several other distortion measures, as alternatives to the simple log spectral distortion, is also investigated. Experimental results show that the coder adopting the new schemes achieves a bit rate reduction of more than 41% with almost unnoticeable degradation of the output speech.