Lu YIN Junfeng LI Yonghong YAN Masato AKAGI
Simultaneous utterances impair the ability of both hearing-impaired persons and automatic speech recognition systems to process speech. Recently, deep neural networks have dramatically improved speech separation performance. However, most previous works estimate only the speech magnitude and use the mixture phase for speech reconstruction. The use of the mixture phase has become a critical limitation for separation performance. This study proposes a two-stage phase-aware approach for multi-talker speech separation, which jointly recovers the magnitude as well as the phase. For the phase recovery, the Multiple Input Spectrogram Inversion (MISI) algorithm is utilized due to its effectiveness and simplicity. The study implements the MISI algorithm based on the mask and shows that the ideal amplitude mask (IAM) is the optimal mask for the mask-based MISI phase recovery, as it introduces less phase distortion. To compensate for the error of phase recovery and minimize the signal distortion, an advanced mask is proposed for the magnitude estimation. The IAM and the proposed mask are estimated at different stages to recover the phase and the magnitude, respectively. Two neural network frameworks are evaluated for the magnitude estimation in the second stage, demonstrating the effectiveness and flexibility of the proposed approach. The experimental results demonstrate that the proposed approach significantly reduces the distortions of the separated speech.
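As a rough illustration of the mask mentioned in this abstract, the sketch below computes an ideal amplitude mask as the ratio of source magnitude to mixture magnitude, then rebuilds a source estimate with the mixture phase, which is the baseline the abstract identifies as limiting. It does not implement MISI or the proposed advanced mask. The function names, the `eps` guard, and the random toy spectrograms are assumptions made for illustration.

```python
import numpy as np

def ideal_amplitude_mask(source_stft, mixture_stft, eps=1e-8):
    """IAM per time-frequency bin: |S| / |Y| (eps avoids division by zero)."""
    return np.abs(source_stft) / (np.abs(mixture_stft) + eps)

def apply_mask_with_mixture_phase(mask, mixture_stft):
    """Scale the mixture magnitude by the mask and reuse the mixture phase."""
    return mask * np.abs(mixture_stft) * np.exp(1j * np.angle(mixture_stft))

# Toy two-talker mixture: random complex "spectrograms" (freq bins x frames).
rng = np.random.default_rng(0)
s1 = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
s2 = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
y = s1 + s2

m1 = ideal_amplitude_mask(s1, y)
est1 = apply_mask_with_mixture_phase(m1, y)

# The magnitude is recovered, but the phase is still that of the mixture.
print(np.max(np.abs(np.abs(est1) - np.abs(s1))))
```

Even with an ideal mask, the estimate carries the mixture phase rather than the source phase, which is the gap a phase recovery stage is meant to close.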
Cheng-Hong YANG Li-Yeh CHUANG Cheng-Huei YANG Ching-Hsing LUO
In this paper, Morse code is selected as a communication adaptive device for persons whose hand coordination and dexterity are impaired by such ailments as amyotrophic lateral sclerosis, multiple sclerosis, muscular dystrophy, and other severe handicaps. Morse code is composed of a series of dots, dashes, and space intervals, and each element is transmitted by sending a signal for a defined length of time. A suitable adaptive automatic recognition method is needed for persons with disabilities due to their difficulty in maintaining a stable typing rate. To overcome this problem, the proposed method combines the support vector machines method with a variable degree variable step size LMS algorithm. The method is divided into five stages: tone recognition, space recognition, training process, adaptive processing, and character recognition. Statistical analyses demonstrated that the proposed method elicited a better recognition rate in comparison to alternative methods from the literature.
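To make the unstable-typing-rate problem concrete, here is a minimal sketch of tone classification with a drifting time unit. It is not the paper's method, which combines support vector machines with a variable degree variable step size LMS algorithm. Instead, it assumes standard Morse timing (a dash lasts three dot units) and tracks the unit length with a simple first-order update. The function name `classify_tones` and its parameters are hypothetical.

```python
def classify_tones(durations, init_unit=0.1, rate=0.3):
    """Label each tone duration (seconds) as '.' or '-'.

    Simplified stand-in for an adaptive recognizer: the estimated unit
    length follows the typist's speed through a first-order update.
    """
    unit = init_unit
    labels = []
    for d in durations:
        if d < 2.0 * unit:          # threshold midway between 1 and 3 units
            labels.append(".")
            target = d              # a dot lasts one unit
        else:
            labels.append("-")
            target = d / 3.0        # a dash lasts three units
        unit += rate * (target - unit)  # move the estimate toward the new evidence
    return "".join(labels)

# A typist who gradually slows down: dot/dash pairs at a growing unit length.
durations = [0.10, 0.30, 0.12, 0.36, 0.15, 0.45, 0.20, 0.60]
print(classify_tones(durations))
```

Because the threshold follows the running unit estimate, later and slower dots are still separated from dashes even though their absolute durations have doubled.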
Yu ZHOU Junfeng LI Yanqing SUN Jianping ZHANG Yonghong YAN Masato AKAGI
In this paper, we present a hybrid speech emotion recognition system exploiting both spectral and prosodic features in speech. For capturing the emotional information in the spectral domain, we propose a new spectral feature extraction method that applies a novel non-uniform subband processing instead of the mel-frequency subbands used in Mel-Frequency Cepstral Coefficients (MFCC). For prosodic features, a set of features that are closely correlated with speech emotional states is selected. In the proposed hybrid emotion recognition system, due to the inherently different characteristics of these two kinds of features (e.g., data size), the newly extracted spectral features are modeled by a Gaussian Mixture Model (GMM) and the selected prosodic features are modeled by a Support Vector Machine (SVM). The final result of the proposed emotion recognition system is obtained by combining the results from these two subsystems. Experimental results show that (1) the proposed non-uniform spectral features are more effective than the traditional MFCC features for emotion recognition; (2) the proposed hybrid emotion recognition system using both spectral and prosodic features yields a relative recognition error reduction of 17.0% over traditional recognition systems using only the spectral features, and 62.3% over those using only the prosodic features.
Xin LI Jielin PAN Qingwei ZHAO Yonghong YAN
Morphemes, which are obtained from morphological parsing, and statistical sub-words, which are derived from data-driven splitting, are commonly used as the recognition units for speech recognition of agglutinative languages. In this letter, we propose a discriminative approach to select, for each distinct word type, the splitting result that is more likely to improve the recognizer's performance. An objective function that involves the unigram language model (LM) probability and the count of misrecognized phones on the acoustic training data is defined and minimized. After determining the splitting result for each word in the text corpus, we select the frequent units to build a hybrid vocabulary including morphemes and statistical sub-words. Compared to a statistical sub-word based system, the hybrid system achieves a 0.8% reduction in letter error rate (LER) on the test set.
Qingqing ZHANG Jielin PAN Yang LIN Jian SHAO Yonghong YAN
In recent decades, there has been a great deal of research into the problem of bilingual speech recognition, that is, developing a recognizer that can handle inter- and intra-sentential language switching between two languages. This paper presents our recent work on the development of a grammar-constrained, Mandarin-English bilingual Speech Recognition System (MESRS) for real-world music retrieval. Two of the main difficulties in building bilingual speech recognition systems for real-world applications are tackled in this paper. One is to balance the performance and the complexity of the bilingual speech recognition system; the other is to deal effectively with matrix language accents in the embedded language. In order to process intra-sentential language switching and reduce the amount of data required to robustly estimate statistical models, a single compact set of bilingual acoustic models derived by phone set merging and clustering is developed instead of using a separate monolingual model for each language. In our study, a novel Two-pass phone clustering method based on Confusion Matrix (TCM) is presented and compared with the log-likelihood measure method. Experiments show that TCM achieves better performance. Since the native language of potential system users is Mandarin, which is regarded as the matrix language in our application, their pronunciations of English as the embedded language usually contain Mandarin accents. In order to deal with matrix language accents in the embedded language, different non-native adaptation approaches are investigated. Experiments show that the model retraining method outperforms other common adaptation methods such as Maximum A Posteriori (MAP).
With the effective incorporation of approaches on phone clustering and non-native adaptation, the Phrase Error Rate (PER) of MESRS for English utterances was reduced by 24.47% relatively compared to the baseline monolingual English system while the PER on Mandarin utterances was comparable to that of the baseline monolingual Mandarin system. The performance for bilingual utterances achieved 22.37% relative PER reduction.
Chong WU Le ZHANG Houwang ZHANG Hong YAN
In this letter, we propose a hierarchical segmentation (HS) method for color images, which not only maintains segmentation accuracy but also ensures good speed. In our method, HS adopts fuzzy simple linear iterative clustering (Fuzzy SLIC) to obtain an over-segmentation result. Then, HS uses fast fuzzy C-means clustering (FFCM) to produce a rough segmentation result based on superpixels. Finally, HS applies non-iterative K-means clustering using a priority queue (KPQ) to refine the segmentation result. In the validation experiments, we tested our method and compared it with state-of-the-art image segmentation methods on the Berkeley (BSD500) benchmark under different types of noise. The experimental results show that our method outperforms state-of-the-art techniques in terms of accuracy, speed and robustness.
Xiang ZHANG Hongbin SUO Qingwei ZHAO Yonghong YAN
In this letter, we propose a new approach to SVM based speaker recognition, which utilizes a kind of novel phonotactic information as the feature for SVM modeling. Gaussian mixture models (GMMs) have been proven extremely successful for text-independent speaker recognition. The GMM universal background model (UBM) is a speaker-independent model, each component of which can be considered as modeling some underlying phonetic sound classes. We assume that the utterances from different speakers should get different average posterior probabilities on the same Gaussian component of the UBM, and the supervector composed of the average posterior probabilities on all components of the UBM for each utterance should be discriminative. We use these supervectors as the features for SVM based speaker recognition. Experiment results on a NIST SRE 2006 task show that the proposed approach demonstrates comparable performance with the commonly used systems. Fusion results are also presented.
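The supervector described in this abstract can be sketched as follows: for each frame, compute the posterior probability of every UBM component, then average those posteriors over the utterance. This is a minimal illustration assuming a diagonal-covariance GMM. The function name, the toy two-component "UBM", and the synthetic frames are assumptions, and the SVM stage is omitted.

```python
import numpy as np

def average_posterior_supervector(frames, weights, means, variances):
    """Average per-frame component posteriors of a diagonal-covariance GMM.

    frames: (T, D) features; weights: (C,); means, variances: (C, D).
    Returns a length-C vector, one entry per Gaussian component.
    """
    diff = frames[:, None, :] - means[None, :, :]                  # (T, C, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_joint = np.log(weights) + log_gauss                        # (T, C)
    log_joint -= log_joint.max(axis=1, keepdims=True)              # numerical stability
    post = np.exp(log_joint)
    post /= post.sum(axis=1, keepdims=True)                        # per-frame posteriors
    return post.mean(axis=0)

# Toy "UBM" with two components and an utterance that sits near the second one.
rng = np.random.default_rng(0)
weights = np.array([0.5, 0.5])
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
variances = np.ones((2, 2))
utt = rng.standard_normal((50, 2)) + np.array([2.0, 0.0])

sv = average_posterior_supervector(utt, weights, means, variances)
print(sv)
```

Each utterance maps to a fixed-length vector regardless of its duration, which is what makes it usable as an SVM input feature.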
Hong YANG Linbo QING Xiaohai HE Shuhua XIONG
Wireless video sensor networks face constraints such as low power consumption of sensor nodes, low computing capacity of nodes, and unstable channel bandwidth. To transmit video coded with distributed video coding in wireless video sensor networks, we propose an efficient scalable distributed video coding scheme. In this scheme, the scalable Wyner-Ziv frame is based on transmission of different wavelet information, while the Key frame is based on transmission of different residual information. A successive refinement of side information for the Wyner-Ziv and Key frames is proposed in this scheme. Test results show that both the Wyner-Ziv and Key frames provide four layers of quality and bit-rate scalability, with no increase in encoder complexity.
Zhi LIU Yifan SU Shuzhong YANG Mengmeng ZHANG
Cross-component linear model (CCLM) chromaticity prediction is a new technique introduced in Versatile Video Coding (VVC), which utilizes the reconstructed luminance component to predict the chromaticity parts and can improve the coding performance. However, it increases the coding complexity. In this paper, we study how to accelerate the chroma intra-prediction process based on texture characteristics. Firstly, two observations are made from experimental statistics of the process. One is that the choice of the chroma intra-prediction candidate modes is closely related to the texture complexity of the coding unit (CU), and the other is that whether the direct mode (DM) is selected is closely related to the texture similarity between the current chromaticity CU and the corresponding luminance CU. Secondly, a fast chroma intra-prediction mode decision algorithm is proposed based on these observations. A modified metric named sum modulus difference (SMD) is introduced to measure the texture complexity of the CU and guide the filtering of irrelevant candidate modes. Meanwhile, the structural similarity index measurement (SSIM) is adopted to help judge whether the DM is selected. The experimental results show that, compared with the reference model VTM8.0, the proposed algorithm reduces the coding time by 12.92% on average and increases the BD-rate of the Y, U, and V components by only 0.05%, 0.32%, and 0.29%, respectively.
Jieling WANG Yinghui ZHANG Hong YANG Kechu YI
In this letter, the interference cancellation technique is introduced to single carrier (SC) block transmission systems in sparse Rician frequency selective fading channels, and an effective equalizer is presented. A hard decision on the transmitted signal is made by commonly used SC equalizers, and every multipath signal can be reconstructed from this initial solution and the channel state information. Then, the final demodulation result is obtained from the line-of-sight component of the received signal, which is extracted by cancelling the other multipath signals in the received signal. The solution can be further used to reconstruct the multipath signals, allowing a multistage detector with higher performance to be realized. It is shown by Monte Carlo simulations in an SUI-5 channel that the new scheme offers dramatically higher performance than traditional equalization schemes.
Chuan CAO Ming LI Xiao WU Hongbin SUO Jian LIU Yonghong YAN
In this letter, we present an automatic approach to objective singing performance evaluation for untrained singers by relating acoustic measurements to perceptual ratings of singing voice quality. Several acoustic parameters and their combination features are investigated to find objective correspondences of the perceptual evaluation criteria. Experimental results show a relatively strong correlation between perceptual ratings and the combined features, and the reliability of the proposed evaluation system is found to be comparable to that of human judges.
Yinan SUN Yongpan LIU Zhibo WANG Huazhong YANG
Function speculation design with error recovery mechanisms is quite promising due to its high performance and low area overhead. Previous work has focused on two-stage function speculation and thus lacks a systematic way to address the challenge of the multistage function speculation approach. This paper proposes a multistage function speculation with adaptive predictors and applies it in a novel adder. We deduced the analytical performance and area models for the design and validated them in our experiments. Based on those models, a general methodology is presented to guide design optimization. Both analytical proofs and experimental results on the fabricated chips show that the proposed adder's delay and area have a logarithmic and linear relationship with its bit number, respectively. Compared with the DesignWare IP, the proposed adder provides the same performance with 6-17% area reduction under different bit lengths.
Junbo ZHANG Fuping PAN Bin DONG Qingwei ZHAO Yonghong YAN
In this paper, we present a novel method for automatic pronunciation quality assessment. Unlike the popular “Goodness of Pronunciation” (GOP) method, this method does not map the decoding confidence into a pronunciation quality score, but directly differentiates utterances of different pronunciation quality. In this method, the student's utterance needs to be decoded twice. The first decoding pass obtains the time boundaries of each phone of the utterance by forced alignment using a conventionally trained acoustic model (AM). The second decoding pass differentiates the pronunciation quality of each triphone using a specially trained AM, in which triphones of different pronunciation qualities are trained as different units, and the model is trained discriminatively to ensure the best discrimination among triphones that have the same name but different pronunciation quality scores. The decoding network in the second pass includes triphones of different pronunciation qualities, so the phone-level scores can be obtained directly from the decoding result. The phone-level scores are combined into sentence-level scores using the maximum entropy criterion. The experimental results show that the scoring performance is increased significantly compared to the GOP method, especially at the sentence level.
Jieling WANG Hong YANG Kechu YI
A space-time and multipath diversity combining algorithm is presented for an STBC single carrier block transmission system with two transmit antennas and one receive antenna. The initial solution is obtained by an STBC-based frequency domain equalizer, and the multipath components in the received signal are decoupled using this initial solution and the channel state information. Finally, STBC combining is carried out on each decoupled multipath component separately, and then the single carrier output branches are combined further using the maximal ratio combining (MRC) algorithm.
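The final combining step named in this abstract can be illustrated with generic maximal ratio combining. The sketch assumes each branch sees the same symbols scaled by a known flat complex gain, with equal noise power on every branch. The STBC decoding and multipath decoupling stages are not modeled, and the function name and toy data are assumptions.

```python
import numpy as np

def mrc_combine(branches, gains):
    """Maximal ratio combining of B branches.

    branches: (B, N) received samples; gains: (B,) known complex gains.
    Each branch is weighted by the conjugate of its gain, then the sum is
    normalized by the total channel power.
    """
    g = np.asarray(gains)[:, None]
    return np.sum(np.conj(g) * branches, axis=0) / np.sum(np.abs(g) ** 2)

# Toy example: BPSK symbols observed over three branches with additive noise.
rng = np.random.default_rng(0)
symbols = rng.choice([-1.0, 1.0], size=8)
gains = np.array([0.9 + 0.2j, 0.3 - 0.5j, -0.1 + 0.7j])
noise = 0.05 * (rng.standard_normal((3, 8)) + 1j * rng.standard_normal((3, 8)))
received = gains[:, None] * symbols + noise

combined = mrc_combine(received, gains)
print(np.sign(combined.real))
```

Weighting by the conjugate gain aligns the branch phases and gives stronger branches more influence, which maximizes the output signal-to-noise ratio under the equal-noise assumption.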
Chunyan LIANG Lin YANG Qingwei ZHAO Yonghong YAN
In this letter, we adopt a new factor analysis of neighborhood-preserving embedding (NPE) for speaker verification. NPE aims at preserving the local neighborhood structure on the data and defines a low-dimensional speaker space called neighborhood-preserving embedding space. We compare the proposed method with the state-of-the-art total variability approach on the telephone-telephone core condition of the NIST 2008 Speaker Recognition Evaluation (SRE) dataset. The experimental results indicate that the proposed NPE method outperforms the total variability approach, providing up to 24% relative improvement.
Hai YANG Yunfei XU Qinwei ZHAO Ruohua ZHOU Yonghong YAN
Sparse representation has been studied within the field of signal processing as a means of providing a compact form of signal representation. This paper introduces a sparse representation based framework named Sparse Probabilistic Linear Discriminant Analysis in speaker recognition. In this latent variable model, probabilistic linear discriminant analysis is modified to obtain an algorithm for learning overcomplete sparse representations by replacing the Gaussian prior on the factors with Laplace prior that encourages sparseness. For a given speaker signal, the dictionary obtained from this model has good representational power while supporting optimal discrimination of the classes. An expectation-maximization algorithm is derived to train the model with a variational approximation to a range of heavy-tailed distributions whose limit is the Laplace. The variational approximation is also used to compute the likelihood ratio score of all trials of speakers. This approach performed well on the core-extended conditions of the NIST 2010 Speaker Recognition Evaluation, and is competitive compared to the Gaussian Probabilistic Linear Discriminant Analysis, in terms of normalized Decision Cost Function and Equal Error Rate.
Cheng-Hong YANG Li-Yeh CHUANG Cheng-Huei YANG Ching-Hsing LUO
Assistive technology (AT) is becoming increasingly important for improving the mobility and language learning capabilities of persons with disabilities, thus enabling them to function independently and to improve their social opportunities. The Morse code has been shown to be a valuable tool in assistive technology, augmentative and alternative communication, and rehabilitation for people with neuromuscular diseases such as amyotrophic lateral sclerosis, multiple sclerosis, and muscular dystrophy. In this paper, we designed and implemented a wireless environmental control aid system using the Morse code as an adapted access communication tool, which includes three types of switch: single-switch, double-switch, and six-switch types. People with disabilities can easily control all types of electronic appliance without restrictions owing to spatial arrangements using a signal transmission based on radio frequency (RF). Experimental results revealed that three participants with disabilities were able to gain access to electronic facilities after six weeks of practice with the new system.
Shang CAI Yeming XIAO Jielin PAN Qingwei ZHAO Yonghong YAN
Mel Frequency Cepstral Coefficients (MFCC) are the most popular acoustic features used in automatic speech recognition (ASR), mainly because the coefficients capture the most useful information of the speech and fit well with the assumptions used in hidden Markov models. As is well known, MFCCs already employ several principles that have known counterparts in the peripheral properties of human hearing: decoupling across frequency, mel-warping of the frequency axis, log-compression of energy, etc. It is natural to introduce more mechanisms of the auditory periphery to improve the noise robustness of MFCC. In this paper, a k-nearest neighbors based frequency masking filter is proposed to reduce the audibility of spectral valleys, which are sensitive to noise. In addition, Moore and Glasberg's critical band equivalent rectangular bandwidth (ERB) expression is utilized to determine the filter bandwidth. Furthermore, a new bandpass infinite impulse response (IIR) filter is proposed to imitate the temporal masking phenomenon of the human auditory system. These three auditory perceptual mechanisms are combined with the standard MFCC algorithm in order to investigate their effects on ASR performance, and a revised MFCC extraction scheme is presented. Recognition performance with the standard MFCC, RASTA perceptual linear prediction (RASTA-PLP) and the proposed feature extraction scheme is evaluated on a medium-vocabulary isolated-word recognition task and a more complex large vocabulary continuous speech recognition (LVCSR) task. Experimental results show that consistent robustness against background noise is achieved on these two tasks, and the proposed method outperforms both the standard MFCC and RASTA-PLP.
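The abstract does not say which published ERB expression is used to set the filter bandwidth. As a hedged illustration, the sketch below assumes the widely used linear form from Glasberg and Moore (1990), ERB = 24.7 (4.37 F + 1) with F in kHz. The function name `erb_hz` is hypothetical, and the paper may rely on a different variant.

```python
def erb_hz(center_hz):
    """Equivalent rectangular bandwidth in Hz at a given center frequency.

    Assumes the linear Glasberg and Moore (1990) approximation; the
    abstract does not confirm that this is the variant the paper uses.
    """
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)

# Bandwidth grows roughly in proportion to center frequency above a few hundred Hz.
for f in (250.0, 1000.0, 4000.0):
    print(f, round(erb_hz(f), 1))
```

Under this assumption a filter centered at 1 kHz would be roughly 133 Hz wide, and the width widens with frequency much as mel-spaced filters do.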
Hang REN Qingwei ZHAO Yonghong YAN
The optimization of spoken dialog management policies is a non-trivial task due to the erroneous inputs from speech recognition and language understanding modules. The dialog manager needs to ground uncertain semantic information at times to fully understand the needs of human users and successfully complete the required dialog tasks. Approaches based on reinforcement learning are currently mainstream in academia and have been proved to be effective, especially when operating in noisy environments. However, in reinforcement learning the dialog strategy is often represented by a complex numeric model and is thus incomprehensible to humans. The trained policies are very difficult for dialog system designers to verify or modify, which largely limits their deployment in commercial applications. In this paper we propose a novel framework for optimizing dialog policies specified in a human-readable domain language using a genetic algorithm. We present learning algorithms using a user simulator and real human-machine dialog corpora. Empirical results show that the proposed approach can achieve performance on par with some state-of-the-art reinforcement learning algorithms, while maintaining a comprehensible policy structure.
Yongpan LIU Yiqun WANG Hengyu LONG Huazhong YANG
Battery-powered wireless sensor networks are prone to premature failures because some nodes deplete their batteries more rapidly than others due to workload variations, the many-to-one traffic pattern, and heterogeneous hardware. Most previous sensor network lifetime enhancement techniques focused on balancing the power distribution, assuming that all nodes use identical batteries. This paper proposes a novel fine-grained cost-constrained lifetime-aware battery allocation solution for sensor networks with arbitrary topologies and heterogeneous power distributions. Based on an energy–cost battery pack model and an optimal node partitioning algorithm, a rapid battery pack selection heuristic is developed and its deviation from optimality is quantified. Furthermore, we investigate the impact of power variations on the lifetime extension achieved by battery allocation. We prove a theorem showing that power variations of nodes are more likely to reduce the lifetime than to increase it. Experimental results indicate that the proposed technique achieves network lifetime improvements ranging from 4–13 over the uniform battery allocation, with no more than 10 battery pack levels and a speedup of 2-5 orders of magnitude compared with a standard integer nonlinear program (INLP) solver.