Yibo JIANG Hui BI Wei ZHAO Chen SHI Xiaolei WANG
The exposed input and output of an RF power amplifier are susceptible to electrostatic discharge (ESD) damage, and bidirectional protection is required at the input in push-pull operating mode. In this paper, for process compatibility with the power amplifier, cascaded grounded-gate NMOS (ggNMOS) devices and polysilicon diodes (PDIO) are stacked to form an ESD clamp with forward and reverse protection. Transmission line pulse (TLP) and CV measurements demonstrate that the clamp provides latch-up-immune, low-parasitic-capacitance bidirectional ESD protection, with a holding voltage (Vhold) of 18.67/17.34 V, an ESD protection voltage (VESD) of 4.6/3.2 kV, and a parasitic capacitance (CESD) of 0.401/0.415 pF in the forward and reverse directions, respectively.
Yaohui QI Fuping PAN Fengpei GE Qingwei ZHAO Yonghong YAN
A smoothing method for minimum phone error linear regression (MPELR) is proposed in this paper. We show that the objective function for minimum phone error (MPE) can be combined with a prior mean distribution. When the prior mean distribution is based on maximum likelihood (ML) estimates, the proposed method is the same as the previous smoothing technique for MPELR. Instead of ML estimates, the maximum a posteriori (MAP) parameter estimate is used to define the mode of the prior mean distribution to improve the performance of MPELR. Experiments on a large vocabulary speech recognition task show that the proposed method obtains an 8.4% relative reduction in word error rate when the amount of data is limited, while retaining the same asymptotic performance as conventional MPELR. Compared with discriminative maximum a posteriori linear regression (DMAPLR), the proposed method shows improvement except in the case of limited adaptation data for supervised adaptation.
Yuzhuo LIU Hangting CHEN Qingwei ZHAO Pengyuan ZHANG
Weakly labelled semi-supervised audio tagging (AT) and sound event detection (SED) have become significant in real-world applications. A popular method is teacher-student learning, making student models learn from pseudo-labels generated by teacher models from unlabelled data. To generate high-quality pseudo-labels, we propose a master-teacher-student framework trained with a dual-lead policy. Our experiments illustrate that our model outperforms the state-of-the-art model on both tasks.
Jiao DU Ziwei ZHAO Shaojing FU Longjiang QU Chao LI
In this paper, we first recall the concept of 2-tuples distribution matrix, and further study its properties. Based on these properties, we find four special classes of 2-tuples distribution matrices. Then, we provide a new sufficient and necessary condition for n-variable rotation symmetric Boolean functions to be 2-correlation immune. Finally, we give a new method for constructing such functions when n=4t - 1 is prime, and we show an illustrative example.
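The two properties named in this abstract can be checked by brute force for small n. The sketch below is not the authors' 2-tuples distribution matrix condition; it assumes the standard Walsh-spectrum characterization (a function is correlation immune of order m exactly when its Walsh coefficients vanish at all nonzero points of Hamming weight at most m). The truth-table encoding and all function names are illustrative assumptions.

```python
def walsh_coefficient(truth_table, w):
    """Sum over x of (-1)^(f(x) XOR <w, x>), with x and w encoded as integers."""
    total = 0
    for x, fx in enumerate(truth_table):
        dot = bin(x & w).count("1") & 1  # inner product <w, x> over GF(2)
        total += -1 if (fx ^ dot) else 1
    return total


def is_correlation_immune(truth_table, n, order):
    """True if every Walsh coefficient at weight 1..order is zero."""
    return all(
        walsh_coefficient(truth_table, w) == 0
        for w in range(1, 2 ** n)
        if bin(w).count("1") <= order
    )


def rotate_left(x, n):
    """Cyclically rotate the n-bit integer x by one position."""
    return ((x << 1) | (x >> (n - 1))) & (2 ** n - 1)


def is_rotation_symmetric(truth_table, n):
    """True if f takes the same value on x and on its cyclic rotation."""
    return all(
        truth_table[x] == truth_table[rotate_left(x, n)] for x in range(2 ** n)
    )


# Illustrative inputs on 3 variables: parity and majority.
parity3 = [bin(x).count("1") & 1 for x in range(8)]
majority3 = [1 if bin(x).count("1") >= 2 else 0 for x in range(8)]
```

Here `parity3` is rotation symmetric and 2-correlation immune, while `majority3` is rotation symmetric but not even 1-correlation immune. The exhaustive check costs on the order of 4^n operations, so it is only practical for small n.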
Xiang ZHANG Ping LU Hongbin SUO Qingwei ZHAO Yonghong YAN
In this letter, a recently proposed clustering algorithm named affinity propagation is introduced for the task of speaker clustering. This novel algorithm exhibits fast execution speed and finds clusters with low error. However, experiments show that the speaker purity of affinity propagation is not satisfactory. Thus, we propose a hybrid approach that combines affinity propagation with agglomerative hierarchical clustering to improve the clustering performance. Experiments show that, compared with traditional agglomerative hierarchical clustering, the hybrid method achieves better performance on the test corpora.
Mengzhe CHEN Jielin PAN Qingwei ZHAO Yonghong YAN
Multi-task learning in deep neural networks has been proven to be effective for acoustic modeling in speech recognition. In this paper, the technique is applied to Mandarin-English code-mixing recognition. For the primary task of senone classification, three schemes of auxiliary tasks are proposed to introduce language information to the networks and improve the prediction of language switching. On a real-world Mandarin-English test corpus in mobile voice search, the proposed schemes enhanced recognition of both languages and reduced the overall error rate by 3.5%, 3.8% and 5.8% relative, respectively.
Junbo ZHANG Fuping PAN Bin DONG Qingwei ZHAO Yonghong YAN
This paper presents our investigation into improving the performance of our previous automatic reading quality assessment system. The baseline system calculates the average Phone Log-Posterior Probability (PLPP) over all phones in the speech to be assessed and uses this average as the reading quality assessment feature. In this paper, we present three improvements. First, we cluster the triphones, calculate the average normalized PLPP for each cluster separately, and use these averages as multi-dimensional assessment features in place of the original one-dimensional feature. This method is simple but effective: it decreased the difference between machine and manual scores by 30.2% relative. Second, to assess reading rhythm, we train Gaussian Mixture Models (GMMs) that capture each triphone's relative duration under standard pronunciation. Using the GMMs, we calculate the probability that the relative duration of each phone conforms to the standard pronunciation, and the average of these probabilities is added to the assessment feature vector as one dimension, which decreased the difference between machine and manual scores by 9.7% relative. Third, we detect Filled Pauses (FP) by analyzing the formant curve, calculate the relative duration of FP, and add it to the assessment feature vector as another dimension. This method further decreased the difference between machine and manual scores by 10.2% relative. Finally, when the features extracted by the three methods are used together, the difference between machine and manual scores decreased by 43.9% relative compared to the baseline system.
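The step from the one-dimensional baseline feature to the per-cluster features can be sketched as follows. This is a minimal illustration, not the authors' implementation: the PLPP values are assumed to be already computed and normalized, the triphone-to-cluster mapping `cluster_of` is hypothetical, and filling an empty cluster with 0.0 is an arbitrary placeholder choice.

```python
def average_plpp(plpps):
    """Baseline feature: mean PLPP over all phones in the utterance."""
    return sum(plpps) / len(plpps)


def clustered_plpp_features(phones, cluster_of, num_clusters):
    """Per-cluster mean PLPP, giving one feature dimension per triphone cluster.

    phones: list of (triphone_name, plpp) pairs for one utterance.
    cluster_of: dict mapping triphone_name to a cluster index (assumed given).
    """
    sums = [0.0] * num_clusters
    counts = [0] * num_clusters
    for triphone, plpp in phones:
        cluster = cluster_of[triphone]
        sums[cluster] += plpp
        counts[cluster] += 1
    # Clusters with no phones in this utterance get 0.0 (an assumption).
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# Hypothetical utterance with three triphones assigned to two of three clusters.
utterance = [("a-b+c", -1.0), ("b-c+d", -3.0), ("c-d+e", -0.5)]
clusters = {"a-b+c": 0, "b-c+d": 0, "c-d+e": 1}
baseline = average_plpp([plpp for _, plpp in utterance])
features = clustered_plpp_features(utterance, clusters, 3)
```

For this input the baseline feature is -1.5 and the clustered feature vector is [-2.0, -0.5, 0.0]. The duration and filled-pause features described in the abstract would be appended to this vector as further dimensions.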
Wei ZHAO Rui XU Yasushi HIRANO Rie TACHIBANA Shoji KIDO Narufumi SUGANUMA
This paper describes a computer-aided diagnosis (CAD) method to classify pneumoconiosis on HRCT images. In Japan, pneumoconiosis is divided into four types according to the density of nodules: Type 1 (no nodules), Type 2 (few small nodules), Type 3-a (numerous small nodules) and Type 3-b (numerous small nodules and presence of large nodules). Because most pneumoconiotic nodules are small and irregularly shaped, only a few nodules can be detected by conventional nodule extraction methods, which affects the classification of pneumoconiosis. To improve the performance of nodule extraction, we propose a filter based on analysis of the eigenvalues of the Hessian matrix. The classification of pneumoconiosis is performed in the following steps. First, large nodules are extracted and cases of Type 3-b are recognized. Second, for the remaining cases, small nodules are detected and false positives are eliminated. Third, a bag-of-features-based method is adopted to generate input vectors for a support vector machine (SVM) classifier. Finally, cases of Type 1, 2 and 3-a are classified. The proposed method was evaluated on 175 HRCT scans of 112 subjects. The average classification accuracy is 90.6%. The experimental results show that our method would be helpful for classifying pneumoconiosis on HRCT.
Yanqing SUN Yu ZHOU Qingwei ZHAO Yonghong YAN
This paper focuses on the problem of performance degradation in mismatched speech recognition. The F-Ratio analysis method is used to analyze the significance of different frequency bands for speech unit classification, and we find that frequencies around 1 kHz and 3 kHz, which are the upper bounds of the first and second formants for most vowels, should be emphasized in comparison to the Mel-frequency cepstral coefficients (MFCC). The analysis result is further observed to be stable in several typical mismatched situations. Similar to the Mel-frequency scale, another frequency scale called the F-Ratio-scale is thus proposed to optimize the filter bank design for the MFCC features and to make each subband contain equal significance for speech unit classification. Under comparable conditions, the modified features yield a relative decrease in sentence error rate, compared with the MFCC, of 43.20% for emotion-affected speech recognition, 35.54% and 23.03% for noisy speech recognition at 15 dB and 0 dB SNR (signal-to-noise ratio) respectively, and 64.50% for the three years' 863 test data. The application of the F-Ratio analysis to the clean training set of the Aurora2 database demonstrates its robustness over languages, texts and sampling rates.
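A band-wise F-Ratio can be sketched as below. The abstract does not give the exact formula, so this assumes the common definition: the variance of the class means divided by the average within-class variance, computed separately for each frequency band. The data layout and function names are illustrative assumptions.

```python
from statistics import mean, pvariance


def f_ratio(samples_by_class):
    """F-Ratio of one band: between-class variance over mean within-class variance.

    samples_by_class: dict mapping a speech-unit label to a list of band energies.
    """
    class_means = [mean(values) for values in samples_by_class.values()]
    between = pvariance(class_means)
    within = mean(pvariance(values) for values in samples_by_class.values())
    return between / within


def f_ratio_per_band(bands):
    """Apply f_ratio to every band; bands is a list of per-band class dicts."""
    return [f_ratio(band) for band in bands]


# Hypothetical energies of two vowel classes in two frequency bands.
band_a = {"a": [1.0, 3.0], "i": [5.0, 7.0]}  # class means far apart
band_b = {"a": [1.0, 3.0], "i": [1.0, 3.0]}  # classes indistinguishable
ratios = f_ratio_per_band([band_a, band_b])
```

For these inputs the ratios are 4.0 and 0.0: the first band separates the two classes and the second does not. A filter bank in the spirit of the F-Ratio-scale would allocate finer resolution where this ratio is high, but how the paper derives the scale is not stated in the abstract.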
Xiang XIAO Xiang ZHANG Haipeng WANG Hongbin SUO Qingwei ZHAO Yonghong YAN
The GMM-UBM framework has proven to be one of the most effective approaches to the automatic speaker verification (ASV) task in recent years. In this letter, we first propose an approximate decision function for the traditional GMM-UBM, from which it is shown that each Gaussian component contributes equally to classification. However, research in speaker perception shows that different speech sound units, each defined by a Gaussian component, make different contributions to speaker verification. This motivates us to emphasize sound units that discriminate between speakers while de-emphasizing sound units that contain little information for speaker verification. Experiments on the 2006 NIST SRE core task show that the proposed approach outperforms the traditional GMM-UBM approach in classification accuracy.
Qian WANG Qingmei ZHOU Wei ZHAO Xuangou WU Xun SHAO
In the age of big data, recommendation systems provide users with fast access to interesting information, resulting in significant commercial value. However, the extreme sparseness of user assessment data is one of the key factors that lead to the poor performance of recommendation algorithms. To address this problem, we propose a recommendation scheme that combines low-rank matrix completion with spectral clustering. Our scheme exploits spectral clustering to divide users into groups of similar users. Meanwhile, low-rank matrix completion is used to effectively predict un-rated items in the sub-matrix produced by the spectral clustering. Experiments on a real dataset show that the proposed scheme can effectively improve the prediction accuracy of un-rated items.
Han WANG Ruiliu FU Xuejun ZHANG Jun ZHOU Qingwei ZHAO
Lifelong language learning (LLL) aims at learning new tasks while retaining old tasks in the field of NLP. LAMOL is a recent LLL framework following data-free constraints. Previous works built on LAMOL incur additional computation, with more time cost or new parameters. However, a gap remains between them and multi-task learning (MTL), which is regarded as the upper bound of LLL. In this paper, we propose Metacognitive Adaptation (Metac-Adapt), which adds almost no time cost or computational resources, to make the model generate better pseudo samples and then replay them. Experimental results demonstrate that Metac-Adapt is on par with MTL or better.
Zhaoqi LI Ta LI Qingwei ZHAO Pengyuan ZHANG
Query-by-example spoken term detection (QbE-STD) is a task of using speech queries to match utterances, and the acoustic word embedding (AWE) method of generating fixed-length representations for speech segments has shown high performance and efficiency in recent work. We propose an AWE training method using a label-adversarial network to reduce the interference information learned during AWE training. Experiments demonstrate that our method achieves significant improvements on multilingual and zero-resource test sets.
Xin LI Jielin PAN Qingwei ZHAO Yonghong YAN
Morphemes, which are obtained from morphological parsing, and statistical sub-words, which are derived from data-driven splitting, are commonly used as the recognition units for speech recognition of agglutinative languages. In this letter, we propose a discriminative approach to select, for each distinct word type, the splitting result that is more likely to improve the recognizer's performance. An objective function involving the unigram language model (LM) probability and the count of misrecognized phones on the acoustic training data is defined and minimized. After determining the splitting result for each word in the text corpus, we select the frequent units to build a hybrid vocabulary including morphemes and statistical sub-words. Compared to a statistical sub-word based system, the hybrid system achieves a 0.8% reduction in letter error rate (LER) on the test set.
Xiang ZHANG Hongbin SUO Qingwei ZHAO Yonghong YAN
In this letter, we propose a new approach to SVM-based speaker recognition, which uses a novel kind of phonotactic information as the feature for SVM modeling. Gaussian mixture models (GMMs) have proven extremely successful for text-independent speaker recognition. The GMM universal background model (UBM) is a speaker-independent model, each component of which can be considered as modeling some underlying phonetic sound class. We assume that utterances from different speakers should have different average posterior probabilities on the same Gaussian component of the UBM, and that the supervector composed of the average posterior probabilities on all components of the UBM for each utterance should be discriminative. We use these supervectors as the features for SVM-based speaker recognition. Experimental results on a NIST SRE 2006 task show that the proposed approach achieves performance comparable to that of commonly used systems. Fusion results are also presented.
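The supervector described in this abstract can be sketched as follows. To stay self-contained the sketch assumes scalar (one-dimensional) features and a UBM given as plain lists of weights, means and variances; real systems use multi-dimensional features and work in the log domain to avoid underflow. All names are illustrative.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


def posterior_supervector(frames, weights, means, variances):
    """Average posterior probability of each UBM component over an utterance."""
    accumulated = [0.0] * len(weights)
    for x in frames:
        # Weighted likelihood of this frame under every component.
        likelihoods = [
            w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)
        ]
        total = sum(likelihoods)
        for k, likelihood in enumerate(likelihoods):
            accumulated[k] += likelihood / total  # per-frame posterior
    return [value / len(frames) for value in accumulated]


# Hypothetical two-component UBM and two short "utterances".
ubm = dict(weights=[0.5, 0.5], means=[-1.0, 1.0], variances=[1.0, 1.0])
balanced = posterior_supervector([-1.0, 1.0], **ubm)
skewed = posterior_supervector([-1.2, -0.8, -1.0], **ubm)
```

Each supervector sums to 1 and has one entry per UBM component. The `skewed` utterance puts most of its mass on the first component, which is the kind of speaker-dependent difference the SVM is meant to exploit.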
Junbo ZHANG Fuping PAN Bin DONG Qingwei ZHAO Yonghong YAN
In this paper, we present a novel method for automatic pronunciation quality assessment. Unlike the popular “Goodness of Pronunciation” (GOP) method, this method does not map the decoding confidence into a pronunciation quality score, but directly differentiates utterances of different pronunciation quality. In this method, the student's utterance is decoded twice. The first decoding pass obtains the time boundaries of each phone of the utterance by forced alignment using a conventionally trained acoustic model (AM). The second decoding pass differentiates the pronunciation quality of each triphone using a specially trained AM, in which triphones of different pronunciation quality are trained as different units, and the model is trained discriminatively to ensure the best discrimination among triphones that share the same name but have different pronunciation quality scores. The decoding network in the second pass includes triphones of different pronunciation quality, so the phone-level scores can be obtained directly from the decoding result. The phone-level scores are combined into sentence-level scores using the maximum entropy criterion. The experimental results show that the scoring performance increased significantly compared to the GOP method, especially at the sentence level.
Chunyan LIANG Lin YANG Qingwei ZHAO Yonghong YAN
In this letter, we adopt a new factor analysis of neighborhood-preserving embedding (NPE) for speaker verification. NPE aims at preserving the local neighborhood structure of the data and defines a low-dimensional speaker space called the neighborhood-preserving embedding space. We compare the proposed method with the state-of-the-art total variability approach on the telephone-telephone core condition of the NIST 2008 Speaker Recognition Evaluation (SRE) dataset. The experimental results indicate that the proposed NPE method outperforms the total variability approach, providing up to 24% relative improvement.
Hai YANG Yunfei XU Qinwei ZHAO Ruohua ZHOU Yonghong YAN
Sparse representation has been studied within the field of signal processing as a means of providing a compact form of signal representation. This paper introduces a sparse representation based framework named Sparse Probabilistic Linear Discriminant Analysis for speaker recognition. In this latent variable model, probabilistic linear discriminant analysis is modified to obtain an algorithm for learning overcomplete sparse representations by replacing the Gaussian prior on the factors with a Laplace prior that encourages sparseness. For a given speaker signal, the dictionary obtained from this model has good representational power while supporting optimal discrimination of the classes. An expectation-maximization algorithm is derived to train the model with a variational approximation to a range of heavy-tailed distributions whose limit is the Laplace distribution. The variational approximation is also used to compute the likelihood ratio score for all speaker trials. This approach performed well on the core-extended conditions of the NIST 2010 Speaker Recognition Evaluation, and is competitive with Gaussian Probabilistic Linear Discriminant Analysis in terms of normalized Decision Cost Function and Equal Error Rate.
Ruilin PAN Chuanming GE Li ZHANG Wei ZHAO Xun SHAO
Collaborative filtering (CF) is one of the most popular approaches to building recommender systems (RS) and has been extensively implemented in many online applications. However, it still suffers from the new-user cold-start problem, in which users have only a small number of item interactions or purchase records in the system, resulting in poor recommendation performance. Thus, we design a new similarity model that can fully utilize the limited rating information of cold users. We first construct a new metric, Popularity-Mean Squared Difference, which considers the influence of popular items, the average difference between two users' common ratings, and the non-numerical information of ratings. The second new metric, Singularity-Difference, represents the degree to which two users' preferences for items deviate. It uses the distribution of the similarity degree of co-ratings between two users as a weight to adjust the deviation degree. Finally, we take users' personal rating preferences into account by introducing the mean and variance of user ratings. Experimental results on three real-life datasets, MovieLens, Epinions and Netflix, demonstrate that the proposed model outperforms seven popular similarity methods in terms of MAE, precision, recall and F1-Measure under the new-user cold-start condition.
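For orientation, the classic mean squared difference (MSD) similarity that the Popularity-Mean Squared Difference metric takes its name from can be sketched as below. This is the conventional baseline only, not the proposed metric: the popularity weighting, the Singularity-Difference term and the rating-preference adjustment are not specified in the abstract and are omitted. The 1 to 5 rating range and the fallback value for users without common items are assumptions.

```python
def msd_similarity(ratings_u, ratings_v, r_min=1.0, r_max=5.0):
    """Classic MSD similarity over the items both users have rated.

    ratings_u, ratings_v: dicts mapping item id to rating.
    Returns a value in [0, 1]; 1 means identical ratings on common items.
    """
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0  # assumed fallback when there are no co-rated items
    span = r_max - r_min
    msd = sum(
        ((ratings_u[item] - ratings_v[item]) / span) ** 2 for item in common
    ) / len(common)
    return 1.0 - msd


# Hypothetical cold-start user with a single rating, compared with two others.
cold_user = {"item_a": 5.0}
same_taste = {"item_a": 5.0, "item_b": 3.0}
opposite_taste = {"item_a": 1.0, "item_c": 4.0}
```

With a single co-rated item the score jumps between extremes (1.0 against `same_taste`, 0.0 against `opposite_taste`), which illustrates why plain co-rating measures are unreliable for cold users and why additional information is worth folding into the similarity.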
Shang CAI Yeming XIAO Jielin PAN Qingwei ZHAO Yonghong YAN
Mel Frequency Cepstral Coefficients (MFCC) are the most popular acoustic features used in automatic speech recognition (ASR), mainly because the coefficients capture the most useful information in speech and fit well with the assumptions used in hidden Markov models. As is well known, MFCCs already employ several principles that have known counterparts in the peripheral properties of human hearing: decoupling across frequency, mel-warping of the frequency axis, log-compression of energy, etc. It is natural to introduce more mechanisms of the auditory periphery to improve the noise robustness of MFCC. In this paper, a k-nearest neighbors based frequency masking filter is proposed to reduce the audibility of spectral valleys, which are sensitive to noise. In addition, Moore and Glasberg's critical band equivalent rectangular bandwidth (ERB) expression is used to determine the filter bandwidth. Furthermore, a new bandpass infinite impulse response (IIR) filter is proposed to imitate the temporal masking phenomenon of the human auditory system. These three auditory perceptual mechanisms are combined with the standard MFCC algorithm in order to investigate their effects on ASR performance, and a revised MFCC extraction scheme is presented. Recognition performance with the standard MFCC, RASTA perceptual linear prediction (RASTA-PLP) and the proposed feature extraction scheme is evaluated on a medium-vocabulary isolated-word recognition task and a more complex large vocabulary continuous speech recognition (LVCSR) task. Experimental results show that consistent robustness against background noise is achieved on these two tasks, and the proposed method outperforms both the standard MFCC and RASTA-PLP.
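The abstract does not say which ERB expression is used. As an illustration only, the sketch below assumes the widely cited Glasberg and Moore (1990) form, ERB(f) = 24.7 (4.37 f / 1000 + 1) with f in Hz, and shows how it could set a filter bandwidth per center frequency. The function names and the example center frequencies are hypothetical.

```python
def erb_hz(center_frequency_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore 1990 form, assumed)."""
    return 24.7 * (4.37 * center_frequency_hz / 1000.0 + 1.0)


def filter_bandwidths(center_frequencies_hz):
    """Bandwidth for each filter, one per center frequency."""
    return [erb_hz(f) for f in center_frequencies_hz]


# Hypothetical filter center frequencies in Hz.
bandwidths = filter_bandwidths([250.0, 1000.0, 4000.0])
```

The bandwidth grows roughly linearly with center frequency: about 132.6 Hz at 1 kHz. If the paper uses a different ERB expression, only `erb_hz` would need to change.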