Author Search Result

[Author] Yonghong YAN(34hit)

21-34hit(34hit)

  • Discriminative Pronunciation Modeling Using the MPE Criterion

    Meixu SONG  Jielin PAN  Qingwei ZHAO  Yonghong YAN  

     
    LETTER-Speech and Hearing

      Pubricized:
    2014/12/02
      Vol:
    E98-D No:3
      Page(s):
    717-720

    Introducing pronunciation models into decoding has been proven to be benefit to LVCSR. In this paper, a discriminative pronunciation modeling method is presented, within the framework of the Minimum Phone Error (MPE) training for HMM/GMM. In order to bring the pronunciation models into the MPE training, the auxiliary function is rewritten at word level and decomposes into two parts. One is for co-training the acoustic models, and the other is for discriminatively training the pronunciation models. On Mandarin conversational telephone speech recognition task, compared to the baseline using a canonical lexicon, the discriminative pronunciation models reduced the absolute Character Error Rate (CER) by 0.7% on LDC test set, and with the acoustic model co-training, 0.8% additional CER decrease had been achieved.

  • Speech Enhancement Using Improved Adaptive Null-Forming in Frequency Domain with Postfilter

    Heng ZHANG  Qiang FU  Yonghong YAN  

     
    LETTER-Speech and Hearing

      Vol:
    E91-A No:12
      Page(s):
    3812-3816

    In this letter, a two channel frequency domain speech enhancement algorithm is proposed. The algorithm is designed to achieve better overall performance with relatively small array size. An improved version of adaptive null-forming is used, in which noise cancelation is implemented in auditory subbands. And an OM-LSA based postfiltering stage further purifies the output. The algorithm also features interaction between the array processing and the postfilter to make the filter adaptation more robust. This approach achieves considerable improvement on signal-to-noise ratio (SNR) and subjective quality of the desired speech. Experiments confirm the effectiveness of the proposed system.

  • Effective Acoustic Modeling for Pronunciation Quality Scoring of Strongly Accented Mandarin Speech

    Fengpei GE  Changliang LIU  Jian SHAO  Fuping PAN  Bin DONG  Yonghong YAN  

     
    PAPER-Speech and Hearing

      Vol:
    E91-D No:10
      Page(s):
    2485-2492

    In this paper we present our investigation into improving the performance of our computer-assisted language learning (CALL) system through exploiting the acoustic model and features within the speech recognition framework. First, to alleviate channel distortion, speaker-dependent cepstrum mean normalization (CMN) is adopted and the average correlation coefficient (average CC) between machine and expert scores is improved from 78.00% to 84.14%. Second, heteroscedastic linear discriminant analysis (HLDA) is adopted to enhance the discriminability of the acoustic model, which successfully increases the average CC from 84.14% to 84.62%. Additionally, HLDA causes the scoring accuracy to be more stable at various pronunciation proficiency levels, and thus leads to an increase in the speaker correct-rank rate from 85.59% to 90.99%. Finally, we use maximum a posteriori (MAP) estimation to tune the acoustic model to fit strongly accented test speech. As a result, the average CC is improved from 84.62% to 86.57%. These three novel techniques improve the accuracy of evaluating pronunciation quality.

  • Fuzzy Matching of Semantic Class in Chinese Spoken Language Understanding

    Yanling LI  Qingwei ZHAO  Yonghong YAN  

     
    PAPER-Natural Language Processing

      Vol:
    E96-D No:8
      Page(s):
    1845-1852

    Semantic concept in an utterance is obtained by a fuzzy matching methods to solve problems such as words' variation induced by automatic speech recognition (ASR), or missing field of key information by users in the process of spoken language understanding (SLU). A two-stage method is proposed: first, we adopt conditional random field (CRF) for building probabilistic models to segment and label entity names from an input sentence. Second, fuzzy matching based on similarity function is conducted between the named entities labeled by a CRF model and the reference characters of a dictionary. The experiments compare the performances in terms of accuracy and processing speed. Dice similarity and cosine similarity based on TF score can achieve better accuracy performance among four similarity measures, which equal to and greater than 93% in F1-measure. Especially the latter one improved by 8.8% and 9% respectively compared to q-gram and improved edit-distance, which are two conventional methods for string fuzzy matching.

  • A One-Pass Real-Time Decoder Using Memory-Efficient State Network

    Jian SHAO  Ta LI  Qingqing ZHANG  Qingwei ZHAO  Yonghong YAN  

     
    PAPER-ASR System Architecture

      Vol:
    E91-D No:3
      Page(s):
    529-537

    This paper presents our developed decoder which adopts the idea of statically optimizing part of the knowledge sources while handling the others dynamically. The lexicon, phonetic contexts and acoustic model are statically integrated to form a memory-efficient state network, while the language model (LM) is dynamically incorporated on the fly by means of extended tokens. The novelties of our approach for constructing the state network are (1) introducing two layers of dummy nodes to cluster the cross-word (CW) context dependent fan-in and fan-out triphones, (2) introducing a so-called "WI layer" to store the word identities and putting the nodes of this layer in the non-shared mid-part of the network, (3) optimizing the network at state level by a sufficient forward and backward node-merge process. The state network is organized as a multi-layer structure for distinct token propagation at each layer. By exploiting the characteristics of the state network, several techniques including LM look-ahead, LM cache and beam pruning are specially designed for search efficiency. Especially in beam pruning, a layer-dependent pruning method is proposed to further reduce the search space. The layer-dependent pruning takes account of the neck-like characteristics of WI layer and the reduced variety of word endings, which enables tighter beam without introducing much search errors. In addition, other techniques including LM compression, lattice-based bookkeeping and lattice garbage collection are also employed to reduce the memory requirements. Experiments are carried out on a Mandarin spontaneous speech recognition task where the decoder involves a trigram LM and CW triphone models. A comparison with HDecode of HTK toolkits shows that, within 1% performance deviation, our decoder can run 5 times faster with half of the memory footprint.

  • Automatic Language Identification with Discriminative Language Characterization Based on SVM

    Hongbin SUO  Ming LI  Ping LU  Yonghong YAN  

     
    PAPER-Language Identification

      Vol:
    E91-D No:3
      Page(s):
    567-575

    Robust automatic language identification (LID) is the task of identifying the language from a short utterance spoken by an unknown speaker. The mainstream approaches include parallel phone recognition language modeling (PPRLM), support vector machine (SVM) and the general Gaussian mixture models (GMMs). These systems map the cepstral features of spoken utterances into high level scores by classifiers. In this paper, in order to increase the dimension of the score vector and alleviate the inter-speaker variability within the same language, multiple data groups based on supervised speaker clustering are employed to generate the discriminative language characterization score vectors (DLCSV). The back-end SVM classifiers are used to model the probability distribution of each target language in the DLCSV space. Finally, the output scores of back-end classifiers are calibrated by a pair-wise posterior probability estimation (PPPE) algorithm. The proposed language identification frameworks are evaluated on 2003 NIST Language Recognition Evaluation (LRE) databases and the experiments show that the system described in this paper produces comparable results to the existing systems. Especially, the SVM framework achieves an equal error rate (EER) of 4.0% in the 30-second task and outperforms the state-of-art systems by more than 30% relative error reduction. Besides, the performances of proposed PPRLM and GMMs algorithms achieve an EER of 5.1% and 5.0% respectively.

  • A Hybrid Approach for Reverberation Simulation

    Risheng XIA  Junfeng LI  Andrea PRIMAVERA  Stefania CECCHI  Yôiti SUZUKI  Yonghong YAN  

     
    PAPER-Engineering Acoustics

      Vol:
    E98-A No:10
      Page(s):
    2101-2108

    Real-time high-quality reverberation simulation plays an important role in many modern applications, such as computer games and virtual reality. Traditional physically and perceptually-based reverberation simulation techniques, however, suffer from the high computational cost and lack of the specific characteristics of the enclosure. In this paper, a hybrid reverberation simulation approach is proposed in which early reflections are reproduced by convolving with the impulse response modeled by the image-source method and late reverberation is reproduced by the feedback delay network. A parametric predictor of energy decay relief is presented for the modeled early reflections and then exploited to compute the parameters of the feedback delay network. This ensures the smooth transition in the time-frequency domain from early to late reflections. Numerical simulation and listening tests validate the effectiveness of this proposed hybrid reverberation simulation approach.

  • A Two-Stage Attention Based Modality Fusion Framework for Multi-Modal Speech Emotion Recognition

    Dongni HU  Chengxin CHEN  Pengyuan ZHANG  Junfeng LI  Yonghong YAN  Qingwei ZHAO  

     
    LETTER-Human-computer Interaction

      Pubricized:
    2021/04/30
      Vol:
    E104-D No:8
      Page(s):
    1391-1394

    Recently, automated recognition and analysis of human emotion has attracted increasing attention from multidisciplinary communities. However, it is challenging to utilize the emotional information simultaneously from multiple modalities. Previous studies have explored different fusion methods, but they mainly focused on either inter-modality interaction or intra-modality interaction. In this letter, we propose a novel two-stage fusion strategy named modality attention flow (MAF) to model the intra- and inter-modality interactions simultaneously in a unified end-to-end framework. Experimental results show that the proposed approach outperforms the widely used late fusion methods, and achieves even better performance when the number of stacked MAF blocks increases.

  • Improved End-to-End Speech Recognition Using Adaptive Per-Dimensional Learning Rate Methods

    Xuyang WANG  Pengyuan ZHANG  Qingwei ZHAO  Jielin PAN  Yonghong YAN  

     
    LETTER-Acoustic modeling

      Pubricized:
    2016/07/19
      Vol:
    E99-D No:10
      Page(s):
    2550-2553

    The introduction of deep neural networks (DNNs) leads to a significant improvement of the automatic speech recognition (ASR) performance. However, the whole ASR system remains sophisticated due to the dependent on the hidden Markov model (HMM). Recently, a new end-to-end ASR framework, which utilizes recurrent neural networks (RNNs) to directly model context-independent targets with connectionist temporal classification (CTC) objective function, is proposed and achieves comparable results with the hybrid HMM/DNN system. In this paper, we investigate per-dimensional learning rate methods, ADAGRAD and ADADELTA included, to improve the recognition of the end-to-end system, based on the fact that the blank symbol used in CTC technique dominates the output and these methods give frequent features small learning rates. Experiment results show that more than 4% relative reduction of word error rate (WER) as well as 5% absolute improvement of label accuracy on the training set are achieved when using ADADELTA, and fewer epochs of training are needed.

  • Speeding up Deep Neural Networks in Speech Recognition with Piecewise Quantized Sigmoidal Activation Function

    Anhao XING  Qingwei ZHAO  Yonghong YAN  

     
    LETTER-Acoustic modeling

      Pubricized:
    2016/07/19
      Vol:
    E99-D No:10
      Page(s):
    2558-2561

    This paper proposes a new quantization framework on activation function of deep neural networks (DNN). We implement fixed-point DNN by quantizing the activations into powers-of-two integers. The costly multiplication operations in using DNN can be replaced with low-cost bit-shifts to massively save computations. Thus, applying DNN-based speech recognition on embedded systems becomes much easier. Experiments show that the proposed method leads to no performance degradation.

  • Short Text Classification Based on Distributional Representations of Words

    Chenglong MA  Qingwei ZHAO  Jielin PAN  Yonghong YAN  

     
    LETTER-Text classification

      Pubricized:
    2016/07/19
      Vol:
    E99-D No:10
      Page(s):
    2562-2565

    Short texts usually encounter the problem of data sparseness, as they do not provide sufficient term co-occurrence information. In this paper, we show how to mitigate the problem in short text classification through word embeddings. We assume that a short text document is a specific sample of one distribution in a Gaussian-Bayesian framework. Furthermore, a fast clustering algorithm is utilized to expand and enrich the context of short text in embedding space. This approach is compared with those based on the classical bag-of-words approaches and neural network based methods. Experimental results validate the effectiveness of the proposed method.

  • Melody Track Selection Using Discriminative Language Model

    Xiao WU  Ming LI  Hongbin SUO  Yonghong YAN  

     
    LETTER-Music Information Processing

      Vol:
    E91-D No:6
      Page(s):
    1838-1840

    In this letter we focus on the task of selecting the melody track from a polyphonic MIDI file. Based on the intuition that music and language are similar in many aspects, we solve the selection problem by introducing an n-gram language model to learn the melody co-occurrence patterns in a statistical manner and determine the melodic degree of a given MIDI track. Furthermore, we propose the idea of using background model and posterior probability criteria to make modeling more discriminative. In the evaluation, the achieved 81.6% correct rate indicates the feasibility of our approach.

  • Enhancing the Robustness of the Posterior-Based Confidence Measures Using Entropy Information for Speech Recognition

    Yanqing SUN  Yu ZHOU  Qingwei ZHAO  Pengyuan ZHANG  Fuping PAN  Yonghong YAN  

     
    PAPER-Robust Speech Recognition

      Vol:
    E93-D No:9
      Page(s):
    2431-2439

    In this paper, the robustness of the posterior-based confidence measures is improved by utilizing entropy information, which is calculated for speech-unit-level posteriors using only the best recognition result, without requiring a larger computational load than conventional methods. Using different normalization methods, two posterior-based entropy confidence measures are proposed. Practical details are discussed for two typical levels of hidden Markov model (HMM)-based posterior confidence measures, and both levels are compared in terms of their performances. Experiments show that the entropy information results in significant improvements in the posterior-based confidence measures. The absolute improvements of the out-of-vocabulary (OOV) rejection rate are more than 20% for both the phoneme-level confidence measures and the state-level confidence measures for our embedded test sets, without a significant decline of the in-vocabulary accuracy.

  • Smoothing Method for Improved Minimum Phone Error Linear Regression

    Yaohui QI  Fuping PAN  Fengpei GE  Qingwei ZHAO  Yonghong YAN  

     
    PAPER-Speech and Hearing

      Vol:
    E97-D No:8
      Page(s):
    2105-2113

    A smoothing method for minimum phone error linear regression (MPELR) is proposed in this paper. We show that the objective function for minimum phone error (MPE) can be combined with a prior mean distribution. When the prior mean distribution is based on maximum likelihood (ML) estimates, the proposed method is the same as the previous smoothing technique for MPELR. Instead of ML estimates, maximum a posteriori (MAP) parameter estimate is used to define the mode of prior mean distribution to improve the performance of MPELR. Experiments on a large vocabulary speech recognition task show that the proposed method can obtain 8.4% relative reduction in word error rate when the amount of data is limited, while retaining the same asymptotic performance as conventional MPELR. When compared with discriminative maximum a posteriori linear regression (DMAPLR), the proposed method shows improvement except for the case of limited adaptation data for supervised adaptation.

21-34hit(34hit)

FlyerIEICE has prepared a flyer regarding multilingual services. Please use the one in your native language.