Author Search Result

[Author] Fuping PAN(6hit)

1-6hit
  • Effective Acoustic Modeling for Pronunciation Quality Scoring of Strongly Accented Mandarin Speech

    Fengpei GE  Changliang LIU  Jian SHAO  Fuping PAN  Bin DONG  Yonghong YAN  

     
    PAPER-Speech and Hearing

      Vol:
    E91-D No:10
      Page(s):
    2485-2492

    In this paper we present our investigation into improving the performance of our computer-assisted language learning (CALL) system through exploiting the acoustic model and features within the speech recognition framework. First, to alleviate channel distortion, speaker-dependent cepstrum mean normalization (CMN) is adopted and the average correlation coefficient (average CC) between machine and expert scores is improved from 78.00% to 84.14%. Second, heteroscedastic linear discriminant analysis (HLDA) is adopted to enhance the discriminability of the acoustic model, which successfully increases the average CC from 84.14% to 84.62%. Additionally, HLDA causes the scoring accuracy to be more stable at various pronunciation proficiency levels, and thus leads to an increase in the speaker correct-rank rate from 85.59% to 90.99%. Finally, we use maximum a posteriori (MAP) estimation to tune the acoustic model to fit strongly accented test speech. As a result, the average CC is improved from 84.62% to 86.57%. These three novel techniques improve the accuracy of evaluating pronunciation quality.

  • Enhancing the Robustness of the Posterior-Based Confidence Measures Using Entropy Information for Speech Recognition

    Yanqing SUN  Yu ZHOU  Qingwei ZHAO  Pengyuan ZHANG  Fuping PAN  Yonghong YAN  

     
    PAPER-Robust Speech Recognition

      Vol:
    E93-D No:9
      Page(s):
    2431-2439

    In this paper, the robustness of the posterior-based confidence measures is improved by utilizing entropy information, which is calculated for speech-unit-level posteriors using only the best recognition result, without requiring a larger computational load than conventional methods. Using different normalization methods, two posterior-based entropy confidence measures are proposed. Practical details are discussed for two typical levels of hidden Markov model (HMM)-based posterior confidence measures, and both levels are compared in terms of their performances. Experiments show that the entropy information results in significant improvements in the posterior-based confidence measures. The absolute improvements of the out-of-vocabulary (OOV) rejection rate are more than 20% for both the phoneme-level confidence measures and the state-level confidence measures for our embedded test sets, without a significant decline of the in-vocabulary accuracy.

  • Smoothing Method for Improved Minimum Phone Error Linear Regression

    Yaohui QI  Fuping PAN  Fengpei GE  Qingwei ZHAO  Yonghong YAN  

     
    PAPER-Speech and Hearing

      Vol:
    E97-D No:8
      Page(s):
    2105-2113

    A smoothing method for minimum phone error linear regression (MPELR) is proposed in this paper. We show that the objective function for minimum phone error (MPE) can be combined with a prior mean distribution. When the prior mean distribution is based on maximum likelihood (ML) estimates, the proposed method is the same as the previous smoothing technique for MPELR. Instead of ML estimates, maximum a posteriori (MAP) parameter estimate is used to define the mode of prior mean distribution to improve the performance of MPELR. Experiments on a large vocabulary speech recognition task show that the proposed method can obtain 8.4% relative reduction in word error rate when the amount of data is limited, while retaining the same asymptotic performance as conventional MPELR. When compared with discriminative maximum a posteriori linear regression (DMAPLR), the proposed method shows improvement except for the case of limited adaptation data for supervised adaptation.

  • A Forced Alignment Based Approach for English Passage Reading Assessment

    Junbo ZHANG  Fuping PAN  Bin DONG  Qingwei ZHAO  Yonghong YAN  

     
    PAPER-Speech and Hearing

      Vol:
    E95-D No:12
      Page(s):
    3046-3052

    This paper presents our investigation into improving the performance of our previous automatic reading quality assessment system. The method of the baseline system is calculating the average value of the Phone Log-Posterior Probability (PLPP) of all phones in the voice to be assessed, and the average value is used as the reading quality assessment feature. In this paper, we presents three improvements. First, we cluster the triphones, and then calculate the average value of the normalized PLPP for each classification separately, and use this average values as the multi-dimensional assessment features instead of the original one-dimensional assessment feature. This method is simple but effective, which made the score difference of the machine scoring and manual scoring decrease by 30.2% relatively. Second, in order to assess the reading rhythm, we train Gaussian Mixture Models (GMM), which contain the information of each triphone's relative duration under standard pronunciation. Using the GMM, we can calculate the probability that the relative duration of each phone is conform to the standard pronunciation, and the average value of the probabilities is added to the assessment feature vector as a dimension of feature, which decreased the score difference between the machine scoring and manual scoring by 9.7% relatively. Third, we detect Filled Pauses (FP) by analyzing the formant curve, and then calculate the relative duration of FP, and add the relative duration of FP to the assessment feature vector as a dimension of feature. This method made the score difference between the machine scoring and manual scoring be further decreased by 10.2% relatively. Finally, when the feature vector extracted by the three methods are used together, the score difference between the machine scoring and manual scoring was decreased by 43.9% relatively compared to the baseline system.

  • An LVCSR Based Reading Miscue Detection System Using Knowledge of Reference and Error Patterns

    Changliang LIU  Fuping PAN  Fengpei GE  Bin DONG  Hongbin SUO  Yonghong YAN  

     
    PAPER-Speech and Hearing

      Vol:
    E92-D No:9
      Page(s):
    1716-1724

    This paper describes a reading miscue detection system based on the conventional Large Vocabulary Continuous Speech Recognition (LVCSR) framework [1]. In order to incorporate the knowledge of reference (what the reader ought to read) and some error patterns into the decoding process, two methods are proposed: Dynamic Multiple Pronunciation Incorporation (DMPI) and Dynamic Interpolation of Language Model (DILM). DMPI dynamically adds some pronunciation variations into the search space to predict reading substitutions and insertions. To resolve the conflict between the coverage of error predications and the perplexity of the search space, only the pronunciation variants related to the reference are added. DILM dynamically interpolates the general language model based on the analysis of the reference and so keeps the active paths of decoding relatively near the reference. It makes the recognition more accurate, which further improves the detection performance. At the final stage of detection, an improved dynamic program (DP) is used to align the confusion network (CN) from speech recognition and the reference to generate the detecting result. The experimental results show that the proposed two methods can decrease the Equal Error Rate (EER) by 14% relatively, from 46.4% to 39.8%.

  • A Novel Discriminative Method for Pronunciation Quality Assessment

    Junbo ZHANG  Fuping PAN  Bin DONG  Qingwei ZHAO  Yonghong YAN  

     
    PAPER-Speech and Hearing

      Vol:
    E96-D No:5
      Page(s):
    1145-1151

    In this paper, we presented a novel method for automatic pronunciation quality assessment. Unlike the popular “Goodness of Pronunciation” (GOP) method, this method does not map the decoding confidence into pronunciation quality score, but differentiates the different pronunciation quality utterances directly. In this method, the student's utterance need to be decoded for two times. The first-time decoding was for getting the time points of each phone of the utterance by a forced alignment using a conventional trained acoustic model (AM). The second-time decoding was for differentiating the pronunciation quality for each triphone using a specially trained AM, where the triphones in different pronunciation qualities were trained as different units, and the model was trained in discriminative method to ensure the model has the best discrimination among the triphones whose names were same but pronunciation quality scores were different. The decoding network in the second-time decoding included different pronunciation quality triphones, so the phone-level scores can be obtained from the decoding result directly. The phone-level scores were combined into the sentence-level scores using maximum entropy criterion. The experimental results shows that the scoring performance was increased significantly compared to the GOP method, especially in sentence-level.

FlyerIEICE has prepared a flyer regarding multilingual services. Please use the one in your native language.