Author Search Result

[Author] Koichi SHINODA (9 hits)

  • Acoustic Model Adaptation for Speech Recognition

    Koichi SHINODA  

     
    INVITED PAPER

      Vol:
    E93-D No:9
      Page(s):
    2348-2362

    Statistical speech recognition using continuous-density hidden Markov models (CDHMMs) has yielded many practical applications. However, in general, mismatches between the training data and input data significantly degrade recognition accuracy. Various acoustic model adaptation techniques using a few input utterances have been employed to overcome this problem. In this article, we survey these adaptation techniques, including maximum a posteriori (MAP) estimation, maximum likelihood linear regression (MLLR), and eigenvoice. We also present a schematic view called the adaptation pyramid to illustrate how these methods relate to each other.
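
    The MAP technique surveyed above interpolates between a prior (speaker-independent) model and statistics collected from the adaptation data. The sketch below shows this update for the mean of a single Gaussian; it is a minimal illustration, and the variable names and prior weight value are assumptions, not taken from the paper.

    ```python
    import numpy as np

    def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
        """MAP update of one Gaussian mean from adaptation data.

        prior_mean : (D,) mean of the speaker-independent Gaussian
        frames     : (T, D) adaptation feature vectors
        posteriors : (T,) occupation probabilities gamma_t of this Gaussian
        tau        : assumed prior weight; larger values trust the prior more
        """
        gamma_sum = posteriors.sum()
        weighted_sum = (posteriors[:, None] * frames).sum(axis=0)
        # Convex combination of the prior mean and the weighted data statistics
        return (tau * prior_mean + weighted_sum) / (tau + gamma_sum)

    # Toy usage: 100 frames of 39-dimensional features
    rng = np.random.default_rng(0)
    mu_map = map_adapt_mean(np.zeros(39),
                            rng.normal(0.5, 1.0, size=(100, 39)),
                            rng.uniform(size=100))
    ```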

  • Robust Scene Extraction Using Multi-Stream HMMs for Baseball Broadcast

    Nguyen Huu BACH  Koichi SHINODA  Sadaoki FURUI  

     
    PAPER-Image Processing and Video Processing

      Vol:
    E89-D No:9
      Page(s):
    2553-2561

    In this paper, we propose a robust statistical framework for extracting scenes from a baseball broadcast video. We apply multi-stream hidden Markov models (HMMs) to control the weights among different features. To achieve robustness against new scenes, we use a common, simple structure for all the HMMs. In addition, scene segmentation and unsupervised adaptation are applied to achieve greater robustness against differences in environmental conditions among games. The F-measure in scene-extraction experiments on eight types of scene from 4.5 hours of digest data was 77.4%, which increased to 78.7% when scene segmentation was applied. Furthermore, the unsupervised adaptation method improved precision by 2.7 points, to 81.4%. These results confirm the effectiveness of our framework.
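
    The central mechanism here is the multi-stream observation score: per-stream log-likelihoods are combined with stream weights. A minimal sketch follows, assuming Gaussian per-stream models for illustration; the actual features and weight-control scheme are not reproduced.

    ```python
    import numpy as np
    from scipy.stats import multivariate_normal

    def multistream_log_score(obs_streams, means, covs, weights):
        """Weighted sum of per-stream Gaussian log-likelihoods.

        obs_streams, means, covs: one entry per feature stream
        weights: stream weights (exponents on the stream likelihoods)
        """
        return sum(w * multivariate_normal.logpdf(o, mean=mu, cov=cov)
                   for o, mu, cov, w in zip(obs_streams, means, covs, weights))

    # Toy usage: a 3-dim visual stream and a 2-dim audio stream
    print(multistream_log_score([np.zeros(3), np.ones(2)],
                                [np.zeros(3), np.zeros(2)],
                                [np.eye(3), np.eye(2)],
                                [0.7, 0.3]))
    ```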

  • Error Correction Using Long Context Match for Smartphone Speech Recognition

    Yuan LIANG  Koji IWANO  Koichi SHINODA  

     
    PAPER-Speech and Hearing

  Publicized:
    2015/07/31
      Vol:
    E98-D No:11
      Page(s):
    1932-1942

    Most error correction interfaces for speech recognition applications on smartphones require the user to first mark an error region and then choose the correct word from a candidate list. We propose a simple multimodal interface that makes this process more efficient. We develop Long Context Match (LCM) to obtain candidates that complement the conventional word confusion network (WCN). Assuming that not only the preceding words but also the succeeding words of the error region have been validated by the user, we use these contexts to search higher-order n-gram corpora for matching word sequences. For this purpose, we also utilize Web text data. Furthermore, we propose a combination of LCM and WCN (“LCM + WCN”) that provides users with candidate lists more relevant than those yielded by WCN alone. We compared our interface with the WCN-based interface on the Corpus of Spontaneous Japanese (CSJ). Our proposed “LCM + WCN” method improved the 1-best accuracy by 23% and the Mean Reciprocal Rank (MRR) by 28%, and our interface reduced the user's load by 12%.
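
    The LCM idea can be illustrated as a search over an n-gram list: find sequences that share the validated left and right contexts around the error region and propose the middle words as candidates. The toy corpus, the frequency-based ranking, and the single-region setup below are illustrative assumptions.

    ```python
    # A minimal sketch of Long Context Match over a toy n-gram list.
    def lcm_candidates(left, right, ngrams):
        """left, right: tuples of user-validated context words.
        ngrams: iterable of (word_tuple, count) pairs."""
        cands = {}
        n_l, n_r = len(left), len(right)
        for words, count in ngrams:
            if len(words) > n_l + n_r and words[:n_l] == left and words[-n_r:] == right:
                middle = words[n_l:-n_r]
                cands[middle] = cands.get(middle, 0) + count
        # Rank correction candidates by corpus frequency
        return sorted(cands.items(), key=lambda kv: -kv[1])

    ngrams = [(("went", "to", "the", "station"), 120),
              (("went", "to", "a", "station"), 40)]
    print(lcm_candidates(("went", "to"), ("station",), ngrams))
    # [(('the',), 120), (('a',), 40)]
    ```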

  • Speech Paralinguistic Approach for Detecting Dementia Using Gated Convolutional Neural Network

    Mariana RODRIGUES MAKIUCHI  Tifani WARNITA  Nakamasa INOUE  Koichi SHINODA  Michitaka YOSHIMURA  Momoko KITAZAWA  Kei FUNAKI  Yoko EGUCHI  Taishiro KISHIMOTO  

     
    PAPER-Artificial Intelligence, Data Mining

  Publicized:
    2021/08/03
      Vol:
    E104-D No:11
      Page(s):
    1930-1940

    We propose a non-invasive and cost-effective method to automatically detect dementia using speech audio data alone. We extract paralinguistic features from a short speech segment and use Gated Convolutional Neural Networks (GCNNs) to classify it as dementia or healthy. We evaluate our method on the Pitt Corpus and on our own dataset, the PROMPT Database. Our method achieves an accuracy of 73.1% on the Pitt Corpus using an average of 114 seconds of speech data. On the PROMPT Database, it achieves an accuracy of 74.7% using 4 seconds of speech data, which improves to 80.8% when all of a patient's speech data are used. Furthermore, we evaluate our method on a three-class classification problem that includes a Mild Cognitive Impairment (MCI) class and achieve an accuracy of 60.6% with 40 seconds of speech data.
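
    The gating mechanism at the core of a GCNN can be sketched as a gated linear unit: one convolution produces content and a second produces a sigmoid gate. The PyTorch block below is a minimal illustration; the layer sizes and feature dimensionality are assumptions, not the paper's configuration.

    ```python
    import torch
    import torch.nn as nn

    class GatedConv1d(nn.Module):
        def __init__(self, in_ch, out_ch, kernel_size):
            super().__init__()
            # One convolution each for the content and the gate
            self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)
            self.gate = nn.Conv1d(in_ch, out_ch, kernel_size)

        def forward(self, x):
            # Element-wise product of content and sigmoid gate: A * sigmoid(B)
            return self.conv(x) * torch.sigmoid(self.gate(x))

    # (batch, feature_dim, time): e.g. paralinguistic features over a segment
    x = torch.randn(8, 88, 400)
    block = GatedConv1d(88, 64, kernel_size=3)
    print(block(x).shape)  # torch.Size([8, 64, 398])
    ```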

  • Spectral Subtraction Based on Non-extensive Statistics for Speech Recognition

    Hilman PARDEDE  Koji IWANO  Koichi SHINODA  

     
    PAPER-Speech and Hearing

      Vol:
    E96-D No:8
      Page(s):
    1774-1782

    Spectral subtraction (SS) is an additive-noise removal method derived in the extensive statistical framework. In spectral subtraction, speech and noise spectra are assumed to follow Gaussian distributions and to be independent of each other; hence, noisy speech also follows a Gaussian distribution. The spectral subtraction formula is obtained by maximizing the likelihood of the noisy speech distribution with respect to its variance. However, it is well known that noisy speech observed in real situations often follows a heavy-tailed distribution rather than a Gaussian one. In this paper, we introduce the q-Gaussian distribution from non-extensive statistics to represent the distribution of noisy speech and derive a new spectral subtraction method based on it. We found that the q-Gaussian distribution fits the noisy speech distribution better than the Gaussian distribution does. Our speech recognition experiments using the Aurora-2 database showed that the proposed method, q-spectral subtraction (q-SS), outperformed the conventional SS method.
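
    For reference, the conventional SS baseline the paper improves on can be sketched as follows; q-SS replaces this maximum-likelihood subtraction rule with one derived under the q-Gaussian assumption, which is not reproduced here. The over-subtraction and flooring parameter values are illustrative.

    ```python
    import numpy as np

    def spectral_subtraction(noisy_power, noise_power, alpha=1.0, beta=0.01):
        """Conventional power spectral subtraction.

        noisy_power, noise_power: per-bin power spectra (same shape)
        alpha: over-subtraction factor; beta: spectral floor
        """
        est = noisy_power - alpha * noise_power
        # Floor negative estimates to a small fraction of the noisy spectrum
        return np.maximum(est, beta * noisy_power)
    ```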

  • Robust Gait-Based Person Identification against Walking Speed Variations

    Muhammad Rasyid AQMAR  Koichi SHINODA  Sadaoki FURUI  

     
    PAPER-Image Recognition, Computer Vision

      Vol:
    E95-D No:2
      Page(s):
    668-676

    Variations in walking speed have a strong impact on gait-based person identification. We propose a method that is robust against walking-speed variations. It is based on a combination of cubic higher-order local auto-correlation (CHLAC), gait silhouette-based principal component analysis (GSP), and a statistical framework using hidden Markov models (HMMs). The CHLAC features capture the within-phase spatio-temporal characteristics of each individual, the GSP features retain more shape/phase information for better gait sequence alignment, and the HMMs identify each individual from their gait even when walking speed changes nonlinearly. We compared the performance of our method with that of other conventional methods using five databases: SOTON, USF-NIST, CMU-MoBo, TokyoTech A, and TokyoTech B. The proposed method was equal to or better than the others when the speed did not change greatly, and significantly better when the speed varied across and within a gait sequence.
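
    The GSP step amounts to projecting vectorized gait silhouettes onto their principal components. Below is a minimal sketch, with silhouette size and component count chosen only for illustration; CHLAC extraction and HMM scoring are not shown.

    ```python
    import numpy as np

    def gsp_features(silhouettes, n_components=16):
        """silhouettes: (T, H*W) silhouettes flattened per frame."""
        centered = silhouettes - silhouettes.mean(axis=0)
        # Principal directions via SVD of the centered frame matrix
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return centered @ vt[:n_components].T  # (T, n_components)

    # Toy usage: 60 frames of 64x44 binary silhouettes
    frames = (np.random.rand(60, 64 * 44) > 0.5).astype(float)
    feats = gsp_features(frames)
    ```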

  • Online Speaker Clustering Using Incremental Learning of an Ergodic Hidden Markov Model

    Takafumi KOSHINAKA  Kentaro NAGATOMO  Koichi SHINODA  

     
    PAPER-Speech and Hearing

      Vol:
    E95-D No:10
      Page(s):
    2469-2478

    A novel online speaker clustering method based on a generative model is proposed. It employs an incremental variant of variational Bayesian learning and provides probabilistic (non-deterministic) decisions for each input utterance on the basis of the history of preceding utterances. It is expected to be robust against errors in cluster estimation and utterance classification, and hence applicable to many real-time applications. Experimental results show that it produces 50% fewer classification errors than a conventional online method. They also show that the number of speech recognition errors can be reduced by combining the method with unsupervised speaker adaptation.
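
    The online decision step can be sketched as scoring each incoming utterance against the existing clusters plus a new-cluster hypothesis, then updating the winning cluster's statistics incrementally. The embedding representation, squared-distance score, and fixed new-cluster threshold below are illustrative assumptions; the paper's variational Bayesian machinery is not reproduced.

    ```python
    import numpy as np

    def assign_utterance(utt, clusters, new_cluster_score=-8.0):
        """utt: (D,) utterance embedding; clusters: dicts with 'mean' and 'n'."""
        scores = [-0.5 * np.sum((utt - c["mean"]) ** 2) for c in clusters]
        scores.append(new_cluster_score)  # fixed score for opening a new cluster
        k = int(np.argmax(scores))
        if k == len(clusters):
            clusters.append({"mean": utt.copy(), "n": 1})
        else:
            c = clusters[k]
            c["n"] += 1
            c["mean"] += (utt - c["mean"]) / c["n"]  # incremental running mean
        return k
    ```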

  • Active Learning Using Phone-Error Distribution for Speech Modeling

    Hiroko MURAKAMI  Koichi SHINODA  Sadaoki FURUI  

     
    PAPER-Speech and Hearing

      Vol:
    E95-D No:10
      Page(s):
    2486-2494

    We propose an active learning framework for speech recognition that reduces the amount of data required for acoustic modeling. This framework consists of two steps. We first obtain a phone-error distribution using an acoustic model estimated from transcribed speech data. Then, from a text corpus we select a sentence whose phone-occurrence distribution is close to the phone-error distribution and collect its speech data. We repeat this process to increase the amount of transcribed speech data. We applied this framework to speaker adaptation and acoustic model training. Our evaluation results showed that it significantly reduced the amount of transcribed data while maintaining the same level of accuracy.
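
    The selection step can be sketched as a nearest-distribution search: pick the sentence whose phone-occurrence distribution is closest to the current phone-error distribution. The KL divergence used as the distance measure below is an illustrative assumption.

    ```python
    import numpy as np

    def select_sentence(phone_error_dist, sentence_phone_dists):
        """phone_error_dist: (P,) normalized error counts per phone.
        sentence_phone_dists: (S, P) per-sentence phone-occurrence distributions.
        Returns the index of the sentence closest to the error distribution."""
        p = phone_error_dist + 1e-10              # smooth to avoid log(0)
        q = sentence_phone_dists + 1e-10
        kl = np.sum(p * (np.log(p) - np.log(q)), axis=1)  # KL(p || q) per sentence
        return int(np.argmin(kl))
    ```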

  • Committee-Based Active Learning for Speech Recognition

    Yuzo HAMANAKA  Koichi SHINODA  Takuya TSUTAOKA  Sadaoki FURUI  Tadashi EMORI  Takafumi KOSHINAKA  

     
    PAPER-Speech and Hearing

      Vol:
    E94-D No:10
      Page(s):
    2015-2023

    We propose a committee-based method of active learning for large-vocabulary continuous speech recognition. In this approach, multiple recognizers are trained, and their recognition results are used to select utterances: those whose recognition results differ the most among the recognizers are selected and transcribed. Progressive alignment and voting entropy are used to measure the degree of disagreement among the recognizers. Our method was evaluated using 191 hours of speech data from the Corpus of Spontaneous Japanese and proved to be significantly better than random selection. It required only 63 hours of data to achieve a word accuracy of 74%, whereas standard training (i.e., random selection) required 103 hours. It also proved significantly better than conventional uncertainty sampling using word posterior probabilities.
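
    Voting entropy can be sketched as follows: at each aligned word position, count the committee's votes and compute the entropy of the vote distribution; utterances with high average entropy are the most informative to transcribe. The sketch assumes hypotheses of equal length for simplicity (the paper uses progressive alignment to handle the general case).

    ```python
    import math
    from collections import Counter

    def voting_entropy(hypotheses):
        """hypotheses: list of word lists, one per recognizer, same length."""
        k = len(hypotheses)
        total = 0.0
        for votes in zip(*hypotheses):  # one tuple of votes per word position
            counts = Counter(votes)
            total -= sum((c / k) * math.log(c / k) for c in counts.values())
        return total / len(hypotheses[0])  # average entropy per position

    hyps = [["i", "want", "to", "go"],
            ["i", "want", "a", "go"],
            ["i", "won't", "to", "go"]]
    print(voting_entropy(hyps))  # higher = more committee disagreement
    ```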
