Author Search Result

[Author] Yasunari OBUCHI (6 hits)

  • Multi-Input Feature Combination in the Cepstral Domain for Practical Speech Recognition Systems

    Yasunari OBUCHI  Nobuo HATAOKA  

     
    PAPER-Speech and Hearing

      Vol:
    E92-D No:4
      Page(s):
    662-670

    In this paper we describe a new framework of feature combination in the cepstral domain for multi-input robust speech recognition. The general framework of working in the cepstral domain has various advantages over working in the time or hypothesis domain: it is stable, easy to maintain, and less expensive because it does not require precise calibration, and it is easy to configure within a complex speech recognition system. However, it is not straightforward to improve recognition performance simply by increasing the number of inputs, so we introduce the concept of variance re-scaling to compensate for the negative effect of averaging several input features. Finally, we exploit another advantage of working in the cepstral domain: speech can be modeled using hidden Markov models, and the model can be used as prior knowledge. This approach is formulated as a new algorithm, referred to as Hypothesis-Based Feature Combination. The effectiveness of the various algorithms is evaluated using two sets of speech databases. We also discuss automatic optimization of some parameters in the proposed algorithms.
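
    The variance re-scaling idea can be sketched in a few lines of Python. This is an illustrative sketch under assumed interfaces (the function name and the choice of target deviation are not from the paper): averaging several feature streams shrinks their variance, so the combined cepstra are re-scaled before decoding.

        import numpy as np

        def combine_features(streams):
            """Average per-channel cepstral features, then re-scale the variance.

            streams: list of (frames, dims) arrays, one per input channel.
            Averaging reduces the variance of the combined features, which
            mismatches the acoustic model; re-scaling compensates for this.
            """
            avg = np.mean(streams, axis=0)                          # (frames, dims)
            # Assumption: restore the average per-dimension deviation of the inputs.
            target = np.mean([s.std(axis=0) for s in streams], axis=0)
            mu = avg.mean(axis=0)
            return mu + (avg - mu) * target / (avg.std(axis=0) + 1e-8)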

  • Multichannel Two-Stage Beamforming with Unconstrained Beamformer and Distortion Reduction

    Masahito TOGAMI  Yohei KAWAGUCHI  Yasunari OBUCHI  

     
    PAPER-Engineering Acoustics

      Vol:
    E96-A No:4
      Page(s):
    749-761

    This paper proposes a novel multichannel speech enhancement technique for reverberant rooms that is effective when noise sources are spatially stationary, such as projector fans, air conditioners, and unwanted speech sources behind the microphones. The speech enhancement performance of the conventional multichannel Wiener filter (MWF) degrades when the signal-to-noise ratio (SNR) of the current microphone input differs from that of the noise-only period. Furthermore, the MWF structure is computationally inefficient, because the MWF periodically updates the whole spatial beamformer to track switching of the speakers (e.g., turn-taking). In contrast to the MWF, the proposed method reduces noise independently of the SNR. The proposed method has a novel two-stage structure that reduces noise and distortion of the desired source signal in a cascaded manner, using two different beamformers. The first beamformer focuses on noise reduction without any constraint on the desired source, which makes it insensitive to SNR variation; however, its output signal is distorted. The second beamformer focuses on reducing the distortion of the desired source signal, and complete elimination of the distortion is theoretically assured. Additionally, the proposed method has a computationally efficient structure optimized for spatially stationary noise reduction: the first beamformer is updated only when the speech enhancement system is initialized, and only the second beamformer is updated periodically to track switching of the active speaker. The experimental results indicate that the proposed method can effectively reduce spatially stationary noise signals with little distortion of the desired source signal, even in a reverberant conference room.
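
    A rough Python sketch of the two-stage structure at a single frequency bin follows. The stage-1 filter here (rejection of the dominant noise eigenvector) and the MVDR-style stage-2 filter are illustrative stand-ins, not the paper's exact beamformers; Rn would be estimated once at initialization, as the abstract describes.

        import numpy as np

        def two_stage_beamform(X, Rn, steering):
            """X: (mics, frames) STFT frames of one frequency bin.
            Rn: (mics, mics) noise spatial covariance, estimated at start-up.
            steering: (mics,) steering vector of the desired source."""
            # Stage 1: suppress the dominant noise direction with no
            # constraint on the desired source, so the output is distorted.
            w, V = np.linalg.eigh(Rn)
            noise_basis = V[:, -1:]                     # strongest noise direction
            P = np.eye(Rn.shape[0]) - noise_basis @ noise_basis.conj().T
            Y1 = P @ X
            # Stage 2: distortionless (MVDR-style) filter on the stage-1
            # output restores the desired source.
            d = P @ steering
            Rx = Y1 @ Y1.conj().T / Y1.shape[1] + 1e-6 * np.eye(Rn.shape[0])
            w2 = np.linalg.solve(Rx, d)
            w2 = w2 / (d.conj() @ w2)                   # enforce w2^H d = 1
            return w2.conj() @ Y1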

  • In-Vehicle Voice Interface with Improved Utterance Classification Accuracy Using Off-the-Shelf Cloud Speech Recognizer

    Takeshi HOMMA  Yasunari OBUCHI  Kazuaki SHIMA  Rintaro IKESHITA  Hiroaki KOKUBO  Takuya MATSUMOTO  

     
    PAPER-Speech and Hearing

      Publicized:
    2018/08/31
      Vol:
    E101-D No:12
      Page(s):
    3123-3137

    For voice-enabled car navigation systems that use a multi-purpose cloud speech recognition service (cloud ASR), utterance classification that is robust against speech recognition errors is needed to realize a user-friendly voice interface. The purpose of this study is to improve the accuracy of utterance classification for such systems when the inputs to the classifier are error-prone speech recognition results obtained from a cloud ASR. The role of utterance classification is to predict, from a spontaneous utterance, which car navigation function the user wants to execute. A cloud ASR produces speech recognition errors due to the noise that occurs while traveling in a car, and these errors degrade the accuracy of utterance classification. There are many methods for reducing speech recognition errors by modifying the internals of a speech recognizer, but application developers cannot apply them to cloud ASRs because they cannot customize the ASRs. In this paper, we propose a system that improves the accuracy of utterance classification by modifying both the speech-signal inputs to a cloud ASR and the recognized-sentence outputs from it. First, our system performs speech enhancement on the user's utterance and then sends both the enhanced and non-enhanced speech signals to a cloud ASR; the recognition results from the two signals are merged to reduce the number of recognition errors. Second, to reduce the number of utterance classification errors, we propose a data augmentation method, which we call “optimal doping,” in which not only accurate transcriptions but also error-prone recognized sentences are added to the training data. An evaluation with real user utterances spoken to car navigation products showed that our system reduces the number of utterance classification errors by 54% from a baseline condition. Finally, we propose a semi-automatic upgrading approach that lets classifiers benefit from the improved performance of cloud ASRs.
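
    The two ideas can be sketched as follows; cloud_asr(), enhance(), and the confidence-based merge are hypothetical placeholders rather than any real cloud ASR API, and the doping ratio is an assumption.

        # Hedged sketch of the two ideas in the abstract. cloud_asr() and
        # enhance() are hypothetical placeholders, not a real cloud ASR API.

        def recognize_merged(audio, cloud_asr, enhance):
            """Send both enhanced and non-enhanced audio to the cloud ASR and
            keep the hypothesis with the higher confidence (one simple merge)."""
            hyp_raw = cloud_asr(audio)            # -> (text, confidence)
            hyp_enh = cloud_asr(enhance(audio))
            return max(hyp_raw, hyp_enh, key=lambda h: h[1])[0]

        def doped_training_set(transcripts, asr_outputs, labels, dope_ratio=0.5):
            """'Optimal doping' sketch: mix accurate transcriptions with
            error-prone recognized sentences so the classifier learns the
            error patterns it will face at test time."""
            n = int(len(asr_outputs) * dope_ratio)
            return transcripts + asr_outputs[:n], labels + labels[:n]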

  • Normalization of Time-Derivative Parameters for Robust Speech Recognition in Small Devices

    Yasunari OBUCHI  Nobuo HATAOKA  Richard M. STERN  

     
    PAPER-Speech and Hearing

      Vol:
    E87-D No:4
      Page(s):
    1004-1011

    In this paper we describe a new framework of feature compensation for robust speech recognition that is especially suitable for small devices. We introduce Delta-cepstrum Normalization (DCN), which normalizes not only the cepstral coefficients but also their time-derivatives. Cepstral Mean Normalization (CMN) and Mean and Variance Normalization (MVN) are fast and efficient algorithms for environmental adaptation and have been used widely. In those algorithms, normalization is applied to the cepstral coefficients to remove irrelevant information from them, but it is not applied to the time-derivative parameters because it does not remove enough of the irrelevant information. Histogram Equalization (HEQ), in contrast, provides better compensation and can be applied even to the delta and delta-delta cepstra. We investigate various implementations of DCN and show that the best performance is achieved when the normalizations of the cepstra and the delta cepstra are mutually interdependent. We evaluate the performance of DCN using speech data recorded on a PDA; DCN provides significant improvements over HEQ, giving a 15% relative word error rate reduction. We also examine the possibility of combining Vector Taylor Series (VTS) compensation with DCN. Although some combinations do not improve on VTS, the best combination gives better performance than VTS alone. Finally, the advantage of DCN in terms of computation speed is also discussed.
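
    A minimal sketch of the DCN idea, assuming standard regression deltas and utterance-level mean/variance normalization (the exact interdependent scheme in the paper may differ):

        import numpy as np

        def deltas(c, width=2):
            """Standard regression deltas over a +/-width window (illustrative)."""
            pad = np.pad(c, ((width, width), (0, 0)), mode="edge")
            num = sum(t * (pad[width + t:width + t + len(c)]
                           - pad[width - t:width - t + len(c)])
                      for t in range(1, width + 1))
            return num / (2 * sum(t * t for t in range(1, width + 1)))

        def mvn(x):
            """Mean and variance normalization over the utterance."""
            return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

        def dcn(cepstra):
            c = mvn(cepstra)        # normalize the static cepstra (as in MVN)
            d = mvn(deltas(c))      # deltas of the *normalized* statics,
            dd = mvn(deltas(d))     #   themselves normalized: the DCN idea
            return np.hstack([c, d, dd])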

  • Stepwise Phase Difference Restoration Method for DOA Estimation of Multiple Sources

    Masahito TOGAMI  Yasunari OBUCHI  

     
    PAPER-Engineering Acoustics

      Vol:
    E91-A No:11
      Page(s):
    3269-3281

    We propose a new methodology of DOA (direction of arrival) estimation named SPIRE (Stepwise Phase dIfference REstoration), which can estimate sound source directions even when there is more than one source in a reverberant environment. DOA estimation is difficult in reverberant environments because the variance of the estimated source direction increases, so a long distance between microphones is desirable; however, because of the spatial aliasing problem, the distance cannot exceed half the wavelength of the maximum frequency of a source. The DOA estimation performance of SPIRE is not limited by the spatial aliasing problem. The major feature of SPIRE is restoration of the phase difference of a microphone pair (M1) by using the phase difference of another microphone pair (M2), under the condition that the distance between the M1 microphones is longer than the distance between the M2 microphones. This restoration process reduces the variance of the estimated sound source direction and alleviates the spatial aliasing of the M1 phase difference by using the direction estimated from the M2 microphones. Experimental results in a reverberant environment (reverberation time of about 300 ms) indicate that even when there are multiple sources, the proposed method can estimate source directions more accurately than conventional methods. In addition, the DOA estimation performance of SPIRE with an array length of 0.2 m is shown to be almost equivalent to that of GCC-PHAT with an array length of 0.5 m; that is, SPIRE can execute DOA estimation with a smaller microphone array than GCC-PHAT. Since the array length should be as small as possible from the viewpoints of hardware size and signal coherence, this feature of SPIRE is advantageous.
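
    The restoration step can be sketched for a single time-frequency point as follows; the linear scaling of the short-pair phase by the spacing ratio is a far-field assumption, and the function name is illustrative.

        import numpy as np

        def restore_phase(phi_m1, phi_m2, d1, d2):
            """SPIRE-style restoration at one time-frequency point (a sketch).

            phi_m1: observed phase difference of the long pair M1, possibly
                    spatially aliased, wrapped into (-pi, pi].
            phi_m2: phase difference of the short pair M2, alias-free but
                    higher-variance.
            d1, d2: microphone spacings, with d1 > d2.

            The short pair predicts the unaliased M1 phase; we select the
            2*pi*k branch of phi_m1 closest to that prediction, keeping the
            long pair's low variance without its aliasing ambiguity.
            """
            predicted = phi_m2 * (d1 / d2)
            k = np.round((predicted - phi_m1) / (2.0 * np.pi))
            return phi_m1 + 2.0 * np.pi * k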

  • Intentional Voice Command Detection for Trigger-Free Speech Interface

    Yasunari OBUCHI  Takashi SUMIYOSHI  

     
    PAPER-Robust Speech Recognition

      Vol:
    E93-D No:9
      Page(s):
    2440-2450

    In this paper we introduce a new framework of audio processing that is essential for achieving a trigger-free speech interface for home appliances. Because the speech interface works continually in real environments, it must extract occasional voice commands and reject everything else. It is extremely important to reduce the number of false alarms, because irrelevant inputs far outnumber voice commands even for heavy users of appliances. The framework, called Intentional Voice Command Detection, is based on voice activity detection but enhanced by various speech/audio processing techniques such as emotion recognition. The effectiveness of the proposed framework is evaluated using a newly collected large-scale corpus. The advantages of combining various features were tested and confirmed, and a simple LDA-based classifier demonstrated acceptable performance. The effectiveness of various methods of user adaptation is also discussed.
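
    As a hedged illustration of the decision stage, a linear discriminant classifier over per-utterance feature vectors might look as follows; the scikit-learn pipeline, feature layout, and threshold are assumptions for illustration, not the paper's implementation.

        import numpy as np
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

        def train_detector(features, is_command):
            """features: (utterances, dims) vectors combining VAD, prosodic,
            and other scores; is_command: 0/1 labels."""
            clf = LinearDiscriminantAnalysis()
            clf.fit(features, is_command)
            return clf

        def detect(clf, feature_vector, threshold=0.9):
            # A high threshold keeps false alarms rare, since irrelevant
            # inputs far outnumber genuine commands (value is an assumption).
            return clf.predict_proba(feature_vector[None, :])[0, 1] >= threshold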
