Posteriori Restoration of Turn-Taking and ASR Results for Incorrectly Segmented Utterances

Kazunori KOMATANI; Naoki HOTTA; Satoshi SATO; Mikio NAKANO

doi:10.1587/transinf.2015EDP7014


Summary:

Appropriate turn-taking is as important in spoken dialogue systems as generating correct responses. Especially when the dialogue features quick responses, voice activity detection (VAD) often segments a user utterance incorrectly because of short pauses within it. Incorrectly segmented utterances cause problems in both the automatic speech recognition (ASR) results and turn-taking: an incorrect VAD result leads to ASR errors and causes the system to start responding while the user is still speaking. We develop a method that performs a posteriori restoration of incorrectly segmented utterances and implement it as a plug-in for the MMDAgent open-source software. A crucial part of the method is classifying whether restoration is required. We cast this as a binary classification problem of detecting originally single utterances from pairs of utterance fragments. Various features representing timing, prosody, and ASR result information are used. Experiments show that the proposed method outperformed a baseline with manually selected features by 4.8% and 3.9% in cross-domain evaluations with two domains. More detailed analysis revealed that the dominant, domain-independent features were utterance intervals and results from the Gaussian mixture model (GMM).
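
The pair-wise formulation in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Fragment` structure, the function names, and the 0.8-second threshold are all hypothetical, and a single timing feature (the utterance interval) with a fixed threshold stands in for the trained classifier over timing, prosody, and ASR features. Joining the ASR text of merged fragments is likewise a simplification made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """One VAD segment with its ASR hypothesis (hypothetical structure)."""
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # ASR result for this segment


def utterance_interval(first: Fragment, second: Fragment) -> float:
    """Silence between the end of one fragment and the start of the next."""
    return second.start - first.end


def needs_restoration(first: Fragment, second: Fragment,
                      max_interval: float = 0.8) -> bool:
    """Stand-in binary classifier for a pair of fragments.

    Returns True when the pair is judged to be one originally single
    utterance. The 0.8 s threshold is an arbitrary placeholder, not a
    value taken from the paper.
    """
    return utterance_interval(first, second) <= max_interval


def restore(fragments: List[Fragment]) -> List[Fragment]:
    """Merge consecutive fragments that the classifier judges to be one utterance."""
    merged: List[Fragment] = []
    for frag in fragments:
        if merged and needs_restoration(merged[-1], frag):
            prev = merged[-1]
            # Simplification: concatenate the ASR text of both fragments.
            merged[-1] = Fragment(prev.start, frag.end, prev.text + " " + frag.text)
        else:
            merged.append(frag)
    return merged


if __name__ == "__main__":
    fragments = [
        Fragment(0.0, 1.0, "I want to go"),
        Fragment(1.3, 2.0, "to Kyoto"),   # short pause: same utterance
        Fragment(4.0, 5.0, "tomorrow"),   # long pause: new utterance
    ]
    for utt in restore(fragments):
        print(utt)
```

In a real system the threshold rule would be replaced by a classifier trained on labeled fragment pairs, with the interval as one feature among several.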

Publication: IEICE TRANSACTIONS on Information Vol.E98-D No.11 pp.1923-1931

Publication Date: 2015/11/01

Publicized: 2015/07/24

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2015EDP7014

Type of Manuscript: PAPER

Category: Speech and Hearing

Authors

Kazunori KOMATANI
  Osaka University
Naoki HOTTA
  Nagoya University
Satoshi SATO
  Nagoya University
Mikio NAKANO
  Honda Research Institute Japan, Co., Ltd.

Keyword

spoken dialogue system, VAD error, turn taking, a posteriori restoration

Cite this

Kazunori KOMATANI, Naoki HOTTA, Satoshi SATO, Mikio NAKANO, "Posteriori Restoration of Turn-Taking and ASR Results for Incorrectly Segmented Utterances" in IEICE TRANSACTIONS on Information, vol. E98-D, no. 11, pp. 1923-1931, November 2015, doi: 10.1587/transinf.2015EDP7014.
URL: https://globals.ieice.org/en_transactions/information/10.1587/transinf.2015EDP7014/_p

@ARTICLE{e98-d_11_1923,
author={Kazunori KOMATANI and Naoki HOTTA and Satoshi SATO and Mikio NAKANO},
journal={IEICE TRANSACTIONS on Information},
title={Posteriori Restoration of Turn-Taking and ASR Results for Incorrectly Segmented Utterances},
year={2015},
volume={E98-D},
number={11},
pages={1923-1931},
abstract={Appropriate turn-taking is important in spoken dialogue systems as well as generating correct responses. Especially if the dialogue features quick responses, a user utterance is often incorrectly segmented due to short pauses within it by voice activity detection (VAD). Incorrectly segmented utterances cause problems both in the automatic speech recognition (ASR) results and turn-taking: i.e., an incorrect VAD result leads to ASR errors and causes the system to start responding though the user is still speaking. We develop a method that performs a posteriori restoration for incorrectly segmented utterances and implement it as a plug-in for the MMDAgent open-source software. A crucial part of the method is to classify whether the restoration is required or not. We cast it as a binary classification problem of detecting originally single utterances from pairs of utterance fragments. Various features are used representing timing, prosody, and ASR result information. Experiments show that the proposed method outperformed a baseline with manually-selected features by 4.8% and 3.9% in cross-domain evaluations with two domains. More detailed analysis revealed that the dominant and domain-independent features were utterance intervals and results from the Gaussian mixture model (GMM).},
keywords={spoken dialogue system, VAD error, turn taking, a posteriori restoration},
doi={10.1587/transinf.2015EDP7014},
ISSN={1745-1361},
month={November}
}

TY  - JOUR
TI  - Posteriori Restoration of Turn-Taking and ASR Results for Incorrectly Segmented Utterances
T2  - IEICE TRANSACTIONS on Information
SP  - 1923
EP  - 1931
AU  - Kazunori KOMATANI
AU  - Naoki HOTTA
AU  - Satoshi SATO
AU  - Mikio NAKANO
PY  - 2015
DO  - 10.1587/transinf.2015EDP7014
JO  - IEICE TRANSACTIONS on Information
SN  - 1745-1361
VL  - E98-D
IS  - 11
JA  - IEICE TRANSACTIONS on Information
Y1  - November 2015
AB  - Appropriate turn-taking is important in spoken dialogue systems as well as generating correct responses. Especially if the dialogue features quick responses, a user utterance is often incorrectly segmented due to short pauses within it by voice activity detection (VAD). Incorrectly segmented utterances cause problems both in the automatic speech recognition (ASR) results and turn-taking: i.e., an incorrect VAD result leads to ASR errors and causes the system to start responding though the user is still speaking. We develop a method that performs a posteriori restoration for incorrectly segmented utterances and implement it as a plug-in for the MMDAgent open-source software. A crucial part of the method is to classify whether the restoration is required or not. We cast it as a binary classification problem of detecting originally single utterances from pairs of utterance fragments. Various features are used representing timing, prosody, and ASR result information. Experiments show that the proposed method outperformed a baseline with manually-selected features by 4.8% and 3.9% in cross-domain evaluations with two domains. More detailed analysis revealed that the dominant and domain-independent features were utterance intervals and results from the Gaussian mixture model (GMM).
ER  - 