In spoken dialogue systems, appropriate turn-taking is as important as generating correct responses. Especially when the dialogue features quick responses, voice activity detection (VAD) often segments a user utterance incorrectly because of short pauses within it. Incorrectly segmented utterances cause problems in both automatic speech recognition (ASR) results and turn-taking: an incorrect VAD result leads to ASR errors and causes the system to start responding while the user is still speaking. We develop a method that performs a posteriori restoration of incorrectly segmented utterances and implement it as a plug-in for the MMDAgent open-source software. A crucial part of the method is deciding whether restoration is required. We cast this as a binary classification problem: detecting, from pairs of utterance fragments, those that were originally a single utterance. The features used represent timing, prosody, and ASR result information. Experiments show that the proposed method outperformed a baseline with manually selected features by 4.8% and 3.9% in cross-domain evaluations with two domains. More detailed analysis revealed that the dominant, domain-independent features were utterance intervals and results from the Gaussian mixture model (GMM).
Kazunori KOMATANI
Osaka University
Naoki HOTTA
Nagoya University
Satoshi SATO
Nagoya University
Mikio NAKANO
Honda Research Institute Japan, Co., Ltd.
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Kazunori KOMATANI, Naoki HOTTA, Satoshi SATO, Mikio NAKANO, "Posteriori Restoration of Turn-Taking and ASR Results for Incorrectly Segmented Utterances," in IEICE TRANSACTIONS on Information, vol. E98-D, no. 11, pp. 1923-1931, November 2015, doi: 10.1587/transinf.2015EDP7014.
URL: https://globals.ieice.org/en_transactions/information/10.1587/transinf.2015EDP7014/_p
@ARTICLE{e98-d_11_1923,
author={Kazunori KOMATANI and Naoki HOTTA and Satoshi SATO and Mikio NAKANO},
journal={IEICE TRANSACTIONS on Information},
title={Posteriori Restoration of Turn-Taking and ASR Results for Incorrectly Segmented Utterances},
year={2015},
volume={E98-D},
number={11},
pages={1923--1931},
keywords={},
doi={10.1587/transinf.2015EDP7014},
ISSN={1745-1361},
month={November}
}
TY - JOUR
TI - Posteriori Restoration of Turn-Taking and ASR Results for Incorrectly Segmented Utterances
T2 - IEICE TRANSACTIONS on Information
SP - 1923
EP - 1931
AU - Kazunori KOMATANI
AU - Naoki HOTTA
AU - Satoshi SATO
AU - Mikio NAKANO
PY - 2015
DO - 10.1587/transinf.2015EDP7014
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E98-D
IS - 11
JA - IEICE TRANSACTIONS on Information
Y1 - November 2015
ER -