This paper describes a new context clustering technique for average voice model, which is a set of speaker independent speech synthesis units. In the technique, we first train speaker dependent models using multi-speaker speech database, and then construct a decision tree common to these speaker dependent models for context clustering. When a node of the decision tree is split, only the context related questions which are applicable to all speaker dependent models are adopted. As a result, every node of the decision tree always has training data of all speakers. After construction of the decision tree, all speaker dependent models are clustered using the common decision tree and a speaker independent model, i.e., an average voice model is obtained by combining speaker dependent models. From the results of subjective tests, we show that the average voice models trained using the proposed technique can generate more natural sounding speech than the conventional average voice models.
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Copy
Junichi YAMAGISHI, Masatsune TAMURA, Takashi MASUKO, Keiichi TOKUDA, Takao KOBAYASHI, "A Context Clustering Technique for Average Voice Models" in IEICE TRANSACTIONS on Information,
vol. E86-D, no. 3, pp. 534-542, March 2003, doi: .
Abstract: This paper describes a new context clustering technique for average voice model, which is a set of speaker independent speech synthesis units. In the technique, we first train speaker dependent models using multi-speaker speech database, and then construct a decision tree common to these speaker dependent models for context clustering. When a node of the decision tree is split, only the context related questions which are applicable to all speaker dependent models are adopted. As a result, every node of the decision tree always has training data of all speakers. After construction of the decision tree, all speaker dependent models are clustered using the common decision tree and a speaker independent model, i.e., an average voice model is obtained by combining speaker dependent models. From the results of subjective tests, we show that the average voice models trained using the proposed technique can generate more natural sounding speech than the conventional average voice models.
URL: https://globals.ieice.org/en_transactions/information/10.1587/e86-d_3_534/_p
Copy
@ARTICLE{e86-d_3_534,
author={Junichi YAMAGISHI, Masatsune TAMURA, Takashi MASUKO, Keiichi TOKUDA, Takao KOBAYASHI, },
journal={IEICE TRANSACTIONS on Information},
title={A Context Clustering Technique for Average Voice Models},
year={2003},
volume={E86-D},
number={3},
pages={534-542},
abstract={This paper describes a new context clustering technique for average voice model, which is a set of speaker independent speech synthesis units. In the technique, we first train speaker dependent models using multi-speaker speech database, and then construct a decision tree common to these speaker dependent models for context clustering. When a node of the decision tree is split, only the context related questions which are applicable to all speaker dependent models are adopted. As a result, every node of the decision tree always has training data of all speakers. After construction of the decision tree, all speaker dependent models are clustered using the common decision tree and a speaker independent model, i.e., an average voice model is obtained by combining speaker dependent models. From the results of subjective tests, we show that the average voice models trained using the proposed technique can generate more natural sounding speech than the conventional average voice models.},
keywords={},
doi={},
ISSN={},
month={March},}
Copy
TY - JOUR
TI - A Context Clustering Technique for Average Voice Models
T2 - IEICE TRANSACTIONS on Information
SP - 534
EP - 542
AU - Junichi YAMAGISHI
AU - Masatsune TAMURA
AU - Takashi MASUKO
AU - Keiichi TOKUDA
AU - Takao KOBAYASHI
PY - 2003
DO -
JO - IEICE TRANSACTIONS on Information
SN -
VL - E86-D
IS - 3
JA - IEICE TRANSACTIONS on Information
Y1 - March 2003
AB - This paper describes a new context clustering technique for average voice model, which is a set of speaker independent speech synthesis units. In the technique, we first train speaker dependent models using multi-speaker speech database, and then construct a decision tree common to these speaker dependent models for context clustering. When a node of the decision tree is split, only the context related questions which are applicable to all speaker dependent models are adopted. As a result, every node of the decision tree always has training data of all speakers. After construction of the decision tree, all speaker dependent models are clustered using the common decision tree and a speaker independent model, i.e., an average voice model is obtained by combining speaker dependent models. From the results of subjective tests, we show that the average voice models trained using the proposed technique can generate more natural sounding speech than the conventional average voice models.
ER -