This paper describes a new method for training an average voice model for speech synthesis, in which an arbitrary speaker's voice is generated through speaker adaptation. When the amount of training data is limited, the distributions of the average voice model are often biased by speaker and/or gender, which degrades the quality of the synthetic speech. To reduce this speaker dependence, the proposed method incorporates a context clustering technique called shared decision tree context clustering, together with speaker adaptive training, into the training procedure of the average voice model. Subjective tests show that the average voice model trained with the proposed method generates more natural-sounding speech than the conventional average voice model. Moreover, the voice characteristics and prosodic features of synthetic speech generated from the model adapted with the proposed method are closer to those of the target speaker than with the conventional method.
Junichi YAMAGISHI, Masatsune TAMURA, Takashi MASUKO, Keiichi TOKUDA, Takao KOBAYASHI, "A Training Method of Average Voice Model for HMM-Based Speech Synthesis," in IEICE TRANSACTIONS on Fundamentals, vol. E86-A, no. 8, pp. 1956-1963, August 2003.
URL: https://globals.ieice.org/en_transactions/fundamentals/10.1587/e86-a_8_1956/_p
@ARTICLE{e86-a_8_1956,
author={Junichi YAMAGISHI and Masatsune TAMURA and Takashi MASUKO and Keiichi TOKUDA and Takao KOBAYASHI},
journal={IEICE TRANSACTIONS on Fundamentals},
title={A Training Method of Average Voice Model for HMM-Based Speech Synthesis},
year={2003},
volume={E86-A},
number={8},
pages={1956--1963},
abstract={This paper describes a new method for training an average voice model for speech synthesis, in which an arbitrary speaker's voice is generated through speaker adaptation. When the amount of training data is limited, the distributions of the average voice model are often biased by speaker and/or gender, which degrades the quality of the synthetic speech. To reduce this speaker dependence, the proposed method incorporates a context clustering technique called shared decision tree context clustering, together with speaker adaptive training, into the training procedure of the average voice model. Subjective tests show that the average voice model trained with the proposed method generates more natural-sounding speech than the conventional average voice model. Moreover, the voice characteristics and prosodic features of synthetic speech generated from the model adapted with the proposed method are closer to those of the target speaker than with the conventional method.},
month={August},}
TY - JOUR
TI - A Training Method of Average Voice Model for HMM-Based Speech Synthesis
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 1956
EP - 1963
AU - Junichi YAMAGISHI
AU - Masatsune TAMURA
AU - Takashi MASUKO
AU - Keiichi TOKUDA
AU - Takao KOBAYASHI
PY - 2003
JO - IEICE TRANSACTIONS on Fundamentals
VL - E86-A
IS - 8
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - August 2003
AB - This paper describes a new method for training an average voice model for speech synthesis, in which an arbitrary speaker's voice is generated through speaker adaptation. When the amount of training data is limited, the distributions of the average voice model are often biased by speaker and/or gender, which degrades the quality of the synthetic speech. To reduce this speaker dependence, the proposed method incorporates a context clustering technique called shared decision tree context clustering, together with speaker adaptive training, into the training procedure of the average voice model. Subjective tests show that the average voice model trained with the proposed method generates more natural-sounding speech than the conventional average voice model. Moreover, the voice characteristics and prosodic features of synthetic speech generated from the model adapted with the proposed method are closer to those of the target speaker than with the conventional method.
ER -