Speech Segment Selection for Concatenative Synthesis Based on Spectral Distortion Minimization

Naoto IWAHASHI; Nobuyoshi KAIKI; Yoshinori SAGISAKA

Summary:

This paper proposes a new scheme for concatenative speech synthesis to improve the speech segment selection procedure. The proposed scheme selects a segment sequence for concatenation by minimizing acoustic distortions between the selected segment and the desired spectrum for the target without the use of heuristics. Four types of distortion, a) the spectral prototypicality of a segment, b) the spectral difference between the source and target contexts, c) the degradation resulting from concatenation of phonemes, and d) the acoustic discontinuity between the concatenated segments, are formulated as acoustic quantities, and used as measures for minimization. A search method for selecting segments from a large speech database is also described. In this method, a three-step optimization using dynamic programming is used to minimize the four types of distortion. A perceptual test shows that this proposed segment selection method with minimum distortion criteria produces high-quality synthesized speech, and that contextual spectral difference and acoustic discontinuity at the segment boundary are important measures for improving the quality.
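
As a rough illustration of segment selection by dynamic programming, the Python sketch below picks one candidate segment per target position so that the summed distortion is minimal. It is not the paper's three-step procedure and does not reproduce its distortion formulas: it uses a single Viterbi-style pass with only two simplified costs, a match cost between a candidate and the desired target spectrum and a discontinuity cost at each segment boundary (loosely corresponding to distortion d). The `Segment` fields, the Euclidean `distance`, and the `w_join` weight are assumptions made for this example.

```python
import math
from typing import List, NamedTuple, Sequence, Tuple

Vector = Tuple[float, ...]


class Segment(NamedTuple):
    """Hypothetical candidate segment (field names are illustrative)."""
    name: str
    spectrum: Vector  # representative spectral vector of the segment
    head: Vector      # spectral vector at the left boundary
    tail: Vector      # spectral vector at the right boundary


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance, standing in for a spectral distortion measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_segments(
    targets: Sequence[Vector],
    candidates: Sequence[Sequence[Segment]],
    w_join: float = 1.0,
) -> Tuple[List[Segment], float]:
    """Return the minimum-cost candidate sequence and its total cost."""
    # best[i][j]: lowest cumulative cost of any path ending at candidates[i][j]
    best = [[distance(c.spectrum, targets[0]) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row_cost, row_back = [], []
        for cand in candidates[i]:
            match_cost = distance(cand.spectrum, targets[i])
            # Cost of arriving from each predecessor, including the
            # boundary discontinuity between its tail and this head.
            arrive = [
                best[i - 1][k] + w_join * distance(prev.tail, cand.head)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_best = min(range(len(arrive)), key=arrive.__getitem__)
            row_cost.append(arrive[k_best] + match_cost)
            row_back.append(k_best)
        best.append(row_cost)
        back.append(row_back)
    # Trace the best path back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path: List[Segment] = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total
```

With a nonzero `w_join`, a candidate that matches its target slightly worse can still be chosen if it joins more smoothly with its neighbor, which is the trade-off the boundary-discontinuity measure is meant to capture.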

Publication: IEICE TRANSACTIONS on Fundamentals Vol.E76-A No.11 pp.1942-1948

Publication Date: 1993/11/25

Type of Manuscript: Special Section PAPER (Special Section on Speech Synthesis: Current Technologies and Their Application)

Naoto IWAHASHI, Nobuyoshi KAIKI, Yoshinori SAGISAKA, "Speech Segment Selection for Concatenative Synthesis Based on Spectral Distortion Minimization" in IEICE TRANSACTIONS on Fundamentals, vol. E76-A, no. 11, pp. 1942-1948, November 1993.
URL: https://globals.ieice.org/en_transactions/fundamentals/10.1587/e76-a_11_1942/_p

@ARTICLE{e76-a_11_1942,
author={Naoto IWAHASHI and Nobuyoshi KAIKI and Yoshinori SAGISAKA},
journal={IEICE TRANSACTIONS on Fundamentals},
title={Speech Segment Selection for Concatenative Synthesis Based on Spectral Distortion Minimization},
year={1993},
volume={E76-A},
number={11},
pages={1942--1948},
abstract={This paper proposes a new scheme for concatenative speech synthesis to improve the speech segment selection procedure. The proposed scheme selects a segment sequence for concatenation by minimizing acoustic distortions between the selected segment and the desired spectrum for the target without the use of heuristics. Four types of distortion, a) the spectral prototypicality of a segment, b) the spectral difference between the source and target contexts, c) the degradation resulting from concatenation of phonemes, and d) the acoustic discontinuity between the concatenated segments, are formulated as acoustic quantities, and used as measures for minimization. A search method for selecting segments from a large speech database is also described. In this method, a three-step optimization using dynamic programming is used to minimize the four types of distortion. A perceptual test shows that this proposed segment selection method with minimum distortion criteria produces high-quality synthesized speech, and that contextual spectral difference and acoustic discontinuity at the segment boundary are important measures for improving the quality.},
month={November}
}

TY - JOUR
TI - Speech Segment Selection for Concatenative Synthesis Based on Spectral Distortion Minimization
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 1942
EP - 1948
AU - Naoto IWAHASHI
AU - Nobuyoshi KAIKI
AU - Yoshinori SAGISAKA
PY - 1993
JO - IEICE TRANSACTIONS on Fundamentals
VL - E76-A
IS - 11
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - November 1993
AB - This paper proposes a new scheme for concatenative speech synthesis to improve the speech segment selection procedure. The proposed scheme selects a segment sequence for concatenation by minimizing acoustic distortions between the selected segment and the desired spectrum for the target without the use of heuristics. Four types of distortion, a) the spectral prototypicality of a segment, b) the spectral difference between the source and target contexts, c) the degradation resulting from concatenation of phonemes, and d) the acoustic discontinuity between the concatenated segments, are formulated as acoustic quantities, and used as measures for minimization. A search method for selecting segments from a large speech database is also described. In this method, a three-step optimization using dynamic programming is used to minimize the four types of distortion. A perceptual test shows that this proposed segment selection method with minimum distortion criteria produces high-quality synthesized speech, and that contextual spectral difference and acoustic discontinuity at the segment boundary are important measures for improving the quality.
ER -