Cross-Dialectal Voice Conversion with Neural Networks

Weixun GAO; Qiying CAO; Yao QIAN

doi:10.1587/transinf.2014EDP7116

Summary:

In this paper, we use neural networks (NNs) for cross-dialectal (Mandarin-Shanghainese) voice conversion using a bi-dialectal speaker's recordings. The system employs a nonlinear mapping function, trained on parallel Mandarin features of the source and target speakers, to convert the source speaker's Shanghainese features to those of the target speaker. This study investigates three training aspects: a) frequency warping, which is assumed to be language independent; b) pre-training, which drives the weights to a better starting point than random initialization or can be regarded as unsupervised feature learning; and c) sequence training, which minimizes sequence-level errors and matches the objectives used in training and conversion. Experimental results show that the performance of cross-dialectal voice conversion is close to that of intra-dialectal conversion. This benefit likely comes from the strong learning capabilities of NNs, e.g., exploiting feature correlations between fundamental frequency (F0) and spectrum. Both objective measures, log spectral distortion (LSD) and root mean squared error (RMSE) of F0, show that pre-training and sequence training outperform frame-level mean squared error (MSE) training. The naturalness of the converted Shanghainese speech and the similarity between the converted Shanghainese speech and the target Mandarin speech are significantly improved.
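
The two objective measures named in the summary can be illustrated with a minimal sketch. The summary does not give the paper's exact definitions, so the formulas below are common conventions assumed for illustration: LSD is taken as the per-frame root mean square difference between log power spectra in dB, averaged over frames, and the F0 RMSE is computed in Hz over frames that are voiced in both contours. The function names, the `eps` floor, and the use of 0 to mark unvoiced frames are all hypothetical choices, not taken from the paper.

```python
import numpy as np


def log_spectral_distortion(ref_power, conv_power, eps=1e-10):
    """Average per-frame RMS difference (dB) between two log power spectra.

    Both inputs have shape (frames, bins) and hold power spectra that are
    already time-aligned. `eps` is an assumed floor that avoids log(0).
    """
    ref_db = 10.0 * np.log10(np.maximum(ref_power, eps))
    conv_db = 10.0 * np.log10(np.maximum(conv_power, eps))
    per_frame = np.sqrt(np.mean((ref_db - conv_db) ** 2, axis=1))
    return float(np.mean(per_frame))


def f0_rmse(ref_f0, conv_f0):
    """RMSE of F0 (Hz) over frames voiced in both contours.

    Assumption: a value of 0 marks an unvoiced frame.
    """
    ref_f0 = np.asarray(ref_f0, dtype=float)
    conv_f0 = np.asarray(conv_f0, dtype=float)
    voiced = (ref_f0 > 0) & (conv_f0 > 0)
    if not np.any(voiced):
        raise ValueError("no frames are voiced in both contours")
    diff = ref_f0[voiced] - conv_f0[voiced]
    return float(np.sqrt(np.mean(diff ** 2)))
```

Under these assumptions, lower values of either measure indicate converted speech that is closer to the target. Frames where only one contour is voiced are ignored by `f0_rmse`, so voicing errors would need a separate measure.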

Publication: IEICE TRANSACTIONS on Information Vol.E97-D No.11 pp.2872-2880

Publication Date: 2014/11/01

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2014EDP7116

Type of Manuscript: PAPER

Category: Speech and Hearing

Authors

Weixun GAO
  Donghua University
Qiying CAO
  Donghua University
Yao QIAN
  Speech Group of Microsoft Research Asia

Keyword

voice conversion, neural network, cross-dialectal, frequency warping, pre-training, sequence training

Cite this

Weixun GAO, Qiying CAO, Yao QIAN, "Cross-Dialectal Voice Conversion with Neural Networks" in IEICE TRANSACTIONS on Information, vol. E97-D, no. 11, pp. 2872-2880, November 2014, doi: 10.1587/transinf.2014EDP7116.
URL: https://globals.ieice.org/en_transactions/information/10.1587/transinf.2014EDP7116/_p

@ARTICLE{e97-d_11_2872,
author={Weixun GAO and Qiying CAO and Yao QIAN},
journal={IEICE TRANSACTIONS on Information},
title={Cross-Dialectal Voice Conversion with Neural Networks},
year={2014},
volume={E97-D},
number={11},
pages={2872-2880},
abstract={In this paper, we use neural networks (NNs) for cross-dialectal (Mandarin-Shanghainese) voice conversion using a bi-dialectal speaker's recordings. The system employs a nonlinear mapping function, trained on parallel Mandarin features of the source and target speakers, to convert the source speaker's Shanghainese features to those of the target speaker. This study investigates three training aspects: a) frequency warping, which is assumed to be language independent; b) pre-training, which drives the weights to a better starting point than random initialization or can be regarded as unsupervised feature learning; and c) sequence training, which minimizes sequence-level errors and matches the objectives used in training and conversion. Experimental results show that the performance of cross-dialectal voice conversion is close to that of intra-dialectal conversion. This benefit likely comes from the strong learning capabilities of NNs, e.g., exploiting feature correlations between fundamental frequency (F0) and spectrum. Both objective measures, log spectral distortion (LSD) and root mean squared error (RMSE) of F0, show that pre-training and sequence training outperform frame-level mean squared error (MSE) training. The naturalness of the converted Shanghainese speech and the similarity between the converted Shanghainese speech and the target Mandarin speech are significantly improved.},
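keywords={voice conversion, neural network, cross-dialectal, frequency warping, pre-training, sequence training},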
doi={10.1587/transinf.2014EDP7116},
ISSN={1745-1361},
month={November},}

TY - JOUR
TI - Cross-Dialectal Voice Conversion with Neural Networks
T2 - IEICE TRANSACTIONS on Information
SP - 2872
EP - 2880
AU - Weixun GAO
AU - Qiying CAO
AU - Yao QIAN
PY - 2014
DO - 10.1587/transinf.2014EDP7116
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E97-D
IS - 11
JA - IEICE TRANSACTIONS on Information
Y1 - November 2014
AB - In this paper, we use neural networks (NNs) for cross-dialectal (Mandarin-Shanghainese) voice conversion using a bi-dialectal speaker's recordings. The system employs a nonlinear mapping function, trained on parallel Mandarin features of the source and target speakers, to convert the source speaker's Shanghainese features to those of the target speaker. This study investigates three training aspects: a) frequency warping, which is assumed to be language independent; b) pre-training, which drives the weights to a better starting point than random initialization or can be regarded as unsupervised feature learning; and c) sequence training, which minimizes sequence-level errors and matches the objectives used in training and conversion. Experimental results show that the performance of cross-dialectal voice conversion is close to that of intra-dialectal conversion. This benefit likely comes from the strong learning capabilities of NNs, e.g., exploiting feature correlations between fundamental frequency (F0) and spectrum. Both objective measures, log spectral distortion (LSD) and root mean squared error (RMSE) of F0, show that pre-training and sequence training outperform frame-level mean squared error (MSE) training. The naturalness of the converted Shanghainese speech and the similarity between the converted Shanghainese speech and the target Mandarin speech are significantly improved.
ER -