A speaker-independent algorithm is presented that automatically detects the most steady-state portion of a vowel (the vowel center) in continuous speech. The algorithm first extracts segments, each of which contains a vowel and, if present, pre- and/or post-vocalic liquids and semivowels, and then locates the most steady-state portion of each segment. An advantage of the algorithm is its ability to distinguish nasal segments and intervocalic liquid and semivowel segments without relying on formant frequencies, which have been used in most previous work on vowel segment detection. This results in a computationally simple algorithm. A test on ten sentences spoken by each of two males and two females resulted in a score of 93.2% correct vowel center localization.
Experiments were performed to investigate the perceptual contributions of static and dynamic features of vocal tract characteristics to talker individuality. An ARX (autoregressive with exogenous input) speech production model was used to extract voice source and vocal tract parameters separately from a Japanese sentence, /aoiueoie/ ("Say blue top" in English), uttered by three males. The Discrete Cosine Transform (DCT) was applied to resolve the formant trajectories of the speech signal into static and dynamic components. The perceptual contributions were studied quantitatively by systematically exchanging the corresponding formant components of the sentences among the three talkers. The results show that the static (average) feature of the vocal tract is a primary cue to talker individuality.
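The static/dynamic split described above can be sketched as follows. This is only an illustration under assumptions not stated in the abstract: the zeroth DCT coefficient is treated as the static (average) part and all higher coefficients as the dynamic part, and the function names (`dct2`, `idct2`, `split_static_dynamic`, `swap_static`) are hypothetical.

```python
import math

def dct2(x):
    """Unnormalized DCT-II of a formant trajectory (list of Hz values)."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                for t in range(n)) for k in range(n)]

def idct2(c):
    """Inverse of dct2 (scaled DCT-III)."""
    n = len(c)
    return [c[0] / n + (2.0 / n) * sum(
                c[k] * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                for k in range(1, n)) for t in range(n)]

def split_static_dynamic(traj):
    """Assumed split: coefficient 0 is static, coefficients 1.. are dynamic."""
    c = dct2(traj)
    static = idct2([c[0]] + [0.0] * (len(c) - 1))   # constant at the mean
    dynamic = idct2([0.0] + c[1:])                  # zero-mean movement
    return static, dynamic

def swap_static(traj_a, traj_b):
    """Keep talker A's dynamic component, take talker B's static component."""
    _, dyn_a = split_static_dynamic(traj_a)
    stat_b, _ = split_static_dynamic(traj_b)
    return [s + d for s, d in zip(stat_b, dyn_a)]
```

Calling `swap_static(f1_talker_a, f1_talker_b)` on two equal-length formant tracks gives a trajectory that moves like talker A's but is centered on talker B's average, which is the kind of exchanged stimulus the experiments describe.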
Hiroki MORI Wakana ODAGIRI Hideki KASUYA
Transitional fundamental frequency (F0) characteristics comprise a crucial part of F0 dynamics in singing. This paper examines the F0 characteristics during the note transition period. An analysis of the singing voice of a professional baritone strongly suggests that asymmetries exist in the mechanisms used for controlling rising and falling. Specifically, the F0 contour in rising transitions can be modeled as a step response from a critically-damped second-order linear system with fixed average/maximum speed of change, whereas that in falling transitions can be modeled as a step response from an underdamped second-order linear system with fixed transition time. The validity of the model is examined through auditory experiments using synthesized singing voice.
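The two step-response shapes named above can be sketched with the standard closed-form responses of a second-order linear system. The natural frequency, damping ratio, and note frequencies in the usage note are illustrative assumptions, not values from the paper, and the function names are hypothetical.

```python
import math

def step_response(t, omega, zeta):
    """Unit step response of omega^2 / (s^2 + 2*zeta*omega*s + omega^2),
    valid for 0 < zeta <= 1 (underdamped or critically damped)."""
    if t <= 0.0:
        return 0.0
    if zeta == 1.0:
        # Critically damped: approaches the target without overshoot.
        return 1.0 - (1.0 + omega * t) * math.exp(-omega * t)
    root = math.sqrt(1.0 - zeta * zeta)
    omega_d = omega * root  # damped oscillation frequency
    decay = math.exp(-zeta * omega * t)
    return 1.0 - decay * (math.cos(omega_d * t)
                          + (zeta / root) * math.sin(omega_d * t))

def f0_transition(f_start, f_end, omega, zeta, duration, step=0.005):
    """F0 contour (Hz) sampled every `step` seconds during a note change."""
    n = int(round(duration / step))
    return [f_start + (f_end - f_start) * step_response(i * step, omega, zeta)
            for i in range(n + 1)]
```

For example, `f0_transition(196.0, 220.0, 40.0, 1.0, 0.3)` gives a rising contour that settles without overshoot, while `f0_transition(220.0, 196.0, 40.0, 0.5, 0.3)` gives a falling contour that dips below the target note before settling.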
The paper points out the importance of suitability assessment in speech synthesis applications. Human factors involved in the use of synthetic speech are first discussed on the basis of an example of a newspaper company where synthetic speech is used extensively as an aid for proofreading manuscripts. Some findings obtained from perceptual experiments on subjects' preferences for paralinguistic properties of synthetic speech are then described, focusing primarily on the suitability of pitch characteristics, speaker's gender, and speaking rate in a task where subjects are asked to proofread a printed text while listening to the speech. The paper finally argues for the need for a flexible speech synthesis system that helps users create their own synthetic speech.
Yoshinobu KIKUCHI Satoshi UCHIDA Hideki KASUYA
To achieve high-speed measurement of the acoustic parameters needed for evaluating pathological voice, an integrated voice analyzer (IVA) has been developed using a digital signal processor and a general-purpose microprocessor. By using a personal computer as the controller of the IVA, a versatile system can be constructed for the acoustic evaluation of pathological voice.
Xueming GAO Yoshinobu KIKUCHI Hideki KASUYA
An improved autocorrelation pitch detection algorithm is presented. Preliminary experiments show that the algorithm considerably reduces the errors made by the ordinary autocorrelation pitch detector.
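For orientation, a minimal sketch of the ordinary autocorrelation pitch detector that serves as the baseline is given here. It is not the improved algorithm, whose details the abstract does not give. The search range (60 to 400 Hz) and the function name `autocorr_pitch` are assumptions.

```python
def autocorr_pitch(frame, fs, f_min=60.0, f_max=400.0):
    """Estimate F0 (Hz) of one frame by picking the autocorrelation peak.

    Returns 0.0 when no positive peak is found (e.g. a silent frame)."""
    n = len(frame)
    mean = sum(frame) / n
    x = [v - mean for v in frame]          # remove DC offset
    lag_min = int(fs / f_max)              # shortest period searched
    lag_max = min(int(fs / f_min), n - 1)  # longest period searched
    best_lag, best_val = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(x[i] * x[i + lag] for i in range(n - lag))
        if r > best_val:
            best_lag, best_val = lag, r
    return fs / best_lag if best_lag else 0.0
```

A detector of this kind is prone to errors such as picking a multiple or a fraction of the true pitch period on real speech, which is the class of error an improved method would aim to reduce.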
Wen DING Hideki KASUYA Shuichi ADACHI
A novel adaptive pitch-synchronous analysis method is proposed to estimate vocal tract (formant/antiformant) and voice source parameters simultaneously from speech waveforms. We use the parametric Rosenberg-Klatt (RK) model to generate a glottal waveform and an autoregressive-exogenous (ARX) model to represent the voiced speech production process. The Kalman filter algorithm is used to estimate the formant/antiformant parameters from the coefficients of the ARX model, and the simulated annealing method is employed as a nonlinear optimization approach to estimate the voice source parameters. The two approaches work together in a system identification procedure to find the best set of parameters for both models. The new method has been compared with several other approaches on synthetic speech in terms of the accuracy of the estimated parameter values and has been shown to be superior. We also show that the proposed method can accurately estimate the parameters from natural speech sounds. A major application of the analysis method lies in a concatenative formant synthesizer that allows flexible control of the voice quality of synthetic speech.
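The forward (synthesis) side of this source-filter setup can be sketched as below; the estimation side (Kalman filter and simulated annealing) is not shown. The sketch rests on several assumptions: the glottal pulse is the cubic flow shape commonly associated with the RK model, scaled here to unit peak and without a spectral-tilt parameter; the paper may instead use the flow derivative; the ARX noise term is omitted; and all function names are hypothetical.

```python
import math

def rk_pulse(n_period, oq):
    """One pitch period of a cubic glottal-flow pulse (unit peak).

    oq is the open quotient; the flow is zero during the closed phase."""
    n_open = int(round(oq * n_period))
    pulse = []
    for n in range(n_period):
        if n < n_open:
            x = n / n_open
            pulse.append(6.75 * x * x * (1.0 - x))  # peak of 1 at x = 2/3
        else:
            pulse.append(0.0)
    return pulse

def resonator(freq_hz, bw_hz, fs):
    """AR coefficients [a1, a2] of a two-pole formant resonator."""
    r = math.exp(-math.pi * bw_hz / fs)
    return [-2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs), r * r]

def arx_simulate(a, b, u):
    """y[n] = -sum_i a[i-1]*y[n-i] + sum_j b[j]*u[n-j], i >= 1, j >= 0."""
    y = []
    for n in range(len(u)):
        acc = sum(b[j] * u[n - j] for j in range(len(b)) if n - j >= 0)
        acc -= sum(a[i] * y[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        y.append(acc)
    return y
```

For instance, `arx_simulate(resonator(700.0, 100.0, 8000), [1.0], rk_pulse(80, 0.6) * 5)` passes five periods of a 100 Hz source through a single formant at 700 Hz. An analysis method of the kind described works in the opposite direction, searching for the source and filter parameters that best reproduce an observed waveform.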
Chang-Sheng YANG Hideki KASUYA
Three-dimensional vocal tract shapes of a male, a female, and a child subject are measured from magnetic resonance (MR) images taken during sustained phonation of the Japanese vowels /a, i, u, e, o/. Non-uniform dimensional differences among the vocal tract shapes of the subjects are measured quantitatively. The vocal tract area functions of the female and child subjects are normalized to those of the male on the basis of non-uniform and uniform scalings of the vocal tract length, and the results are compared with each other. A comparison is also made between the formant frequencies computed from the area functions normalized by the two different scalings. The comparisons suggest that non-uniformity in the vocal tract dimensions is not essential to the normalization of the five Japanese vowels.
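As a rough illustration of what uniform length scaling does to formant frequencies, the following sketch uses the textbook quarter-wavelength resonances of a uniform tube closed at one end. This is a strong simplification of the measured area functions in the study, and the speed of sound (35000 cm/s), the tract lengths, and the function names are assumptions for illustration only.

```python
def tube_formants(length_cm, n_formants=3, c=35000.0):
    """Resonances (Hz) of a uniform tube closed at the glottis, open at the lips:
    F_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * c / (4.0 * length_cm)
            for k in range(1, n_formants + 1)]

def uniform_length_scale(formants, length_from_cm, length_to_cm):
    """Formants after uniformly stretching the tract to a new length.

    Under uniform scaling every formant is multiplied by the same factor."""
    factor = length_from_cm / length_to_cm
    return [f * factor for f in formants]
```

With these assumptions, `tube_formants(17.5)` gives 500, 1500, and 2500 Hz, and `uniform_length_scale(tube_formants(14.0), 14.0, 17.5)` maps a shorter tract's resonances onto the same values. Non-uniform scaling would instead apply different factors to different regions of the tract, so formants would no longer shift by a single common ratio.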