Ryota KAMINISHI Haruna MIYAMOTO Sayaka SHIOTA Hitoshi KIYA
This study evaluates the effects of several non-learning blind bandwidth extension (BWE) methods on state-of-the-art automatic speaker verification (ASV) systems. Recently, a non-linear bandwidth extension (N-BWE) method has been proposed as a blind, non-learning, and lightweight BWE approach. Other non-learning BWE methods have also been developed in recent years. For ASV evaluations, most data available to train ASV systems is narrowband (NB) telephone speech. Meanwhile, wideband (WB) data have been used to train state-of-the-art ASV systems, such as those based on the i-vector, d-vector, and x-vector. This can cause sampling rate mismatches when all datasets are used. In this paper, we investigate the influence of sampling rate mismatches on x-vector-based ASV systems and how non-learning BWE methods perform against them. The results showed that the N-BWE method improved the equal error rate (EER) of x-vector-based ASV systems when the mismatches were present. We also examined the relationship between objective measurements and EERs. The N-BWE method produced the lowest EERs on both ASV systems and also obtained a lower RMS-LSD value and a higher STOI score.
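The equal error rate (EER) reported in this and most of the following abstracts is the operating point where the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be estimated from trial scores is given below; the function name and the simple threshold sweep are illustrative assumptions, not the evaluation code used by any of the listed papers.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by trying every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # False rejection: a genuine (target) trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: an impostor (nontarget) trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


if __name__ == "__main__":
    targets = [0.9, 0.8, 0.7, 0.4]
    nontargets = [0.6, 0.3, 0.2, 0.1]
    print(equal_error_rate(targets, nontargets))
```

With only a few trials the FAR and FRR curves may not cross exactly, which is why the sketch reports the midpoint at the closest approach.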
Shengyu YAO Ruohua ZHOU Pengyuan ZHANG
This paper proposes a speaker-phonetic i-vector modeling method for text-dependent speaker verification with random digit strings, in which enrollment and test utterances are not of the same phrase. The core of the proposed method is the use of digit alignment information in the i-vector framework. By utilizing forced alignment information, verification scores of the testing trials can be computed as in the fixed-phrase situation, in which the compared speech segments of the enrollment and test utterances have the same phonetic content. Specifically, utterances are segmented into digits, then a unique phonetically-constrained i-vector extractor is applied to obtain a speaker and channel variability representation for every digit segment. Probabilistic linear discriminant analysis (PLDA) and s-norm are subsequently used for channel compensation and score normalization, respectively. The final score is obtained by combining the digit scores, which are computed by scoring individual digit segments of the test utterance against the corresponding ones of the enrollment. Experimental results on Part 3 of the Robust Speaker Recognition (RSR2015) database demonstrate that the proposed approach significantly outperforms GMM-UBM by 52.3% and 53.5% relative in equal error rate (EER) for male and female speakers, respectively.
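The last two steps of the abstract above, score normalization with s-norm and combination of per-digit scores, can be sketched as follows. The s-norm formula is the commonly used symmetric normalization; averaging the digit scores is an assumption made here for illustration, since the abstract does not state how the scores are combined. All function and parameter names are hypothetical.

```python
from statistics import mean, pstdev


def s_norm(raw, enroll_cohort_scores, test_cohort_scores):
    """Symmetric score normalization (s-norm).

    enroll_cohort_scores: scores of the enrollment segment against a cohort.
    test_cohort_scores:   scores of the test segment against the same cohort.
    """
    mu_e, sd_e = mean(enroll_cohort_scores), pstdev(enroll_cohort_scores)
    mu_t, sd_t = mean(test_cohort_scores), pstdev(test_cohort_scores)
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)


def combine_digit_scores(digit_scores):
    """Combine normalized per-digit scores (assumption: plain average)."""
    return mean(digit_scores)


if __name__ == "__main__":
    normalized = s_norm(2.0, [0.0, 2.0], [1.0, 3.0])
    print(normalized)
    print(combine_digit_scores([normalized, 1.5, 1.0]))
```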
Hanxu YOU Wei LI Lianqiang LI Jie ZHU
A text-dependent i-vector extraction scheme and a lexicon-based binary vector (L-vector) representation are proposed to improve the performance of text-dependent speaker verification. The i-vector and L-vector are used to represent the utterances for enrollment and test. An improved cosine distance kernel is constructed by combining the i-vector and L-vector, and is used with a back-end support vector machine (SVM) to distinguish both speaker identity and lexical (or text) diversity. Experiments are conducted on Part 1 and Part 2 of the RSR 2015 corpus. The results indicate that an improvement of up to 30% can be obtained compared with the traditional i-vector baseline.
Yong FENG Qingyu XIONG Weiren SHI
Speaker verification is the task of determining whether two utterances represent the same person. After representing the utterances in the i-vector space, the crucial remaining problem is how to compute the similarity of two i-vectors. Metric learning has provided a viable solution to this problem. Many metric learning algorithms have been proposed so far, but they are usually limited to learning a linear transformation. In this paper, we propose a nonlinear metric learning method, which learns an explicit mapping from the original space to an optimal subspace using a deep Restricted Boltzmann Machine network. The proposed method is evaluated on the NIST SRE 2008 dataset. Owing to its deep learning architecture, the proposed method shows performance superior to that of some state-of-the-art methods in the evaluation.
Santi NURATCH Panuthat BOONPRAMUK Chai WUTIWIWATCHAI
This paper presents a new technique to smooth speech feature vectors for text-independent speaker verification using an adaptive band-pass IIR filter. The filter is designed by considering the probability density of modulation-frequency components of an M-dimensional feature vector. Each dimension of the feature vector is processed and filtered separately. Initial filter parameters, the low and high cut-off frequencies, are first determined by the global mean of the probability densities computed from all feature vectors of a given speech utterance. Then, the cut-off frequencies are adapted over time, i.e., at every frame vector, in both the low-frequency and high-frequency bands, based also on the global mean and the standard deviation of the feature vectors. The filtered feature vectors are used in an SVM-GMM supervector speaker verification system. The NIST Speaker Recognition Evaluation 2006 (SRE06) core test is used for evaluation. Experimental results show that the proposed technique clearly outperforms a baseline system using a conventional RelAtive SpecTrA (RASTA) filter.
Chunyan LIANG Lin YANG Qingwei ZHAO Yonghong YAN
In this letter, we adopt a new factor analysis of neighborhood-preserving embedding (NPE) for speaker verification. NPE aims at preserving the local neighborhood structure of the data and defines a low-dimensional speaker space called the neighborhood-preserving embedding space. We compare the proposed method with the state-of-the-art total variability approach on the telephone-telephone core condition of the NIST 2008 Speaker Recognition Evaluation (SRE) dataset. The experimental results indicate that the proposed NPE method outperforms the total variability approach, providing up to 24% relative improvement.
Yuuji MUKAI Hideki NODA Takashi OSANAI
This paper discusses speaker verification (SV) using Gaussian mixture models (GMMs), where only utterances of enrolled speakers are required. Such an SV system can be realized using artificially generated cohorts instead of real cohorts from speaker databases. This paper presents a rational approach to set GMM parameters for artificial cohorts based on statistics of GMM parameters for real cohorts. Equal error rates for the proposed method are about 10% less than those for the previous method, where GMM parameters for artificial cohorts were set in an ad hoc manner.
Xiang XIAO Xiang ZHANG Haipeng WANG Hongbin SUO Qingwei ZHAO Yonghong YAN
The GMM-UBM framework has proven to be one of the most effective approaches to the automatic speaker verification (ASV) task in recent years. In this letter, we first propose an approximate decision function for the traditional GMM-UBM, from which it is shown that each Gaussian component contributes equally to classification. However, research in speaker perception shows that different speech sound units, each defined by a Gaussian component, contribute differently to speaker verification. This motivates us to emphasize the sound units that discriminate between speakers while de-emphasizing the speech sound units that contain little information for speaker verification. Experiments on the 2006 NIST SRE core task show that the proposed approach outperforms the traditional GMM-UBM approach in classification accuracy.
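For context, the conventional GMM-UBM decision score that the letter above starts from is the average per-frame log-likelihood ratio between the claimed speaker's GMM and the universal background model. The sketch below shows only that standard baseline with diagonal covariances; it does not implement the component weighting proposed in the letter, and the data layout (a list of `(weight, mean, variance)` tuples) is an assumption chosen for brevity.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """Log-likelihood of x under a GMM given as [(weight, mean, var), ...]."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    peak = max(logs)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(value - peak) for value in logs))


def llr_score(frames, speaker_gmm, ubm):
    """Average log-likelihood ratio over all frames of an utterance."""
    total = sum(
        gmm_log_likelihood(x, speaker_gmm) - gmm_log_likelihood(x, ubm)
        for x in frames
    )
    return total / len(frames)


if __name__ == "__main__":
    ubm = [(0.5, [0.0], [1.0]), (0.5, [4.0], [1.0])]
    speaker = [(0.5, [1.0], [1.0]), (0.5, [4.0], [1.0])]
    print(llr_score([[1.0], [1.2]], speaker, ubm))  # frames near the speaker mean
```

A positive score favors the claimed speaker and a negative score favors the background model; the decision threshold is applied to this value.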
Yuuji MUKAI Hideki NODA Michiharu NIIMI Takashi OSANAI
This paper presents a text-independent speaker verification method using Gaussian mixture models (GMMs), where only utterances of enrolled speakers are required. Artificial cohorts are used instead of those from speaker databases, and GMMs for artificial cohorts are generated by changing model parameters of the GMM for a claimed speaker. Equal error rates by the proposed method are about 60% less than those by a conventional method which also uses only utterances of enrolled speakers.
Taichi ASAMI Koji IWANO Sadaoki FURUI
We have previously proposed a noise-robust speaker verification method using fundamental frequency (F0) extracted using the Hough transform. The method also incorporates an automatic stream-weight and decision threshold estimation technique. It has been confirmed that the proposed method is effective for white noise at various SNR conditions. This paper evaluates the proposed method in more practical in-car and elevator-hall noise conditions. The paper first describes the noise-robust F0 extraction method and details of our robust speaker verification method using multi-stream HMMs for integrating the extracted F0 and cepstral features. Details of the automatic stream-weight and threshold estimation method for multi-stream speaker verification framework are also explained. This method simultaneously optimizes stream-weights and a decision threshold by combining the linear discriminant analysis (LDA) and the Adaboost technique. Experiments were conducted using Japanese connected digit speech contaminated by white, in-car, or elevator-hall noise at various SNRs. Experimental results show that the F0 features improve the verification performance in various noisy environments, and that our stream-weight and threshold optimization method effectively estimates control parameters so that FARs and FRRs are adjusted to achieve equal error rates (EERs) under various noisy conditions.
Toshiaki KAMADA Nobuaki MINEMATSU Takashi OSANAI Hisanori MAKINAE Masumi TANIMOTO
In forensic voice telephony speaker verification, we may be requested to identify a speaker in a very noisy environment, unlike the conditions in general research. In a noisy environment, we process speech first by clarifying it. However, the previous study of speaker verification from clarified speech did not yield satisfactory results. In this study, we experimented on speaker verification with clarification of speech in a noisy environment, and we examined the relationship between improving acoustic quality and speaker verification results. Moreover, experiments with realistic noise such as a crime prevention alarm and power supply noise were conducted, and speaker verification accuracy in a realistic environment was examined. We confirmed the validity of speaker verification with clarification of speech in a realistic noisy environment.
Jian LUAN Jie HAO Tomonari KAKINO Akinori KAWAMURA
DTW-based text-dependent speaker verification technology is an effective scheme for protecting personal information in personal electronic products from others. To enhance the performance of a DTW-based system, an impostor database covering all possible passwords is generally required for matching score normalization. However, this becomes impossible in our practical application scenario since users are not restricted in their choice of password. We propose a method to generate pseudo-impostor data by employing an acoustic codebook. Based on the pseudo-impostor data, two normalization algorithms are developed. In addition, a template compression approach based on the codebook is introduced. Some modifications to the conventional DTW global constraints are also made for the compressed template. Combining the normalization and template compression methods, we obtain more than 66% and 35% relative reduction in storage and EER, respectively. We expect that other DTW-based tasks may also benefit from our methods.
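The matching score in a DTW-based system is the accumulated frame distance along the best alignment between a stored template and the test utterance. The sketch below shows plain DTW with an optional band as global constraint (a Sakoe-Chiba style window, used here as a generic example). It does not reproduce the paper's modified constraints, codebook compression, or pseudo-impostor normalization, and the function name and `band` parameter are illustrative.

```python
import math


def dtw_distance(template, test, band=None):
    """Accumulated Euclidean distance along the best DTW alignment.

    template, test: sequences of feature vectors (equal dimensionality).
    band: optional global constraint; cell (i, j) is allowed only when
          |i - j| <= band. It is widened if needed so the end point is reachable.
    """
    n, m = len(template), len(test)
    if band is None:
        band = max(n, m)
    band = max(band, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = math.dist(template[i - 1], test[j - 1])
            # Allowed steps: advance in the template, the test, or both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    a = [[0.0], [1.0], [2.0]]
    b = [[0.0], [1.0], [1.0], [2.0]]  # same shape, one frame repeated
    print(dtw_distance(a, b), dtw_distance(a, b, band=1))
```

In practice the raw distance depends on utterance length and password content, which is why the abstract stresses score normalization before thresholding.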
Javier R. SAETA Javier HERNANDO
The selection of the most representative utterances from a speaker is essential for the correct performance of automatic enrollment in speaker verification. Model quality measures and threshold estimation methods mainly deal with the scarcity of data and the difficulty of obtaining data from impostors in real applications. Conventional methods estimate the quality of the training utterances once the model is created. In such a case, it is not possible to ask the user for more utterances during the training session if necessary; a new training session must be started. This is especially impractical in applications where only one or two enrollment sessions are allowed. In this paper, a new on-line quality method based on a male and a female Universal Background Model (UBM) is introduced. The two models act as a reference for new utterances, showing whether they belong to the same speaker and providing a measure of their quality at the same time. On the other hand, the estimation of the verification threshold is also strongly influenced by the previous selection of the speaker's utterances. In this context, potential outliers, i.e., those client scores which are distant from the mean, could lead to wrong estimates of the client mean and variance. To alleviate this problem, some efficient threshold estimation methods based on removing or weighting scores are proposed here. Before estimating the threshold, the client scores catalogued as outliers are removed, pruned or weighted, improving subsequent estimations. Text-dependent experiments have been carried out by using a telephonic multi-session database in Spanish. The database has been recorded by the authors and has 184 speakers.
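The idea of removing outlying client scores before estimating a speaker-dependent threshold can be illustrated as follows. The abstract does not give the exact rules, so everything here is an assumption made for illustration: the threshold form (mean minus a multiple of the standard deviation), the pruning rule (discard scores more than `k` standard deviations from the mean), and the parameters `alpha` and `k`.

```python
from statistics import mean, pstdev


def prune_outliers(client_scores, k=2.0):
    """Drop client scores farther than k standard deviations from the mean."""
    mu, sd = mean(client_scores), pstdev(client_scores)
    if sd == 0.0:
        return list(client_scores)
    return [s for s in client_scores if abs(s - mu) <= k * sd]


def estimate_threshold(client_scores, alpha=1.0, k=2.0):
    """Threshold from pruned client scores only (no impostor data needed)."""
    kept = prune_outliers(client_scores, k)
    return mean(kept) - alpha * pstdev(kept)


if __name__ == "__main__":
    scores = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, -3.0]  # last score is an outlier
    print(prune_outliers(scores))
    print(estimate_threshold(scores))
```

Without pruning, the single outlier in the example would drag the mean down and inflate the variance, giving a threshold far below the bulk of the genuine scores.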
Mohamed Abdel FATTAH Fuji REN Shingo KUROIWA
In the European Telecommunication Standards Institute (ETSI) Distributed Speech Recognition (DSR) front-end, the distortion added by feature compression on the front-end side increases the variance flooring effect, which in turn increases the identification error rate. The penalty incurred in reducing the bit rate is the degradation in speaker recognition performance. In this paper, we present a nontraditional solution to this problem. To reduce the bit rate, a speech signal is segmented at the client, and the most effective phonemes (determined according to their type and frequency) for speaker recognition are selected and sent to the server. Speaker recognition occurs at the server. Applying this approach to the YOHO corpus, we achieved an identification error rate (ER) of 0.05% using an average segment of 20.4% for a testing utterance in a speaker identification task. We also achieved an equal error rate (EER) of 0.42% using an average segment of 15.1% for a testing utterance in a speaker verification task.
Jan ANGUITA Javier HERNANDO Alberto ABAD
Jacobian Adaptation (JA) has been successfully used in Automatic Speech Recognition (ASR) systems to adapt the acoustic models from the training to the testing noise conditions. In this work we present an improvement of JA for speaker verification, where a specific training noise reference is estimated for each speaker model. The new proposal, which will be referred to as Model-dependent Noise Reference Jacobian Adaptation (MNRJA), has consistently outperformed JA in our speaker verification experiments.
Akio OGIHARA Hitoshi UNNO Akira SHIOZAKI
We propose a method for discriminating synthetic speech using the pitch pattern of the speech signal. By applying the proposed synthetic speech discrimination system as a pre-process before a conventional HMM speaker verification system, we can improve the safety of the conventional speaker verification system against imposture using synthetic speech. The proposed method distinguishes between synthetic speech and natural speech according to the pitch pattern, which is the distribution of values of the normalized short-range autocorrelation function. We performed a user verification experiment and confirmed the validity of the proposed method.
This paper presents a novel design of connected digit patterns to achieve high-accuracy text-prompted speaker verification over a cellular phone network. To reduce the error rate, a phoneme-balanced connected digit pattern for enrollment, and digit-sequence-preserving connected digit patterns for verification (i.e., patterns preserving partial digit sequences of the enrollment pattern) are proposed. In addition, a decision procedure using multiple patterns has been designed to overcome the low quality of cellular phone speech. Experimental results on cellular phone speech showed that the phoneme-balanced patterns for enrollment and digit-sequence-preserving patterns for verification reduced the equal error rate by more than 50% compared to the conventional method using randomly-selected and randomly-reordered digit patterns. The decision procedure reduced the error rate by 60%. In addition, this paper shows that verification patterns depending on the pattern of a preceding utterance reduced the error rate by 10%. Overall, the error rate obtained by the proposed method was 1% for 99% of clients and 95% of impostors.
This paper investigates a new method for creating robust speaker models to cope with inter-session variation of a speaker in a continuous HMM-based speaker verification system. The new method estimates session-independent parameters by decomposing inter-session variations into two distinct parts: session-dependent and session-independent. The parameters of the speaker models are estimated using the speaker adaptive training algorithm in conjunction with the equalization of session-dependent variation. The resultant models capture the session-independent speaker characteristics more reliably than the conventional models, and their discriminative power improves accordingly. Moreover, we have made our models more invariant to handset variations in a public switched telephone network (PSTN) by focusing on session-dependent variation and handset-dependent distortion separately. Text-independent speech data recorded by 20 speakers in seven sessions over 16 months was used to evaluate the new approach. The proposed method reduces the error rate by 15% relative. When compared with the popular cepstral mean normalization, the error rate is reduced by 24% relative when the speaker models are recreated using speech data recorded in four or more sessions.
Hideki NODA Katsuya HARADA Eiji KAWAGUCHI
This paper presents an improved method of speaker verification using the sequential probability ratio test (SPRT), which can treat the correlation between successive feature vectors. The hidden Markov model with the mean field approximation enables us to consider the correlation in the SPRT, i.e., using the mean field of the previous state, probability computation can be carried out as if the input samples were independent of each other.
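The SPRT itself accumulates per-frame log-likelihood ratios and stops as soon as the running sum crosses one of two thresholds derived from the target error rates (Wald's approximation). The sketch below shows only this generic decision loop. It assumes the per-frame log-likelihood ratios are already available, for instance from a model that accounts for inter-frame correlation as in the paper, and does not implement the mean field computation. The names `alpha` (target false acceptance rate) and `beta` (target false rejection rate) are illustrative.

```python
import math


def sprt(llr_stream, alpha=0.01, beta=0.01):
    """Sequential probability ratio test over per-frame log-likelihood ratios.

    Each value is log p(x | claimed speaker) - log p(x | impostor).
    Returns (decision, frames_used).
    """
    upper = math.log((1.0 - beta) / alpha)  # accept the claim at or above this
    lower = math.log(beta / (1.0 - alpha))  # reject the claim at or below this
    total = 0.0
    used = 0
    for used, llr in enumerate(llr_stream, start=1):
        total += llr
        if total >= upper:
            return "accept", used
        if total <= lower:
            return "reject", used
    return "undecided", used  # input ended before either bound was reached


if __name__ == "__main__":
    print(sprt([1.0] * 10))
    print(sprt([-1.0] * 10))
    print(sprt([0.1, -0.1]))
```

Because the test can stop early, the number of frames needed varies per trial; clear-cut trials are decided after a few frames while ambiguous ones consume more input.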