Local Peak Enhancement for In-Car Speech Recognition in Noisy Environment

Osamu ICHIKAWA; Takashi FUKUDA; Masafumi NISHIMURA

doi:10.1093/ietisy/e91-d.3.635

Summary:

The accuracy of automatic speech recognition in a car is significantly degraded in a very low SNR (Signal to Noise Ratio) situation such as "Fan high" or "Window open". In such cases, speech signals are often buried in broadband noise. Although several existing noise reduction algorithms are known to improve the accuracy, other approaches that can work with them are still required for further improvement. One of the candidates is enhancement of the harmonic structures in human voices. However, most conventional approaches are based on comb filtering, and it is difficult to use them in practical situations, because their assumptions for F0 detection and for voiced/unvoiced detection are not accurate enough in realistic noisy environments. In this paper, we propose a new approach that does not rely on such detection. An observed power spectrum is directly converted into a filter for speech enhancement, by retaining only the local peaks considered to be harmonic structures in the human voice. In our experiments, this approach reduced the word error rate by 17% in realistic automobile environments. Also, it showed further improvement when used with existing noise reduction methods.
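The filtering idea in the summary (turning an observed power spectrum into a filter that keeps only its local peaks) can be illustrated with a minimal sketch. The summary does not give the peak-picking rule, the attenuation applied to non-peak bins, or any smoothing, so everything below is an assumption for illustration only: the function names `local_peak_filter` and `enhance`, the strict three-point peak test, and the `floor` gain of 0.1 are hypothetical and are not taken from the paper.

```python
from typing import List


def local_peak_filter(power: List[float], floor: float = 0.1) -> List[float]:
    """Build per-bin gains from a power spectrum.

    Assumption (not from the paper): a bin counts as a local peak when it is
    strictly larger than both neighbours; peaks get gain 1.0 and all other
    bins get the hypothetical attenuation `floor`.
    """
    gains = [floor] * len(power)
    for k in range(1, len(power) - 1):
        if power[k] > power[k - 1] and power[k] > power[k + 1]:
            gains[k] = 1.0
    return gains


def enhance(power: List[float], floor: float = 0.1) -> List[float]:
    """Apply the peak-derived gains to the same spectrum they came from."""
    gains = local_peak_filter(power, floor)
    return [p * g for p, g in zip(power, gains)]


# Toy spectrum with two peaks (bins 1 and 4) standing in for harmonics.
spectrum = [1.0, 4.0, 1.5, 1.2, 5.0, 1.1, 0.9]
gains = local_peak_filter(spectrum)
enhanced = enhance(spectrum)
```

Note that the sketch needs no F0 estimate and no voiced/unvoiced decision: the filter is derived from the observed frame itself, which is the property the summary emphasizes.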

Publication: IEICE TRANSACTIONS on Information Vol.E91-D No.3 pp.635-639

Publication Date: 2008/03/01

Online ISSN: 1745-1361

DOI: 10.1093/ietisy/e91-d.3.635

Type of Manuscript: Special Section LETTER (Special Section on Robust Speech Processing in Realistic Environments)

Osamu ICHIKAWA, Takashi FUKUDA, Masafumi NISHIMURA, "Local Peak Enhancement for In-Car Speech Recognition in Noisy Environment" in IEICE TRANSACTIONS on Information, vol. E91-D, no. 3, pp. 635-639, March 2008, doi: 10.1093/ietisy/e91-d.3.635.
URL: https://globals.ieice.org/en_transactions/information/10.1093/ietisy/e91-d.3.635/_p

@ARTICLE{e91-d_3_635,
author={Osamu ICHIKAWA and Takashi FUKUDA and Masafumi NISHIMURA},
journal={IEICE TRANSACTIONS on Information},
title={Local Peak Enhancement for In-Car Speech Recognition in Noisy Environment},
year={2008},
volume={E91-D},
number={3},
pages={635--639},
abstract={The accuracy of automatic speech recognition in a car is significantly degraded in a very low SNR (Signal to Noise Ratio) situation such as "Fan high" or "Window open". In such cases, speech signals are often buried in broadband noise. Although several existing noise reduction algorithms are known to improve the accuracy, other approaches that can work with them are still required for further improvement. One of the candidates is enhancement of the harmonic structures in human voices. However, most conventional approaches are based on comb filtering, and it is difficult to use them in practical situations, because their assumptions for F0 detection and for voiced/unvoiced detection are not accurate enough in realistic noisy environments. In this paper, we propose a new approach that does not rely on such detection. An observed power spectrum is directly converted into a filter for speech enhancement, by retaining only the local peaks considered to be harmonic structures in the human voice. In our experiments, this approach reduced the word error rate by 17% in realistic automobile environments. Also, it showed further improvement when used with existing noise reduction methods.},
keywords={},
doi={10.1093/ietisy/e91-d.3.635},
ISSN={1745-1361},
month={March}
}

TY  - JOUR
TI  - Local Peak Enhancement for In-Car Speech Recognition in Noisy Environment
T2  - IEICE TRANSACTIONS on Information
SP  - 635
EP  - 639
AU  - Osamu ICHIKAWA
AU  - Takashi FUKUDA
AU  - Masafumi NISHIMURA
PY  - 2008
DO  - 10.1093/ietisy/e91-d.3.635
JO  - IEICE TRANSACTIONS on Information
SN  - 1745-1361
VL  - E91-D
IS  - 3
JA  - IEICE TRANSACTIONS on Information
Y1  - March 2008
AB  - The accuracy of automatic speech recognition in a car is significantly degraded in a very low SNR (Signal to Noise Ratio) situation such as "Fan high" or "Window open". In such cases, speech signals are often buried in broadband noise. Although several existing noise reduction algorithms are known to improve the accuracy, other approaches that can work with them are still required for further improvement. One of the candidates is enhancement of the harmonic structures in human voices. However, most conventional approaches are based on comb filtering, and it is difficult to use them in practical situations, because their assumptions for F0 detection and for voiced/unvoiced detection are not accurate enough in realistic noisy environments. In this paper, we propose a new approach that does not rely on such detection. An observed power spectrum is directly converted into a filter for speech enhancement, by retaining only the local peaks considered to be harmonic structures in the human voice. In our experiments, this approach reduced the word error rate by 17% in realistic automobile environments. Also, it showed further improvement when used with existing noise reduction methods.
ER  -