Spoken Document Retrieval Leveraging Unsupervised and Supervised Topic Modeling Techniques

Kuan-Yu CHEN; Hsin-Min WANG; Berlin CHEN

doi:10.1587/transinf.E95.D.1195


Summary:

This paper describes the application of two attractive categories of topic modeling techniques to the problem of spoken document retrieval (SDR), viz. document topic model (DTM) and word topic model (WTM). Apart from using the conventional unsupervised training strategy, we explore a supervised training strategy for estimating these topic models, imagining a scenario that user query logs along with click-through information of relevant documents can be utilized to build an SDR system. This attempt has the potential to associate relevant documents with queries even if they do not share any of the query words, thereby improving on retrieval quality over the baseline system. Likewise, we also study a novel use of pseudo-supervised training to associate relevant documents with queries through a pseudo-feedback procedure. Moreover, in order to lessen SDR performance degradation caused by imperfect speech recognition, we investigate leveraging different levels of index features for topic modeling, including words, syllable-level units, and their combination. We provide a series of experiments conducted on the TDT (TDT-2 and TDT-3) Chinese SDR collections. The empirical results show that the methods deduced from our proposed modeling framework are very effective when compared with a few existing retrieval approaches.

Publication: IEICE TRANSACTIONS on Information Vol.E95-D No.5 pp.1195-1205

Publication Date: 2012/05/01

Online ISSN: 1745-1361

DOI: 10.1587/transinf.E95.D.1195

Type of Manuscript: Special Section PAPER (Special Section on Recent Advances in Multimedia Signal Processing Techniques and Applications)

Category: Speech Processing

Kuan-Yu CHEN, Hsin-Min WANG, Berlin CHEN, "Spoken Document Retrieval Leveraging Unsupervised and Supervised Topic Modeling Techniques" in IEICE TRANSACTIONS on Information, vol. E95-D, no. 5, pp. 1195-1205, May 2012, doi: 10.1587/transinf.E95.D.1195.
URL: https://globals.ieice.org/en_transactions/information/10.1587/transinf.E95.D.1195/_p

@ARTICLE{e95-d_5_1195,
author={Kuan-Yu CHEN and Hsin-Min WANG and Berlin CHEN},
journal={IEICE TRANSACTIONS on Information},
title={Spoken Document Retrieval Leveraging Unsupervised and Supervised Topic Modeling Techniques},
year={2012},
volume={E95-D},
number={5},
pages={1195-1205},
abstract={This paper describes the application of two attractive categories of topic modeling techniques to the problem of spoken document retrieval (SDR), viz. document topic model (DTM) and word topic model (WTM). Apart from using the conventional unsupervised training strategy, we explore a supervised training strategy for estimating these topic models, imagining a scenario that user query logs along with click-through information of relevant documents can be utilized to build an SDR system. This attempt has the potential to associate relevant documents with queries even if they do not share any of the query words, thereby improving on retrieval quality over the baseline system. Likewise, we also study a novel use of pseudo-supervised training to associate relevant documents with queries through a pseudo-feedback procedure. Moreover, in order to lessen SDR performance degradation caused by imperfect speech recognition, we investigate leveraging different levels of index features for topic modeling, including words, syllable-level units, and their combination. We provide a series of experiments conducted on the TDT (TDT-2 and TDT-3) Chinese SDR collections. The empirical results show that the methods deduced from our proposed modeling framework are very effective when compared with a few existing retrieval approaches.},
keywords={},
doi={10.1587/transinf.E95.D.1195},
ISSN={1745-1361},
month={May}
}

TY - JOUR
TI - Spoken Document Retrieval Leveraging Unsupervised and Supervised Topic Modeling Techniques
T2 - IEICE TRANSACTIONS on Information
SP - 1195
EP - 1205
AU - Kuan-Yu CHEN
AU - Hsin-Min WANG
AU - Berlin CHEN
PY - 2012
DO - 10.1587/transinf.E95.D.1195
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E95-D
IS - 5
JA - IEICE TRANSACTIONS on Information
Y1 - May 2012
AB - This paper describes the application of two attractive categories of topic modeling techniques to the problem of spoken document retrieval (SDR), viz. document topic model (DTM) and word topic model (WTM). Apart from using the conventional unsupervised training strategy, we explore a supervised training strategy for estimating these topic models, imagining a scenario that user query logs along with click-through information of relevant documents can be utilized to build an SDR system. This attempt has the potential to associate relevant documents with queries even if they do not share any of the query words, thereby improving on retrieval quality over the baseline system. Likewise, we also study a novel use of pseudo-supervised training to associate relevant documents with queries through a pseudo-feedback procedure. Moreover, in order to lessen SDR performance degradation caused by imperfect speech recognition, we investigate leveraging different levels of index features for topic modeling, including words, syllable-level units, and their combination. We provide a series of experiments conducted on the TDT (TDT-2 and TDT-3) Chinese SDR collections. The empirical results show that the methods deduced from our proposed modeling framework are very effective when compared with a few existing retrieval approaches.
ER -