Author Search Result

[Author] Canasai KRUENGKRAI (4 hits)

  • Joint Chinese Word Segmentation and POS Tagging Using an Error-Driven Word-Character Hybrid Model

    Canasai KRUENGKRAI  Kiyotaka UCHIMOTO  Jun'ichi KAZAMA  Yiou WANG  Kentaro TORISAWA  Hitoshi ISAHARA  

     
    PAPER-Morphological/Syntactic Analysis

      Vol: E92-D No:12   Page(s): 2298-2305

    In this paper, we present a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging. Our word-character hybrid model offers high performance since it can handle both known and unknown words. We describe strategies that balance the learning of known-word and unknown-word characteristics, and propose an error-driven policy that achieves this balance by acquiring examples of unknown words from particular errors in a training corpus. We describe an efficient framework for training our model based on the Margin Infused Relaxed Algorithm (MIRA), evaluate our approach on the Penn Chinese Treebank, and show that it outperforms the state-of-the-art approaches reported in the literature.
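
    The following is a minimal sketch of a single 1-best MIRA weight update, the online learning rule that such a training framework builds on; feature extraction, lattice decoding, and the word-character hybrid model itself are abstracted away, and the function below is a placeholder illustration, not the authors' implementation.

    import numpy as np

    def mira_update(w, gold_feats, pred_feats, loss, C=0.01):
        """One clipped 1-best MIRA step: make the gold analysis outscore the
        predicted analysis by a margin of at least `loss`, moving w minimally."""
        diff = gold_feats - pred_feats        # feature-vector difference
        violation = loss - w.dot(diff)        # how far the margin is violated
        norm_sq = diff.dot(diff)
        if violation <= 0 or norm_sq == 0:
            return w                          # margin already satisfied
        tau = min(C, violation / norm_sq)     # clipped step size
        return w + tau * diff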

  • An EM-Based Approach for Mining Word Senses from Corpora

    Thatsanee CHAROENPORN  Canasai KRUENGKRAI  Thanaruk THEERAMUNKONG  Virach SORNLERTLAMVANICH  

     
    PAPER-Natural Language Processing

      Vol: E90-D No:4   Page(s): 775-782

    Manually collecting contexts of a target word and grouping them based on their meanings yields a set of word senses, but the task is quite tedious. Toward automated lexicography, this paper proposes a word-sense discrimination method based on two techniques: the EM algorithm and principal component analysis (PCA). A spherical Gaussian EM algorithm enhanced with PCA for robust initialization is proposed to cluster the senses of a target word automatically. Three variants of the algorithm, namely PCA, sGEM, and PCA-sGEM, are investigated using a gold standard dataset of two polysemous words. The clustering result is evaluated using the measures of purity and entropy as well as a more recent measure called normalized mutual information (NMI). The experimental results indicate that the proposed algorithms achieve promising performance in discriminating word senses and that PCA-sGEM outperforms the other two methods to some extent.
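
    As a rough, off-the-shelf analogue of the PCA-sGEM idea (not the authors' implementation), the sketch below uses scikit-learn to derive an initial partition of word-context vectors from their first principal component, refines it with a spherical-covariance Gaussian mixture fitted by EM, and scores the clustering against gold senses with NMI.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.mixture import GaussianMixture
    from sklearn.metrics import normalized_mutual_info_score

    def cluster_senses(context_vectors, gold_senses, n_senses=2):
        X = np.asarray(context_vectors, dtype=float)
        # PCA-based initialization: project contexts onto the first principal
        # component and split the projections into quantile bins.
        proj = PCA(n_components=1).fit_transform(X).ravel()
        edges = np.quantile(proj, np.linspace(0, 1, n_senses + 1)[1:-1])
        init = np.digitize(proj, edges)
        means_init = np.vstack([X[init == k].mean(axis=0) for k in range(n_senses)])
        # Spherical-Gaussian EM, seeded with the PCA-derived means.
        gmm = GaussianMixture(n_components=n_senses, covariance_type="spherical",
                              means_init=means_init)
        pred = gmm.fit_predict(X)
        return pred, normalized_mutual_info_score(gold_senses, pred)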

  • Statistical-Based Approach to Non-segmented Language Processing

    Virach SORNLERTLAMVANICH  Thatsanee CHAROENPORN  Shisanu TONGCHIM  Canasai KRUENGKRAI  Hitoshi ISAHARA  

     
    PAPER

      Vol: E90-D No:10   Page(s): 1565-1573

    Several approaches have been studied to cope with the distinctive features of non-segmented languages. When there is no explicit information about word boundaries, segmenting an input text is a formidable task in language processing. Not only a contemporary word list but also the usage of its words must be maintained to cover current texts. The accuracy and efficiency of higher-level processing rely heavily on this word boundary identification task. In this paper, we introduce statistics-based approaches to tackle the ambiguity in word segmentation. The word boundary identification problem is then treated as one component of a unified language-processing framework. To demonstrate this unified processing, we selectively study the tasks of language identification, word extraction, and dictionary-less search engines.
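
    The sketch below illustrates one common statistics-based segmenter of the kind discussed here: dynamic programming over candidate boundaries that picks the word sequence with the highest product of unigram probabilities. The probability table `word_probs` is a placeholder; the paper's actual models and word-extraction step are more involved.

    import math

    def segment(text, word_probs, max_word_len=10):
        """Return the segmentation of `text` that maximizes the sum of
        log unigram probabilities of its words."""
        n = len(text)
        best = [0.0] + [-math.inf] * n     # best log-probability up to position i
        back = [0] * (n + 1)               # backpointer to the previous boundary
        for i in range(1, n + 1):
            for j in range(max(0, i - max_word_len), i):
                word = text[j:i]
                if word in word_probs and best[j] > -math.inf:
                    score = best[j] + math.log(word_probs[word])
                    if score > best[i]:
                        best[i], back[i] = score, j
        words, i = [], n
        while i > 0:                       # recover the words from backpointers
            words.append(text[back[i]:i])
            i = back[i]
        return list(reversed(words))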

  • Construction of Thai Lexicon from Existing Dictionaries and Texts on the Web

    Thatsanee CHAROENPORN  Canasai KRUENGKRAI  Thanaruk THEERAMUNKONG  Virach SORNLERTLAMVANICH  

     
    PAPER-Natural Language Processing

      Vol: E89-D No:7   Page(s): 2286-2293

    A lexicon is an important linguistic resource needed for both shallow and deep language processing. Currently, few machine-readable Thai dictionaries are available, and most of them do not satisfy computational requirements. This paper presents the design of a Thai lexicon named the TCL's Computational Lexicon (TCLLEX) and proposes a method to construct a large-scale Thai lexicon by reusing two existing dictionaries and a large number of texts on the Internet. In addition to the morphological, syntactic, semantic case-role, and logical information in the existing dictionaries, a type of semantic constraint called selectional preference is automatically acquired by analyzing Thai texts on the web and then added to the lexicon. In the acquisition of the selectional preferences, the Bayesian Information Criterion (BIC) is applied as the measure in a tree cut model. Experiments were conducted to verify the feasibility and effectiveness of the obtained selectional preferences.
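
    As a hedged illustration of how BIC can serve as the measure for choosing a cut in a tree cut model, the sketch below compares two candidate cuts for a verb slot by a likelihood term penalized by the number of classes in the cut; the scoring is simplified and illustrative, not the authors' exact formulation, and the example counts are hypothetical.

    import math

    def bic_score(cut_counts, total):
        """cut_counts: observed noun frequency of each class in the cut;
        total: total number of noun observations for the verb slot."""
        log_lik = sum(c * math.log(c / total) for c in cut_counts if c > 0)
        k = len(cut_counts)                 # one probability parameter per class
        return log_lik - 0.5 * k * math.log(total)

    coarse_cut = [80, 20]                   # e.g. a cut of two broad classes
    fine_cut = [50, 30, 15, 5]              # a deeper, more specific cut
    # The cut with the higher BIC score balances fit against model complexity.
    preferred = "fine" if bic_score(fine_cut, 100) > bic_score(coarse_cut, 100) else "coarse"
    print(preferred)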
