Shigeru YOSHIDA Takashi MORIHARA Hironori YAHAGI Noriko ITANI
Asian-language text in 16-bit character codes cannot be compressed well by conventional text compression schemes that sample 8 bits at a time. Previously, we reported a word-based text compression method that uses 16-bit sampling to compress Japanese texts. This paper describes our further work in applying a word-based method with a static canonical Huffman encoder to both Japanese and Chinese texts. The method is designed to support a multilingual environment: the word dictionary and the canonical Huffman code table are replaced with those of the respective language. A computer simulation showed that the method is effective for both languages. The compression ratio obtained was slightly below 0.5 without Markov context, and around 0.4 with first-order Markov context.
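As an illustration of the canonical Huffman step mentioned above, the following is a minimal sketch, not the authors' encoder. It assumes word frequencies are already available as a plain dictionary; the function names `code_lengths` and `canonical_codes` and the romanized sample words are hypothetical. Because canonical codes are determined by the code lengths alone, a per-language code table can be stored compactly and swapped together with the word dictionary.

```python
import heapq

def code_lengths(freqs):
    """Return {symbol: code length} derived from an ordinary Huffman tree."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries are (weight, tiebreak, {symbol: depth}); the unique
    # integer tiebreak keeps the dictionaries from ever being compared.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def canonical_codes(lengths):
    """Assign canonical codes: order by (length, symbol) and count upward."""
    codes, code, prev_len = {}, 0, 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len          # pad when the length increases
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes

# Hypothetical word frequencies (romanized for readability).
word_freqs = {"no": 5, "wa": 2, "tokyo": 1, "nihon": 1}
table = canonical_codes(code_lengths(word_freqs))
```

Frequent words receive the shortest codes, and the decoder can rebuild `table` from the list of code lengths without storing the tree.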
Hongyuan CHEN Masato KITAKAMI Eiji FUJIWARA
One disadvantage of compressed data is its vulnerability: even a single corrupted bit may destroy the decompressed data completely. To address this, Variable-to-Fixed length Arithmetic Coding (VFAC) with error-detecting capability has been discussed. However, no implementable error recovery method for compressed data has been proposed. This paper proposes Burst Error Recovery Variable-to-Fixed length Arithmetic Coding (BERVFAC) as well as Error Detecting Variable-to-Fixed length Arithmetic Coding (EDVFAC). Both schemes achieve VF coding by inserting the internal states of the decompressor into the compressed data. The internal states consist of the width and offset of the sub-interval corresponding to the decompressed symbol, and they are also used for error detection. Convolutional operations are applied in encoding and decoding to propagate errors and improve error control capability. The proposed EDVFAC and BERVFAC are evaluated by theoretical analysis and computer simulation. The simulation results show that EDVFAC detects more than 99.99% of errors. For BERVFAC, over 99.95% of l-burst errors can be corrected for l ≤ 32, and more than 99.99% of other errors can be detected. The simulation results also show that the time overhead of decoding BERVFAC is about 12% when 10% of the received words are erroneous.
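To make the "width and offset of the sub-interval" concrete, here is a minimal sketch of plain arithmetic-coding interval narrowing with exact fractions. It is not EDVFAC or BERVFAC: the variable-to-fixed framing, the convolutional operations, and the way states are embedded in the stream are all omitted, and the names `build_model` and `narrow` as well as the three-symbol model are assumptions made for illustration.

```python
from fractions import Fraction

def build_model(probs):
    """Map each symbol to (cumulative start, probability) as exact fractions."""
    model, start = {}, Fraction(0)
    for sym, p in probs.items():
        p = Fraction(p)
        model[sym] = (start, p)
        start += p
    return model

def narrow(symbols, model):
    """Return the (symbol, offset, width) state after each coded symbol."""
    offset, width = Fraction(0), Fraction(1)
    states = []
    for sym in symbols:
        start, p = model[sym]
        offset += width * start   # move to the symbol's slice of the interval
        width *= p                # shrink the interval to that slice
        states.append((sym, offset, width))
    return states

# Hypothetical model: P(a) = 1/2, P(b) = 1/4, P(c) = 1/4.
model = build_model({"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)})
states = narrow("ab", model)
```

A decoder that recomputes the same offset and width can compare them with values carried in the compressed data; a mismatch would signal corruption, which is the general idea behind using these states for error detection.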
This paper proposes a modified LZSS suited to compressing Hangul data. It exploits two properties of Hangul to introduce a Hangul character token and a smaller string token. 1) A Hangul character consists of three letters: the initial sound, the medial sound, and the final sound, called Cho-seong, Jung-seong, and Jong-seong, respectively. 2) A Hangul code occupies 2 bytes. The first property is used to build the character token for Hangul characters, which account for most of the unmatched characters. That is, each unmatched Hangul character is replaced with one Hangul character token, represented by the Huffman codes of its Cho-seong, Jung-seong, and Jong-seong in that order, instead of two character tokens. The second property is used to shrink the string token, which handles matched strings. Since more than 75% of Hangul data consists of Hangul characters and Hangul codes are 2 bytes long, the LZSS window can be addressed in 2-byte units. As a result, the distance field and the length field of the string token can each be shortened by one bit. Compressing Hangul data with these tokens improved the compression ratio by about 3%.
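The two Hangul properties can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it uses the Unicode precomposed syllable block (starting at U+AC00, with 19 initial, 21 medial, and 28 final slots), whereas the 2-byte code used in the paper may be laid out differently. The names `decompose`, `compose`, and `field_bits`, and the 4096-byte window size, are hypothetical.

```python
HANGUL_BASE = 0xAC00   # first precomposed syllable in Unicode
NUM_CHO = 19           # initial consonants (Cho-seong)
NUM_JUNG = 21          # medial vowels (Jung-seong)
NUM_JONG = 28          # final consonants (Jong-seong); index 0 means "none"

def decompose(ch):
    """Split one Hangul syllable into (cho, jung, jong) indices."""
    index = ord(ch) - HANGUL_BASE
    if not 0 <= index < NUM_CHO * NUM_JUNG * NUM_JONG:
        raise ValueError("not a precomposed Hangul syllable")
    cho, rest = divmod(index, NUM_JUNG * NUM_JONG)
    jung, jong = divmod(rest, NUM_JONG)
    return cho, jung, jong

def compose(cho, jung, jong):
    """Inverse of decompose."""
    return chr(HANGUL_BASE + (cho * NUM_JUNG + jung) * NUM_JONG + jong)

def field_bits(span_bytes, unit_bytes):
    """Bits needed to address span_bytes in units of unit_bytes."""
    return (span_bytes // unit_bytes - 1).bit_length()

# chr(0xD55C) is the syllable "han": initial 18, medial 0, final 4.
parts = decompose(chr(0xD55C))

# A hypothetical 4096-byte window: byte addressing vs 2-byte addressing.
bits_per_byte_unit = field_bits(4096, 1)
bits_per_pair_unit = field_bits(4096, 2)
```

Each of the three indices would be Huffman-coded separately to form one character token, and addressing the window in 2-byte units saves one bit per field, matching the one-bit reduction described above.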