This paper suggests modified LZSS which is suitable for compressing Hangul data by Hangul character token and the string token with small size based on Hangul properties. The Hangul properties can be described in 2 ways. 1) The structure of a Hangul character consists of 3 letters: The first sound letter, the middle sound letter, and the last sound letter which are called Cho-seong, Jung-seong, and Jong-seong, respectively. 2) The code of Hangul is represented by 2 bytes. The first property is used for making the character token processing Hangul characters which occupies most of the unmatched characters. That is, the unmatched Hangul characters are replaced with one Hangul character token represented by Huffman codes of Cho-seong, Jung-seong, and Jong-seong in regular sequence, instead of 2 character tokens. The second property is used to shorten the size of the string token processing matched string. In other words, since more than 75% of Hangul data are Hangul and Hangul codes are constructed in 2 bytes, the addresses of the window of LZSS can be assigned in 2-byte unit. As a result, the distance field and the length field of the string token can be lessened by one bit each. After compressing Hangul data through these tokens, about 3% of improvement could be made in compression ratio.
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Copy
Jae Young LEE, Keong Mo SUNG, "Modification of LZSS by Using Structures of Hangul Characters for Hangul Text Compression" in IEICE TRANSACTIONS on Fundamentals,
vol. E79-A, no. 11, pp. 1904-1910, November 1996, doi: .
Abstract: This paper suggests modified LZSS which is suitable for compressing Hangul data by Hangul character token and the string token with small size based on Hangul properties. The Hangul properties can be described in 2 ways. 1) The structure of a Hangul character consists of 3 letters: The first sound letter, the middle sound letter, and the last sound letter which are called Cho-seong, Jung-seong, and Jong-seong, respectively. 2) The code of Hangul is represented by 2 bytes. The first property is used for making the character token processing Hangul characters which occupies most of the unmatched characters. That is, the unmatched Hangul characters are replaced with one Hangul character token represented by Huffman codes of Cho-seong, Jung-seong, and Jong-seong in regular sequence, instead of 2 character tokens. The second property is used to shorten the size of the string token processing matched string. In other words, since more than 75% of Hangul data are Hangul and Hangul codes are constructed in 2 bytes, the addresses of the window of LZSS can be assigned in 2-byte unit. As a result, the distance field and the length field of the string token can be lessened by one bit each. After compressing Hangul data through these tokens, about 3% of improvement could be made in compression ratio.
URL: https://globals.ieice.org/en_transactions/fundamentals/10.1587/e79-a_11_1904/_p
Copy
@ARTICLE{e79-a_11_1904,
author={Jae Young LEE, Keong Mo SUNG, },
journal={IEICE TRANSACTIONS on Fundamentals},
title={Modification of LZSS by Using Structures of Hangul Characters for Hangul Text Compression},
year={1996},
volume={E79-A},
number={11},
pages={1904-1910},
abstract={This paper suggests modified LZSS which is suitable for compressing Hangul data by Hangul character token and the string token with small size based on Hangul properties. The Hangul properties can be described in 2 ways. 1) The structure of a Hangul character consists of 3 letters: The first sound letter, the middle sound letter, and the last sound letter which are called Cho-seong, Jung-seong, and Jong-seong, respectively. 2) The code of Hangul is represented by 2 bytes. The first property is used for making the character token processing Hangul characters which occupies most of the unmatched characters. That is, the unmatched Hangul characters are replaced with one Hangul character token represented by Huffman codes of Cho-seong, Jung-seong, and Jong-seong in regular sequence, instead of 2 character tokens. The second property is used to shorten the size of the string token processing matched string. In other words, since more than 75% of Hangul data are Hangul and Hangul codes are constructed in 2 bytes, the addresses of the window of LZSS can be assigned in 2-byte unit. As a result, the distance field and the length field of the string token can be lessened by one bit each. After compressing Hangul data through these tokens, about 3% of improvement could be made in compression ratio.},
keywords={},
doi={},
ISSN={},
month={November},}
Copy
TY - JOUR
TI - Modification of LZSS by Using Structures of Hangul Characters for Hangul Text Compression
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 1904
EP - 1910
AU - Jae Young LEE
AU - Keong Mo SUNG
PY - 1996
DO -
JO - IEICE TRANSACTIONS on Fundamentals
SN -
VL - E79-A
IS - 11
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - November 1996
AB - This paper suggests modified LZSS which is suitable for compressing Hangul data by Hangul character token and the string token with small size based on Hangul properties. The Hangul properties can be described in 2 ways. 1) The structure of a Hangul character consists of 3 letters: The first sound letter, the middle sound letter, and the last sound letter which are called Cho-seong, Jung-seong, and Jong-seong, respectively. 2) The code of Hangul is represented by 2 bytes. The first property is used for making the character token processing Hangul characters which occupies most of the unmatched characters. That is, the unmatched Hangul characters are replaced with one Hangul character token represented by Huffman codes of Cho-seong, Jung-seong, and Jong-seong in regular sequence, instead of 2 character tokens. The second property is used to shorten the size of the string token processing matched string. In other words, since more than 75% of Hangul data are Hangul and Hangul codes are constructed in 2 bytes, the addresses of the window of LZSS can be assigned in 2-byte unit. As a result, the distance field and the length field of the string token can be lessened by one bit each. After compressing Hangul data through these tokens, about 3% of improvement could be made in compression ratio.
ER -