In this study, we consider the data compression with side information available at both the encoder and the decoder. The information source is assigned to a variable-length code that does not have to satisfy the prefix-free constraints. We define several classes of codes whose codeword lengths and error probabilities satisfy worse-case criteria in terms of side-information. As a main result, we investigate the exact first-order asymptotics with second-order bounds scaled as Θ(√n) as blocklength n increases under the regime of nonvanishing error probabilities. To get this result, we also derive its one-shot bounds by employing the cutoff operation.
The achievability part of the rate-distortion theorem is proved by showing existence of good codes. For i.i.d. sources, two methods showing existence are known; random coding and non-random coding. For general sources, however, no proof in which good codes are constructed with non-random coding is found. In this paper, with a non-random method of code construction, we prove the achievability part of the rate-distortion theorem for general sources. Moreover, we also prove a stochastic variation of the rate-distortion theorem with the same method.
Kengo HASHIMOTO Ken-ichi IWATA
The class of k-bit delay decodable codes, source codes allowing decoding delay of at most k bits for k≥0, can attain a shorter average codeword length than Huffman codes. This paper discusses the general properties of the class of k-bit delay decodable codes with a finite number of code tables and proves two theorems which enable us to limit the scope of codes to be considered when discussing optimal k-bit delay decodable codes.
Koshi SHIMADA Shota SAITO Toshiyasu MATSUSHIMA
The context tree model has the property that the occurrence probability of symbols is determined from a finite past sequence and is a broader class of sources that includes i.i.d. or Markov sources. This paper proposes a non-stationary source with context tree models that change from interval to interval. The Bayes code for this source requires weighting of the posterior probabilities of the context tree models and change points, so the computational complexity of it usually increases to exponential order. Therefore, the challenge is how to reduce the computational complexity. In this paper, we propose a special class of prior probability distribution of context tree models and change points and develop an efficient Bayes coding algorithm by combining two existing Bayes coding algorithms. The algorithm minimizes the Bayes risk function of the proposed source in this paper, and the computational complexity of the proposed algorithm is polynomial order. We investigate the behavior and performance of the proposed algorithm by conducting experiments.
Tetsunao MATSUTA Tomohiko UYEMATSU
We consider the coding problem for lossy source coding with side information at the decoder, which is known as the Wyner-Ziv source coding problem. The goal of the coding problem is to find the minimum rate such that the probability of exceeding a given distortion threshold is less than the desired level. We give an equivalent expression of the minimum rate by using the chromatic number and notions of covering of a set. This allows us to analyze the coding problem in terms of graph coloring and covering.
In order to erase data including confidential information stored in storage devices, an unrelated and random sequence is usually overwritten, which prevents the data from being restored. The problem of minimizing the cost for information erasure when the amount of information leakage of the confidential information should be less than or equal to a constant asymptotically has been introduced by T. Matsuta and T. Uyematsu. Whereas the minimum cost for overwriting has been given for general sources, a single-letter characterization for stationary memoryless sources is not easily derived. In this paper, we give single-letter characterizations for stationary memoryless sources under two types of restrictions: one requires the output distribution of the encoder to be independent and identically distributed (i.i.d.) and the other requires it to be memoryless but not necessarily i.i.d. asymptotically. The characterizations indicate the relation among the amount of information leakage, the minimum cost for information erasure and the rate of the size of uniformly distributed sequences. The obtained results show that the minimum costs are different between these restrictions.
Tetsunao MATSUTA Tomohiko UYEMATSU
In this paper, we consider a source coding with side information partially used at the decoder through a codeword. We assume that there exists a relative delay (or gap) of the correlation between the source sequence and side information. We also assume that the delay is unknown but the maximum of possible delays is known to two encoders and the decoder, where we allow the maximum of delays to change by the block length. In this source coding, we give an inner bound and an outer bound on the achievable rate region, where the achievable rate region is the set of rate pairs of encoders such that the decoding error probability vanishes as the block length tends to infinity. Furthermore, we clarify that the inner bound coincides with the outer bound when the maximum of delays for the block length converges to a constant.
Takahiro OTA Hiroyoshi MORITA Akiko MANADA
The technique of lossless compression via substring enumeration (CSE) is a kind of enumerative code and uses a probabilistic model built from the circular string of an input source for encoding a one-dimensional (1D) source. CSE is applicable to two-dimensional (2D) sources, such as images, by dealing with a line of pixels of a 2D source as a symbol of an extended alphabet. At the initial step of CSE encoding process, we need to output the number of occurrences of all symbols of the extended alphabet, so that the time complexity increases exponentially when the size of source becomes large. To reduce computational time, we can rearrange pixels of a 2D source into a 1D source string along a space-filling curve like a Hilbert curve. However, information on adjacent cells in a 2D source may be lost in the conversion. To reduce the time complexity and compress a 2D source without converting to a 1D source, we propose a new CSE which can encode a 2D source in a block-by-block fashion instead of in a line-by-line fashion. The proposed algorithm uses the flat torus of an input 2D source as a probabilistic model instead of the circular string of the source. Moreover, we prove the asymptotic optimality of the proposed algorithm for 2D general sources.
Two kinds of problems - multiterminal hypothesis testing and one-to-many lossy source coding - are investigated in a unified way. It is demonstrated that a simple key idea, which is developed by Iriyama for one-to-one source coding systems, can be applied to multiterminal source coding systems. In particular, general bounds on the error exponents for multiterminal hypothesis testing and one-to-many lossy source coding are given.
Let X, Y be two correlated discrete random variables. We consider an estimation of X from encoded data φ(Y) of Y by some encoder function φ(Y). We derive an inequality describing a relation of the correct probability of estimation and the mutual information between X and φ(Y). This inequality may be useful for the secure analysis of crypto system when we use the success probability of estimating secret data as a security criterion. It also provides an intuitive meaning of the secrecy exponent in the strong secrecy criterion.
Junya HIRAMATSU Motohiko ISAKA
This letter presents numerical results of lossy source coding for non-uniformly distributed binary source with trellis codes. The results show how the performance of trellis codes approaches the rate-distortion function in terms of the number of states.
Tunstall code is known as an optimal variable-to-fixed length (VF) lossless source code under the criterion of average coding rate, which is defined as the codeword length divided by the average phrase length. In this paper we define the average coding rate of a VF code as the expectation of the pointwise coding rate defined by the codeword length divided by the phrase length. We call this type of average coding rate the average pointwise coding rate. In this paper, a new VF code is proposed. An incremental parsing tree construction algorithm like the one that builds Tunstall parsing tree is presented. It is proved that this code is optimal under the criterion of the average pointwise coding rate, and that the average pointwise coding rate of this code converges asymptotically to the entropy of the stationary memoryless source emitting the data to be encoded. Moreover, it is proved that the proposed code attains better worst-case coding rate than Tunstall code.
Shota SAITO Toshiyasu MATSUSHIMA
This letter treats the problem of lossless fixed-to-variable length source coding in moderate deviation regime. We investigate the behavior of the overflow probability of the Bayes code. Our result clarifies that the behavior of the overflow probability of the Bayes code is similar to that of the optimal non-universal code for i.i.d. sources.
In both theoretical analysis and practical use for an antidictionary coding algorithm, an important problem is how to encode an antidictionary of an input source. This paper presents a proposal for a compact tree representation of an antidictionary built from a circular string for an input source. We use a technique for encoding a tree in the compression via substring enumeration to encode a tree representation of the antidictionary. Moreover, we propose a new two-pass universal antidictionary coding algorithm by means of the proposal tree representation. We prove that the proposed algorithm is asymptotic optimal for a stationary ergodic source.
Shota SAITO Toshiyasu MATSUSHIMA
We treat lossless fixed-to-variable length source coding under general sources for finite block length setting. We evaluate the threshold of the overflow probability for prefix and non-prefix codes in terms of the smooth max-entropy. We clarify the difference of the thresholds between prefix and non-prefix codes for finite block length. Further, we discuss our results under the asymptotic block length setting.
Tetsunao MATSUTA Tomohiko UYEMATSU
In this paper, we deal with the fixed-length lossy compression, where a fixed-length sequence emitted from the information source is encoded into a codeword, and the source sequence is reproduced from the codeword with a certain distortion. We give lower and upper bounds on the minimum number of codewords such that the probability of exceeding a given distortion level is less than a given probability. These bounds are characterized by using the α-mutual information of order infinity. Further, for i.i.d. binary sources, we provide numerical examples of tight upper bounds which are computable in polynomial time in the blocklength.
Average coding rate of a multi-shot Tunstall code, which is a variation of variable-to-fixed length (VF) lossless source codes, for stationary memoryless sources is investigated. A multi-shot VF code parses a given source sequence to variable-length blocks and encodes them to fixed-length codewords. If we consider the situation that the parsing count is fixed, overall multi-shot VF code can be treated as a one-shot VF code. For this setting of Tunstall code, the compression performance is evaluated using two criterions. The first one is the average coding rate which is defined as the codeword length divided by the average block length. The second one is the expectation of the pointwise coding rate. It is proved that both of the above average coding rate converge to the entropy of a stationary memoryless source under the assumption that the geometric mean of the leaf counts of the multi-shot Tunstall parsing trees goes to infinity.
Ken-ichi IWATA Mitsuharu ARIMURA
A generalization of compression via substring enumeration (CSE) for k-th order Markov sources with a finite alphabet is proposed, and an upper bound of the codeword length of the proposed method is presented. We analyze the worst case maximum redundancy of CSE for k-th order Markov sources with a finite alphabet. The compression ratio of the proposed method asymptotically converges to the optimal one for k-th order Markov sources with a finite alphabet if the length n of a source string tends to infinity.
Shota SAITO Nozomi MIYA Toshiyasu MATSUSHIMA
This paper considers universal lossless variable-length source coding problem and investigates the Bayes code from viewpoints of the distribution of its codeword lengths. First, we show that the codeword lengths of the Bayes code satisfy the asymptotic normality. This study can be seen as the investigation on the asymptotic shape of the distribution of codeword lengths. Second, we show that the codeword lengths of the Bayes code satisfy the law of the iterated logarithm. This study can be seen as the investigation on the asymptotic end points of the distribution of codeword lengths. Moreover, the overflow probability, which represents the bottom of the distribution of codeword lengths, is studied for the Bayes code. We derive upper and lower bounds of the infimum of a threshold on the overflow probability under the condition that the overflow probability does not exceed ε∈(0,1). We also analyze the necessary and sufficient condition on a threshold for the overflow probability of the Bayes code to approach zero asymptotically.
Yohei ONISHI Hidaka KINUGASA Takashi MURAKI Motohiko ISAKA
We present numerical results on the rate-distortion performance of convolutional coding for the binary symmetric source, and show how convolutional codes approach the rate-distortion bound by increasing the trellis states.