
Chapter 5

Data Compression

Peng-Hua Wang

Graduate Inst. of Comm. Engineering

National Taipei University


Chapter Outline

Chap. 5 Data Compression


5.1 Example of Codes
5.2 Kraft Inequality
5.3 Optimal Codes
5.4 Bound on Optimal Code Length
5.5 Kraft Inequality for Uniquely Decodable Codes
5.6 Huffman Codes
5.7 Some Comments on Huffman Codes
5.8 Optimality of Huffman Codes
5.9 Shannon-Fano-Elias Coding
5.10 Competitive Optimality of the Shannon Code
5.11 Generation of Discrete Distributions from Fair Coins
5.1 Example of Codes



Source code

Definition (Source code) A source code C for a random variable X is a mapping from X, the range of X, to D*, the set of finite-length strings of symbols from a D-ary alphabet. Let C(x) denote the codeword corresponding to x and let l(x) denote the length of C(x).
■ For example, C(red) = 00, C(blue) = 11 is a source code with mapping from X = {red, blue} to D* with alphabet D = {0, 1}.



Source code

Definition (Expected length) The expected length L(C) of a source code C for a random variable X with probability mass function p(x) is given by

L(C) = ∑_{x∈X} p(x) l(x),

where l(x) is the length of the codeword associated with x.



Example

Example 5.1.1 Let X be a random variable with the following distribution and codeword assignment:

Pr{X = 1} = 1/2, codeword C(1) = 0
Pr{X = 2} = 1/4, codeword C(2) = 10
Pr{X = 3} = 1/8, codeword C(3) = 110
Pr{X = 4} = 1/8, codeword C(4) = 111

■ H(X) = 1.75 bits.
■ E[l(X)] = 1.75 bits.
■ uniquely decodable
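
As a quick numerical check (a sketch added here, not part of the original slides), the snippet below computes H(X) and E[l(X)] for the distribution and code of Example 5.1.1; both evaluate to 1.75 bits, so this code meets the entropy bound exactly.

```python
import math

# Distribution and codeword lengths from Example 5.1.1.
p = [1/2, 1/4, 1/8, 1/8]
lengths = [1, 2, 3, 3]          # |C(1)|=1, |C(2)|=2, |C(3)|=3, |C(4)|=3

entropy = -sum(pi * math.log2(pi) for pi in p)
expected_len = sum(pi * li for pi, li in zip(p, lengths))

print(entropy, expected_len)    # both print 1.75
```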



Example

Example 5.1.2 Consider the following example.

Pr{X = 1} = 1/3, codeword C(1) = 0
Pr{X = 2} = 1/3, codeword C(2) = 10
Pr{X = 3} = 1/3, codeword C(3) = 11

■ H(X) = 1.58 bits.
■ E[l(X)] = 1.66 bits.
■ uniquely decodable



Source code

Definition (nonsingular) A code is said to be nonsingular if every element of the range of X maps into a different string in D*; that is,

x ≠ x′ ⇒ C(x) ≠ C(x′).

Definition (extension code) The extension C* of a code C is the mapping from finite-length strings of X to finite-length strings of D, defined by

C(x_1 x_2 ··· x_n) = C(x_1)C(x_2) ··· C(x_n),

where C(x_1)C(x_2) ··· C(x_n) indicates concatenation of the corresponding codewords.
Example 5.1.4 If C(x_1) = 00 and C(x_2) = 11, then C(x_1 x_2) = 0011.



Source code

Definition (uniquely decodable) A code is called uniquely decodable if its extension is nonsingular.

Definition (prefix code) A code is called a prefix code or an instantaneous code if no codeword is a prefix of any other codeword.
■ For an instantaneous code, the symbol x_i can be decoded as soon as we come to the end of the codeword corresponding to it.
■ For example, the binary string 01011111010 produced by the code of Example 5.1.1 is parsed as 0, 10, 111, 110, 10.
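
The instantaneous property can be made concrete with a greedy decoder. The sketch below is illustrative (not from the slides): it emits a symbol as soon as the buffered bits match a codeword of the Example 5.1.1 code.

```python
# Codebook of Example 5.1.1, inverted for decoding.
code = {1: "0", 2: "10", 3: "110", 4: "111"}
inverse = {w: x for x, w in code.items()}

def decode(bits):
    """Greedy prefix decoding: emit a symbol as soon as a codeword is complete."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:          # end of a codeword reached
            out.append(inverse[buf])
            buf = ""
    return out

print(decode("01011111010"))        # [1, 2, 4, 3, 2], i.e. 0, 10, 111, 110, 10
```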



Source code

X   Singular   Nonsingular, but not UD   UD, but not instantaneous   Instantaneous
1   0          0                         10                          0
2   0          010                       00                          10
3   0          01                        11                          110
4   0          10                        110                         111
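
Checking the prefix condition is mechanical. The helper below is a small sketch (the code lists are simply those of the table above, not part of the slides): it reports whether any codeword is a prefix of another, so only the instantaneous code passes.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of a different codeword."""
    for i, a in enumerate(codewords):
        for j, b in enumerate(codewords):
            if i != j and b.startswith(a):
                return False
    return True

codes = {
    "singular":            ["0", "0", "0", "0"],
    "nonsingular, not UD": ["0", "010", "01", "10"],
    "UD, not inst.":       ["10", "00", "11", "110"],
    "instantaneous":       ["0", "10", "110", "111"],
}
for name, cw in codes.items():
    print(name, is_prefix_free(cw))   # only the instantaneous code prints True
```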



Decoding Tree



5.2 Kraft Inequality



Kraft Inequality

Theorem 5.2.1 (Kraft Inequality) For any instantaneous code (prefix code) over an alphabet of size D, the codeword lengths l_1, l_2, ..., l_m must satisfy the inequality

∑_i D^{-l_i} ≤ 1.

Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths.
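
The converse can be made constructive. The sketch below (one standard canonical construction, assumed here rather than taken from the slides) verifies the Kraft sum and, when it is at most 1, assigns D-ary codewords in order of nondecreasing length; the resulting code is prefix-free. For the lengths 1, 2, 3, 3 of Example 5.1.1 it reproduces the code 0, 10, 110, 111.

```python
def kraft_sum(lengths, D=2):
    return sum(D ** (-l) for l in lengths)

def prefix_code_from_lengths(lengths, D=2):
    """Canonical construction: assign codewords in order of nondecreasing length."""
    assert kraft_sum(lengths, D) <= 1 + 1e-12, "Kraft inequality violated"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    value, prev_len = 0, 0
    for i in order:
        l = lengths[i]
        value *= D ** (l - prev_len)      # extend the current value to the new length
        digits, v = [], value
        for _ in range(l):                # write value as an l-digit D-ary string
            digits.append(str(v % D))
            v //= D
        codewords[i] = "".join(reversed(digits))
        value += 1                        # next codeword starts past this subtree
        prev_len = l
    return codewords

print(prefix_code_from_lengths([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```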



Extended Kraft Inequality

Theorem 5.2.2 (Extended Kraft Inequality) For any countably infinite set of codewords that form a prefix code, the codeword lengths satisfy the extended Kraft inequality,

∑_{i=1}^{∞} D^{-l_i} ≤ 1.

Conversely, given any l_1, l_2, ... satisfying the extended Kraft inequality, we can construct a prefix code with these codeword lengths.



5.3 Optimal Codes



Minimize expected length

Problem Given the source pmf p_1, p_2, ..., p_m, find the codeword lengths l_1, l_2, ..., l_m that minimize the expected code length

L = ∑_i p_i l_i

subject to the constraint

∑_i D^{-l_i} ≤ 1.

■ l_1, l_2, ..., l_m are integers.
■ We first relax the original integer programming problem: the restriction to integer lengths is relaxed to real numbers.
■ Solve by Lagrange multipliers.



Solve the relaxed problem

J = ∑_i p_i l_i + λ ∑_i D^{-l_i},    ∂J/∂l_i = p_i − λ D^{-l_i} ln D

∂J/∂l_i = 0 ⇒ D^{-l_i} = p_i / (λ ln D)

∑_i D^{-l_i} ≤ 1 ⇒ λ = 1/ln D ⇒ p_i = D^{-l_i}

⇒ optimal code length l_i* = −log_D p_i

The expected code length is

L = ∑_i p_i l_i* = −∑_i p_i log_D p_i = H_D(X)



Expected code length

Theorem 5.3.1 The expected length L of any instantaneous D-ary code for a random variable X is greater than or equal to the entropy H_D(X); that is,

L ≥ H_D(X)

with equality if and only if D^{-l_i} = p_i.

Proof.

L − H_D(X) = ∑ p_i l_i + ∑ p_i log_D p_i
           = −∑ p_i log_D D^{-l_i} + ∑ p_i log_D p_i
           = D(p||q) ≥ 0

where q_i = D^{-l_i}.

WRONG!! Because ∑ D^{-l_i} ≤ 1, q may not be a valid distribution.



Expected code length

Proof.

L − H_D(X) = ∑ p_i l_i + ∑ p_i log_D p_i
           = −∑ p_i log_D D^{-l_i} + ∑ p_i log_D p_i

Let c = ∑_j D^{-l_j} and r_i = D^{-l_i} / c. Then

L − H_D(X) = ∑ p_i log_D (p_i / r_i) − log_D c
           = D(p||r) + log_D (1/c) ≥ 0

since D(p||r) ≥ 0 and c ≤ 1 by the Kraft inequality. Equality (L = H_D(X)) holds iff p_i = D^{-l_i}; that is, iff −log_D p_i is an integer for all i. ∎



D-adic

Definition (D-adic) A probability distribution is called D-adic if each of the probabilities is equal to D^{-n} for some integer n.
■ L = H_D(X) if and only if the distribution of X is D-adic.
■ How to find the optimal code? ⇒ Find the D-adic distribution that is closest (in the relative entropy sense) to the distribution of X.
■ What is the upper bound on the optimal code length?



5.4 Bound on Optimal Code Length



Optimal code length

Theorem 5.4.1 Let l_1*, l_2*, ..., l_m* be optimal codeword lengths for a source distribution p and a D-ary alphabet, and let L* be the associated expected length of an optimal code (L* = ∑ p_i l_i*). Then

H_D(X) ≤ L* < H_D(X) + 1.

Proof. Let l_i = ⌈log_D (1/p_i)⌉, where ⌈x⌉ is the smallest integer ≥ x. These lengths satisfy the Kraft inequality since

∑ D^{-⌈log_D (1/p_i)⌉} ≤ ∑ D^{-log_D (1/p_i)} = ∑ p_i = 1.

This choice of codeword lengths satisfies

log_D (1/p_i) ≤ l_i < log_D (1/p_i) + 1.



Optimal code length

Multiplying by p_i and summing over i, we obtain

H_D(X) ≤ L < H_D(X) + 1.

Since L* is the expected length of the optimal code,

L* ≤ L < H_D(X) + 1.

On the other hand, from Theorem 5.3.1,

L* ≥ H_D(X).

Therefore,

H_D(X) ≤ L* < H_D(X) + 1. ∎
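
A numerical check of this bound (illustrative, not from the slides): take the non-dyadic pmf of Example 5.6.1, assign the Shannon lengths l_i = ⌈log_D (1/p_i)⌉ with D = 2, and verify that they satisfy the Kraft inequality and that H(X) ≤ L < H(X) + 1.

```python
import math

def shannon_lengths(p):
    """Binary Shannon code lengths l_i = ceil(log2(1/p_i))."""
    return [math.ceil(-math.log2(pi)) for pi in p]

p = [0.25, 0.25, 0.2, 0.15, 0.15]        # distribution of Example 5.6.1
lengths = shannon_lengths(p)             # [2, 2, 3, 3, 3]

H = -sum(pi * math.log2(pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))
kraft = sum(2.0 ** (-l) for l in lengths)

print(lengths, kraft)                    # Kraft sum 0.875 <= 1
print(H, L, H + 1)                       # H ≈ 2.285 <= L = 2.5 < H + 1
```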



Optimal code length

Consider a system in which we send a sequence of n symbols from X. Define L_n to be the expected codeword length per input symbol,

L_n = (1/n) ∑ p(x_1, x_2, ..., x_n) l(x_1, x_2, ..., x_n) = (1/n) E[l(X_1, X_2, ..., X_n)].

We have

H(X_1, X_2, ..., X_n) ≤ E[l(X_1, X_2, ..., X_n)] < H(X_1, X_2, ..., X_n) + 1.

If X_1, X_2, ..., X_n are i.i.d., we have

H(X) ≤ L_n < H(X) + 1/n.


Optimal code length

If X_1, X_2, ..., X_n are independent but not identically distributed, we have

(1/n) H(X_1, X_2, ..., X_n) ≤ L_n < (1/n) H(X_1, X_2, ..., X_n) + 1/n.

If the random process is stationary,

(1/n) H(X_1, X_2, ..., X_n) → H(𝒳).



Optimal code length

Theorem 5.4.2 The minimum expected codeword length per symbol satisfies

(1/n) H(X_1, X_2, ..., X_n) ≤ L_n* < (1/n) H(X_1, X_2, ..., X_n) + 1/n.

If X_1, X_2, ..., X_n is a stationary random process,

L_n* → H(𝒳),

where H(𝒳) is the entropy rate of the random process.



Wrong Code

Theorem 5.4.3 (Wrong Code) The expected length under p(x) of the code assignment l(x) = ⌈log (1/q(x))⌉ satisfies

H(p) + D(p||q) ≤ E_p[l(X)] < H(p) + D(p||q) + 1.

Proof. The expected code length is

E[l(X)] = ∑_x p(x) ⌈log (1/q(x))⌉ < ∑_x p(x) (log (1/q(x)) + 1)
        = ∑_x p(x) log (1/q(x)) + 1
        = ∑_x p(x) (log (p(x)/q(x)) − log p(x)) + 1
        = H(p) + D(p||q) + 1.

The lower bound can be derived similarly. ∎
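
The mismatch penalty can also be checked numerically. In the sketch below (illustrative, not from the slides), p is the pmf of Example 5.1.1 and q is a uniform pmf chosen here purely for illustration; the lengths ⌈log (1/q(x))⌉ give an expected length under p that lands between H(p) + D(p||q) and H(p) + D(p||q) + 1.

```python
import math

p = [0.5, 0.25, 0.125, 0.125]            # true distribution (Example 5.1.1)
q = [0.25, 0.25, 0.25, 0.25]             # assumed (wrong) distribution, chosen for illustration

lengths = [math.ceil(-math.log2(qi)) for qi in q]     # l(x) = ceil(log 1/q(x))
E_l = sum(pi * li for pi, li in zip(p, lengths))

H_p = -sum(pi * math.log2(pi) for pi in p)
D_pq = sum(pi * math.log2(pi / qi) for pi in p)

# H(p) + D(p||q) <= E_p[l(X)] < H(p) + D(p||q) + 1
print(H_p + D_pq, E_l, H_p + D_pq + 1)   # 2.0, 2.0, 3.0
```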



5.5 Kraft Inequality for Uniquely Decodable Codes



Uniquely Decodable Codes

Theorem 5.5.1 (McMillan) The codeword lengths of any uniquely decodable D-ary code must satisfy the Kraft inequality

∑ D^{-l_i} ≤ 1.

Conversely, given a set of codeword lengths that satisfy this inequality, it is possible to construct a uniquely decodable code with these codeword lengths.

Proof. Consider

(∑_x D^{-l(x)})^k = ∑_{x_1} ∑_{x_2} ··· ∑_{x_k} D^{-l(x_1)} D^{-l(x_2)} ··· D^{-l(x_k)}
                  = ∑_{x_1,...,x_k ∈ X^k} D^{-l(x_1)} D^{-l(x_2)} ··· D^{-l(x_k)} = ∑_{x^k ∈ X^k} D^{-l(x^k)},

where l(x^k) = l(x_1) + ··· + l(x_k) is the length of the code for the sequence x^k = x_1 x_2 ··· x_k.



Uniquely Decodable Codes

We now gather the terms by word lengths to obtain

∑_{x^k ∈ X^k} D^{-l(x^k)} = ∑_{m=1}^{k·l_max} a(m) D^{-m},

where l_max is the maximum codeword length and a(m) is the number of source sequences x^k mapping into codewords of length m.

Since the code is uniquely decodable, there is at most one sequence mapping into each code m-sequence, and there are at most D^m code m-sequences. Thus,

a(m) ≤ D^m.



Uniquely Decodable Codes

Therefore,

(∑_x D^{-l(x)})^k = ∑_{m=1}^{k·l_max} a(m) D^{-m} ≤ ∑_{m=1}^{k·l_max} D^m D^{-m} = k·l_max,

or

∑_j D^{-l_j} ≤ (k·l_max)^{1/k}.

Since this inequality is true for all k, it is true in the limit as k → ∞. Since (k·l_max)^{1/k} → 1, we have

∑_j D^{-l_j} ≤ 1. ∎



5.6 Huffman Codes



Example 5.6.1

X = {1, 2, 3, 4, 5}, p = {0.25, 0.25, 0.2, 0.15, 0.15}, binary.



Example 5.6.2

X = {1, 2, 3, 4, 5}, p = {0.25, 0.25, 0.2, 0.15, 0.15}, ternary.



Example 5.6.3

X = {1, 2, 3, 4, 5, 6}, p = {0.25, 0.25, 0.2, 0.1, 0.1, 0.1}, ternary.


■ At the kth stage, the total number of symbols is 1 + k(D − 1); dummy symbols with probability 0 are added if needed to reach this count (see the sketch below).
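
The sketch below is one standard way to implement D-ary Huffman coding (an illustration, not code from the slides): it pads the alphabet with zero-probability dummy symbols until the count has the form 1 + k(D − 1), then repeatedly merges the D least likely nodes. It returns the codeword lengths for Examples 5.6.1–5.6.3.

```python
import heapq
import itertools

def huffman_lengths(p, D=2):
    """Return codeword lengths of a D-ary Huffman code for the pmf p."""
    counter = itertools.count()                 # tie-breaker so the heap never compares lists
    heap = [(pi, next(counter), [i]) for i, pi in enumerate(p)]
    # Pad with dummy symbols so the symbol count is 1 + k(D-1).
    while (len(heap) - 1) % (D - 1) != 0:
        heap.append((0.0, next(counter), []))
    heapq.heapify(heap)
    depth = [0] * len(p)
    while len(heap) > 1:
        merged_p, merged_syms = 0.0, []
        for _ in range(D):                      # merge the D least likely nodes
            pi, _, syms = heapq.heappop(heap)
            merged_p += pi
            merged_syms += syms
            for s in syms:
                depth[s] += 1                   # every contained symbol moves one digit deeper
        heapq.heappush(heap, (merged_p, next(counter), merged_syms))
    return depth

print(huffman_lengths([0.25, 0.25, 0.2, 0.15, 0.15]))            # Example 5.6.1, binary
print(huffman_lengths([0.25, 0.25, 0.2, 0.15, 0.15], D=3))       # Example 5.6.2, ternary
print(huffman_lengths([0.25, 0.25, 0.2, 0.1, 0.1, 0.1], D=3))    # Example 5.6.3, ternary
```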



Shannon Code

■ Uses codeword lengths of ⌈log (1/p_i)⌉.
■ May be much worse than the optimal code. For example, with p_1 = 1 − 1/1024 and p_2 = 1/1024, ⌈log (1/p_1)⌉ = 1 and ⌈log (1/p_2)⌉ = 10. However, we can use exactly 1 bit for each symbol.
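
A quick check of this example (illustrative, not from the slides): the Shannon lengths are 1 and 10 bits, giving an expected length of about 1.009 bits, while the optimal code {0, 1} uses exactly 1 bit, even though H(X) ≈ 0.011 bits.

```python
import math

p1, p2 = 1 - 1/1024, 1/1024
l1, l2 = math.ceil(-math.log2(p1)), math.ceil(-math.log2(p2))   # 1 and 10

shannon_L = p1 * l1 + p2 * l2            # ≈ 1.0088 bits
optimal_L = p1 * 1 + p2 * 1              # 1 bit per symbol with code {0, 1}
entropy = -(p1 * math.log2(p1) + p2 * math.log2(p2))

print(l1, l2, shannon_L, optimal_L, entropy)
```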



5.8 Optimality of Huffman Codes



Properties of Optimal Codes

Lemma 5.8.1 For any distribution, there exists an optimal instantaneous code that satisfies the following properties:
1. If p_j > p_k, then l_j ≤ l_k.
2. The two longest codewords have the same length.
3. Two of the longest codewords differ only in the last bit and correspond to the two least likely symbols.



Optimality of Huffman codes

Lemma 5.8.2 Let C be a code for the distribution p_1 ≥ p_2 ≥ ··· ≥ p_{K−1} ≥ p_K, and let C′ be a code for the reduced distribution p_1, p_2, ..., p_{K−2}, p_{K−1} + p_K. If C′ is optimal with code assignment

p_1 → w_1, p_2 → w_2, ..., p_{K−1} + p_K → w_{K−1},

then C is also optimal with code assignment

p_1 → w_1, p_2 → w_2, ..., p_{K−1} → w_{K−1}0, p_K → w_{K−1}1.



Optimality of Huffman codes

Proof. The average length of C′ is

L(C′) = p_1 l_1 + p_2 l_2 + ··· + (p_{K−1} + p_K) l_{K−1}.

The average length of C is

L(C) = p_1 l_1 + p_2 l_2 + ··· + p_{K−1}(l_{K−1} + 1) + p_K(l_{K−1} + 1).

We have

L(C) = L(C′) + p_{K−1} + p_K.

That is, we can minimize L(C) by minimizing L(C′), since p_{K−1} + p_K is a constant. ∎

