
ENTROPY & RUN LENGTH CODING
Contents
• What is Entropy Coding?
• Huffman Encoding
• Huffman Encoding Example
• Arithmetic Coding
• Encoding Algorithm for Arithmetic Coding
• Decoding Algorithm for Arithmetic Decoding
• Run Length Encoding
• Question & Answer
• References
What is Entropy Coding?
Entropy coding is a lossless compression scheme.
One of the main types of entropy coding creates and assigns a unique prefix-free code to each unique symbol that occurs in the input.
These entropy encoders then compress data by replacing each fixed-length input symbol with the corresponding variable-length prefix-free output code word.
The length of each code word is approximately proportional to the negative logarithm of the symbol's probability; therefore, the most common symbols use the shortest codes.
According to Shannon's source coding theorem, the optimal code length for a symbol is −log_b(P), where b is the number of symbols used to make the output codes and P is the probability of the input symbol.
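As a quick numeric illustration (our own example, not from the slides), with b = 2 for binary output codes, the MATLAB lines below compute the optimal code lengths for an assumed probability distribution and their probability-weighted average, which is the entropy of the source:

% Optimal code lengths -log2(P) for an assumed set of symbol probabilities
P = [0.5 0.25 0.125 0.125];           % example probabilities
optimal_lengths = -log2(P)            % gives 1 2 3 3 (bits per symbol)
entropy = sum(P .* optimal_lengths)   % gives 1.75 bits/symbol on average

Because every probability here is a power of 1/2, a Huffman code can achieve exactly these lengths; when the probabilities are not powers of 1/2, the optimal lengths become fractional, which is the gap arithmetic coding addresses later.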
Entropy Encoding Techniques-
Huffman Coding
Arithmetic Coding
Huffman Encoding-
For each encoding unit (letter, symbol, or any character), associate a frequency with it.
You can use either a percentage or a probability of occurrence for the encoding unit.
Create a binary tree whose children are the encoding units with the smallest frequencies/probabilities.
The frequency of the new parent node is the sum of the frequencies/probabilities of the leaves beneath it.
Repeat this procedure until all the encoding units are covered in the binary tree (a MATLAB sketch of this construction follows below).
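This greedy construction can be sketched in MATLAB as follows (our own illustration; huffman_sketch is a hypothetical helper, not part of the slides). Each node is represented by the set of symbol indices beneath it, and merging the two lightest nodes prepends one bit to the code of every symbol under them:

% Greedy Huffman construction: repeatedly merge the two lightest nodes,
% prepending '0'/'1' to the codes of the symbols they contain.
function codes = huffman_sketch(freqs)
    n = numel(freqs);
    codes = repmat({''}, 1, n);        % code string per symbol
    groups = num2cell(1:n);            % each node starts as one symbol index
    weights = freqs;
    while numel(groups) > 1
        [~, order] = sort(weights);    % locate the two lightest nodes
        a = order(1); b = order(2);
        for k = groups{a}
            codes{k} = ['0' codes{k}]; % left branch contributes a 0
        end
        for k = groups{b}
            codes{k} = ['1' codes{k}]; % right branch contributes a 1
        end
        groups{a} = [groups{a} groups{b}];    % merge the two nodes
        weights(a) = weights(a) + weights(b);
        groups(b) = []; weights(b) = [];
    end
end

For the frequencies in the example that follows, huffman_sketch([40 20 10 10 20]) returns a valid prefix-free code; the exact code words depend on how ties between equal weights are broken, so they can differ from the ones in the worked example, but any tie-breaking yields the same average code length (2.2 bits per symbol for these frequencies).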
Example, step I
• Assume that the relative frequencies are:
  A: 40
  B: 20
  C: 10
  D: 10
  R: 20
  (I chose simpler numbers than the real frequencies)
• The smallest frequencies are 10 and 10 (C and D), so connect those first
Example, step II
C and D have already been used, and the new node
above them (call it C+D) has value 20
The smallest values are B, C+D, and R, all of which
have value 20
Connect any two of these
Example, step III
The smallest value is R (20), while A and B+C+D both have value 40
Connect R to either of the others
Example, step IV
Connect the final two nodes
Example, step V
• Assign 0 to left branches, 1 to right branches
• Each encoding is a path from the root:
  A = 0
  B = 100
  C = 1010
  D = 1011
  R = 11
• Each path terminates at a leaf
• Do you see why encoded strings are decodable?
Unique prefix property
• A = 0
  B = 100
  C = 1010
  D = 1011
  R = 11
• No bit string is a prefix of any other bit string
• For example, if we added E = 01, then A (0) would be a prefix of E
• Similarly, if we added F = 10, then it would be a prefix of three other encodings (B = 100, C = 1010, and D = 1011)
• The unique prefix property holds because, in a binary tree, a leaf is not on a path to any other node
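Because of this property a decoder can scan the bits left to right and emit a symbol as soon as the bits read so far form a complete code word. A minimal MATLAB sketch (our own, with a made-up test bit string) of this greedy decoding:

% Greedy decoding of a prefix-free code: grow buf until it equals a code word
codes   = {'0', '100', '1010', '1011', '11'};   % A, B, C, D, R from above
letters = 'ABCDR';
bits    = '01001101010010110100110';            % an example encoded bit string
out = ''; buf = '';
for b = bits
    buf = [buf b];
    k = find(strcmp(buf, codes), 1);            % is buf a complete code word?
    if ~isempty(k)
        out = [out letters(k)];
        buf = '';
    end
end
disp(out)                                        % prints ABRACADABRA

The 23-bit example string decodes to the 11-character word ABRACADABRA, about 2.1 bits per character, versus the 3 bits per character a fixed-length code would need for a 5-symbol alphabet.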
Data compression-
• Huffman encoding is a simple example of data compression: representing data in fewer bits than it would otherwise need
• A more sophisticated method is GIF (Graphics Interchange Format) compression, for .gif files
• Another is JPEG (Joint Photographic Experts Group), for .jpg files
• Unlike the others, JPEG is lossy: it loses information
• That is generally OK for photographs (if you don't compress them too much), because decompression adds "fake" data very similar to the original
Arithmetic Coding
Arithmetic Coding-
Huffman coding has been proven the best among methods that assign a whole code word to each symbol.
Yet, since Huffman codes have to be an integral number of bits long, while the entropy of a symbol is almost always a fractional number of bits, the theoretically possible compressed message size cannot be achieved.
For example, if a statistical model assigns a 90% probability to a given character, the optimal code size would be −log2(0.9) ≈ 0.15 bits.
The Huffman coding system would probably assign a 1-bit code to the symbol, which is roughly six times longer than necessary.
Arithmetic coding bypasses the idea of replacing an input symbol with a specific code.
It replaces a stream of input symbols with a single floating-point output number.
Suppose that we want to encode the message "BILL GATES". Each character is assigned a probability and a cumulative range:

Character    Probability    Range
^(space)     1/10           [0.0, 0.1)
A            1/10           [0.1, 0.2)
B            1/10           [0.2, 0.3)
E            1/10           [0.3, 0.4)
G            1/10           [0.4, 0.5)
I            1/10           [0.5, 0.6)
L            2/10           [0.6, 0.8)
S            1/10           [0.8, 0.9)
T            1/10           [0.9, 1.0)
Encoding Algorithm for Arithmetic Coding-
Encoding algorithm for arithmetic coding:

low = 0.0 ; high = 1.0 ;
while not EOF do
    range = high - low ;
    read(c) ;
    high = low + range × high_range(c) ;
    low  = low + range × low_range(c) ;
end do
output(low);
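A MATLAB sketch of this loop for the "BILL GATES" table above (our own illustration; the map variables low_range and high_range are assumptions, not part of the slides). It reproduces the low/high values shown on the following slides, up to floating-point rounding in the last digit:

% Interval-narrowing encoder for the "BILL GATES" example
symbols    = {' ', 'A', 'B', 'E', 'G', 'I', 'L', 'S', 'T'};
low_range  = containers.Map(symbols, [0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.8 0.9]);
high_range = containers.Map(symbols, [0.1 0.2 0.3 0.4 0.5 0.6 0.8 0.9 1.0]);

message = 'BILL GATES';
low = 0.0; high = 1.0;
for c = message                      % read(c) until the end of the message
    range = high - low;
    high  = low + range * high_range(c);
    low   = low + range * low_range(c);
    fprintf('%c   low = %.10f   high = %.10f\n', c, low, high);
end
fprintf('output: %.10f\n', low);     % any number in [low, high) encodes the message

A production arithmetic coder works with scaled integers and renormalization rather than a single double, since a double only carries about 15-16 significant decimal digits and would run out on longer messages.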
To encode the first character, B, properly, the final coded message has to be a number greater than or equal to 0.2 and less than 0.3.
  range = 1.0 - 0.0 = 1.0
  high = 0.0 + 1.0 × 0.3 = 0.3
  low = 0.0 + 1.0 × 0.2 = 0.2
After the first character is encoded, the low end of the range is changed from 0.00 to 0.20 and the high end of the range is changed from 1.00 to 0.30.
The next character to be encoded, the letter I, owns the range 0.50 to 0.60 within the new sub-range of 0.20 to 0.30.
So, the new encoded number will fall somewhere in the 50th to 60th percentile of the currently established range.
Thus, this number is further restricted to the interval 0.25 to 0.26.
Note that any number between 0.25 and 0.26 is a legal encoding of 'BI'. Thus, a number that is best suited for binary representation is selected.
(Condition: the length of the encoded message is known, or an EOF symbol is used.)
[Figure: successive narrowing of the interval for "BILL GATES", from [0.0, 1.0) down to [0.2572167752, 0.2572167756); each column shows the character sub-ranges (space, A, B, E, G, I, L, S, T) rescaled to the current interval.]
After each character of "BILL GATES" is encoded, the low and high ends of the interval are:

Character    Low             High
B            0.2             0.3
I            0.25            0.26
L            0.256           0.258
L            0.2572          0.2576
^(space)     0.25720         0.25724
G            0.257216        0.257220
A            0.2572164       0.2572168
T            0.25721676      0.2572168
E            0.257216772     0.257216776
S            0.2572167752    0.2572167756
So, the final value 0.2572167752 (or any value between 0.2572167752 and 0.2572167756, if the length of the encoded message is known at the decoding end) uniquely encodes the message 'BILL GATES'.
Arithmetic Coding (Decoding)
Decoding is the inverse process.
Since 0.2572167752 falls between 0.2 and 0.3, the first character must be 'B'.
Remove the effect of 'B' from 0.2572167752 by first subtracting the low value of B, 0.2, giving 0.0572167752.
Then divide by the width of the range of 'B', 0.1. This gives a value of 0.572167752.
Then determine which character's range that value falls in; it is in the range of the next letter, 'I'.
The process repeats until 0 is reached or the known length of the message has been decoded.
Arithmetic Decoding Algorithm-
Decoding algorithm:

r = input_number
repeat
    search c such that r falls in its range
    output(c) ;
    r = r - low_range(c) ;
    r = r ÷ (high_range(c) - low_range(c)) ;
until EOF or the length of the message is reached
r               c          Low    High   Range
0.2572167752    B          0.2    0.3    0.1
0.572167752     I          0.5    0.6    0.1
0.72167752      L          0.6    0.8    0.2
0.6083876       L          0.6    0.8    0.2
0.041938        ^(space)   0.0    0.1    0.1
0.41938         G          0.4    0.5    0.1
0.1938          A          0.1    0.2    0.1
0.938           T          0.9    1.0    0.1
0.38            E          0.3    0.4    0.1
0.8             S          0.8    0.9    0.1
0.0
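A matching MATLAB decoding sketch (again our own illustration, reusing the low_range/high_range maps from the encoder sketch) reproduces the trace above. It starts from 0.2572167754, a value strictly inside the final interval [0.2572167752, 0.2572167756); using the exact lower endpoint would land directly on sub-range boundaries, which floating-point rounding could push to the wrong side:

% Inverse process: find the sub-range containing r, emit that character,
% then remove its effect by subtracting low_range and dividing by the width
symbols    = {' ', 'A', 'B', 'E', 'G', 'I', 'L', 'S', 'T'};
low_range  = containers.Map(symbols, [0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.8 0.9]);
high_range = containers.Map(symbols, [0.1 0.2 0.3 0.4 0.5 0.6 0.8 0.9 1.0]);

r = 0.2572167754;                    % a value inside the final interval
decoded = '';
for n = 1:10                         % the message length (10) is known
    for k = 1:numel(symbols)         % search c such that r falls in its range
        c = symbols{k};
        if r >= low_range(c) && r < high_range(c)
            break;
        end
    end
    decoded = [decoded c];
    r = (r - low_range(c)) / (high_range(c) - low_range(c));
end
disp(decoded)                        % prints BILL GATES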
Arithmetic Coding Summary
In summary, the encoding process is simply one of
narrowing the range of possible numbers with every
new symbol.
The new range is proportional to the predefined
probability attached to that symbol.
Decoding is the inverse procedure, in which the range
is expanded in proportion to the probability of each
symbol as it is extracted.
Continue…………………..
Coding rate approaches high-order entropy
theoretically.
Not so popular as Huffman coding because × , ÷ are
needed.

31
Run Length Encoder/Decoder
What is RLE?
Compression technique
Represents data using value and run-length pairs
Run length is defined as the number of consecutive equal values
e.g. 1110011111 encodes to 1 3 0 2 1 5, i.e. the (value, run length) pairs (1,3), (0,2), (1,5)
Advantage of RLE-
Useful for compressing data that contains repeated values
e.g. the output from a filter, where many consecutive values are 0
Very simple compared with other compression techniques
Reversible (lossless) compression; decompression is just as easy
Applications-
I-frame compression in video coding
RLE Effectiveness-
Compression effectiveness depends on the input
Must have consecutive runs of equal values in order to maximize compression
Best case: all values the same; any length can be represented using just two values
Worst case: no repeating values; the compressed data is twice the length of the original!
Should only be used in situations where we know for sure that the data has repeating values
Encoder - Algorithm
• Start at the first element of the input
• Examine the next value
  If it is the same as the previous value
    Keep a counter of consecutive values
    Keep examining the next value until a different value or the end of the input is reached, then output the value followed by the counter. Repeat.
  If it is not the same as the previous value
    Output the previous value followed by 1 (the run length). Repeat.
Encoder – Matlab Code
% Run Length Encoder
% EE113D Project

function encoded = RLE_encode(input)

my_size = size(input);
length = my_size(2);

run_length = 1;
encoded = [];

for i=2:length
if input(i) == input(i-1)
run_length = run_length + 1;
else
encoded = [encoded input(i-1) run_length];
run_length = 1;
end
end

if length > 1
% Add last value and run length to output
encoded = [encoded input(i) run_length];
else
% Special case if input is of length 1
encoded = [input(1) 1];
end
Encoder – Matlab Results
>> RLE_encode([1 0 0 0 0 2 2 2 1 1 3])
ans =
1 1 0 4 2 3 1 2 3 1

>> RLE_encode([0 0 0 0 0 0 0 0 0 0 0])
ans =
0 11

>> RLE_encode([0 1 2 3 4 5 6 7 8 9])
ans =
0 1 1 1 2 1 3 1 4 1 5 1 6 1 7 1 8 1 9 1
Encoder
• Input comes from a separate .asm file
  In the form of a vector
  e.g. 'array .word 4,5,5,2,7,3,6,9,9,10,10,10,10,10,10,0,0'
• Output is declared as data memory space
  Examine memory to get the output
  Originally declared to be all -1
• Immediate problem
  Output size is not known until run-time (it depends on the input size as well as the input pattern)
  Cannot initialize a variable-size array
Encoder
Solution (a small sketch of this buffering scheme follows)
• Limit user input to a preset length (16)
• Initialize the output to the worst case (double the input length: 32)
• Initialize the output to all -1's (we're only handling positive numbers and 0 as inputs)
• The output ends when -1 first appears or when the length of the output equals the worst case
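A small MATLAB sketch of this buffering scheme (our own; the variable names are made up), reusing the RLE_encode function from the previous slides:

% Fixed-size buffers: input limited to 16 samples, output preset to the
% worst case of 32 entries and filled with -1 so the end is detectable
input_vec = [4 5 5 2 7 3 6 9 9 10 10 10 10 10 10 0];   % at most 16 samples
out_buf = -ones(1, 32);                                 % 2 * 16 slots, all -1
enc = RLE_encode(input_vec);                            % encoder defined above
out_buf(1:numel(enc)) = enc;

% The valid output ends at the first -1, or fills the whole buffer
stop = find(out_buf == -1, 1);
if isempty(stop)
    stop = numel(out_buf) + 1;
end
valid_output = out_buf(1:stop-1);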
Decoder – Matlab Code
% Run Length Decoder
% EE113D Project
% The input to this function should be the output from Run Length Encoder,
% which means it assumes even number of elements in the input. The first
% element is a value followed by the run count. Thus all odd elements in
% the input are assumed the values and even elements the run counts.
%
function decoded = RLE_decode(encoded)

my_size = size(encoded);
length = my_size(2);

index = 1;
decoded = [];
% iterate through the input
while (index <= length)
% get value which is followed by the run count
value = encoded(index);
run_length = encoded(index + 1);
for i=1:run_length
% loop adding 'value' to output 'run_length' times
decoded = [decoded value];
end
% put index at next value element (odd element)
index = index + 2;
end
Decoder – Matlab Results
>> RLE_decode([0 12])
ans =
0 0 0 0 0 0 0 0 0 0 0 0

>> RLE_decode([0 1 1 1 2 1 3 1 4 1 5 1])
ans =
0 1 2 3 4 5

>> RLE_decode(RLE_encode([0 0 3 1 4 4 5 6 10]))
ans =
0 0 3 1 4 4 5 6 10
Reference:-
1. http://en.wikipedia.org/wiki/Entropy_encoding
2. www.cis.upenn.edu/~matuszek/cit594-2002/Slides/huffman.ppt
3. is.cs.nthu.edu.tw/course/2012Spring/ISA530100/chapt06.ppt
4. ihoque.bol.ucla.edu/presentation.ppt