Beruflich Dokumente
Kultur Dokumente
Shibuya
Advanced Algorithms:
Text Compression Algorithms
Tetsuo Shibuya
A 32%
C 22%
Model
G 15%
ATGCCGTAATAACGATCAATACTAGCC...
T 31% Output
Information and Entropy
Advanced Algorithms / T. Shibuya
Property of information
Non-negative Japan wins the world cup: 0.1%
Spain wins the world cup: 15%
Larger if the probability is smaller
Which is surprise?
Information values should be addable
I(p·q) = I(p) + I(q) (in case the two events are independent to each other)
Information
I(p) = - log2(p)
Unit: bit entropy
Statistical methods
Based on the statistics of characters/words in the target
text
Huffman code
Tunstall
Golomb
Arithmetic code
PPM
...
Dictionary-based methods
Based on some frequent word dictionary
LZ77, LZ78, LZW
Block sorting
Sequitur
...
Huffman Code (1)
Advanced Algorithms / T. Shibuya
Idea
Smaller code for more frequent characters
Model Code
A 12.5% 100
C 12.5% 101 Expected to be 1.75n bit!
G 25% 11 (It's ideal!)
T 50% 0
Alphabet size = 4
log (Alphabet size) = 2bit
entropy = 1.75bit
Ideal Case
Huffman Code (2)
Advanced Algorithms / T. Shibuya
F E A
57
49 47 44
0 1
J B frequency
36 21
A pair of characters with lowest frequencies
Huffman Code (3)
Advanced Algorithms / T. Shibuya
Properties
Prefix code (Instantaneous code)
A code is not a prefix of other code
Optimal algorithm for coding characters with 0/1
Entropy:
H(X)≦L<H(X)+1
Extension
Consider a set of consecutive k characters as ONE
character
H(X)≦L<H(X)+1/k AAAA 0.47%
AAAC 0.37%
AAAG 0.35%
AAAT 0.41%
AACA 0.32%
...
Adaptive Coding (Feller-Gallager-Knuth)
Advanced Algorithms / T. Shibuya
0
Correct the Correct the
new tree tree
a b a b
000 101
a b a b
011
a b
001 010
Run-length coding
0101000110000010100000000100100010001
1 1 3 0 5 1 8 2 3 3
Golomb coding
Text Frequency
00000 121
0001 45 Huffman Coding
001 21
01 15
1 10
Arithmetic Coding (1)
Advanced Algorithms / T. Shibuya
0.0 1.0
01101
Arithmetic Coding (2)
Advanced Algorithms / T. Shibuya
Idea
Any text = subregion of [0, 1).
Divide [0,1) according to the text
More frequent character = larger region (in proportion to the
probability)
Remember a floating number (with the shortest 0-1
sequence) in the region and the number of characters in the
original text
a b
aa ab
aba abb
abaa abab
0111
011
0.0 1.0
Decoding Arithmetic Codes
Advanced Algorithms / T. Shibuya
Three steps
Comparison of the real number with thresholds
Output a character
Scaling the target region to [0.1)
a b
aa ab
aba abb
abaa abab
0111
011
0.0 1.0
Adaptive Arithmetic Coding
Advanced Algorithms / T. Shibuya
A C G T
1 2 1 1
1 2 2 1
1 3 2 1
PPM (Prediction by Partial Matching)
Advanced Algorithms / T. Shibuya
Frequency Table
JBIG1: Binary image compression
Advanced Algorithms / T. Shibuya
1 2 3
4 5 6 7 8
9 10
Arithmetic Coding
LZ77 (Ziv-Lempel '77)
Advanced Algorithms / T. Shibuya
Dictionary-based compression
Coding by previously appearing words
The position and the length
Computation
Ideally, use suffix trees/arrays
とてもまともなとまとももともとまともなとまとではなかった
4,2
12,2
4,7
+Huffman coding→LZH,zip,PNG
+Shanon-Fano coding→gzip
LZ78 (Ziv-Lempel '78)
Advanced Algorithms / T. Shibuya
とてもまともなとまとももともとまともなとまとではなかった
4, も 6, ま 3, も 4, と 6, と 8, な 5, と 9, か
LZW (Welch)
Advanced Algorithms / T. Shibuya
とてもまともなとまとももともとまともなとまとではなかった
1 2 3 4 1 3 5 1 14 3 3 15 18 15 17 14 6 7 5 9 10 8
11 13 15 17 19 21 23 25 27 29 31
12 14 16 18 20 22 24 26 28 30
Implicit table
Four steps
Divide the text into blocks, and do the following for each block
So that we can decode faster
BWT (Burrows-Wheeler Transform)
Related to the suffix arrays
MTF(move to front) coding
Huffman coding or arithmetic coding
0: mississippi$ 10: i$
1: ississippi$ 7: ippi$
2: ssissippi$ 4: issippi$
3: sissippi$ 1: ississippi$
4: issippi$ 0: mississippi$
cf. 5: ssippi$ 9: pi$
6: sippi$ Sort 8: ppi$
7: ippi$ 6: sippi$
8: ppi$ 3: sissippi$
9: pi$ 5: ssippi$
10: i$ 2: ssissippi$
Suffix Array
BWT (Burrows-Wheeler Transform)
Advanced Algorithms / T. Shibuya
BWT SA
10: i 11: $mississippi
9: p 10: i$mississipp
6: s 7: ippi$mississ
3: s 4: issippi$miss
0: m 1: ississippi$m
11: $ 0: mississippi$
mississippi$ 8: p 9: pi$mississip
decode 7: i 8: ppi$mississi
5: s 6: sippi$missis
2: s 3: sissippi$mis
4: i 5: ssippi$missi
1: s 2: ssissippi$mi
BWT Decoding
Advanced Algorithms / T. Shibuya
BWT
i $ i$ $m i$m $mississippi
p i pi i$ pi$ i$mississipp
s i si ip sip ippi$mississ
s i si is sis issippi$miss
m i mi is mis ississippi$m
$ m $m mi $mi mississippi$
p p pp pi ppi pi$mississip
Repeat!
i Sort p add ip Radix pp add ipp ppi$mississi
s s BWT ss Sort si BWT ssi sippi$missis
s s ss si ssi sissippi$mis
i s is ss iss ssippi$missi
i s is ss iss ssissippi$mi
Still searchable!
Compression ratio is in proportion to the entropy!
0: mississippi$ 10: i$
1: ississippi$ 7: ippi$
2: ssissippi$ 4: issippi$
3: sissippi$ 1: ississippi$
4: issippi$ 0: mississippi$
5: ssippi$ 9: pi$
6: sippi$ Sort 8: ppi$
7: ippi$ 6: sippi$
8: ppi$ 3: sissippi$
9: pi$ 5: ssippi$
10: i$ 2: ssissippi$
Suffix Array
Ψ (Psi) Array
Advanced Algorithms / T. Shibuya
Remember
Which is the longest suffix
Number of each character in the text
Index / Suffix array / suffixes Ψ
0:11: $
1:10: i$ 0
2: 7: ippi$ 7
i 3: 4: issippi$ 10
4: 1: ississippi$ 11
m 5: 0: mississippi$ 4 Start from here
6: 9: pi$ 1
p 7: 8: ppi$ 6
8: 6: sippi$ 2
s 9: 3: sissippi$ 3
10:5: ssippi$ 8
11:2: ssissippi$ 9
Compressing Ψ (1)
Advanced Algorithms / T. Shibuya
Original: 3 4 7 8 13 24 26 32 37 45 54 61 70 ...
Difference: - 1 4 1 5 9 2 6 5 8 9 7 9 ...
Searching on Ψ
Advanced Algorithms / T. Shibuya